Geneformer
==========

Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.

*   See [our manuscript](https://rdcu.be/ddrx0) for details.
*   See [geneformer.readthedocs.io](https://geneformer.readthedocs.io) for documentation.

Model Description
=================

Geneformer is a foundation transformer model pretrained on [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), a pretraining corpus composed of ~30 million single cell transcriptomes from a broad range of human tissues. We excluded cells with high mutational burdens (e.g. malignant cells and immortalized cell lines) that could lead to substantial network rewiring without companion genome sequencing to facilitate interpretation. Each single cell's transcriptome is presented to the model as a rank value encoding where genes are ranked by their expression in that cell normalized by their expression across the entire Genecorpus-30M. The rank value encoding provides a nonparametric representation of that cell's transcriptome and takes advantage of the many observations of each gene's expression across Genecorpus-30M to prioritize genes that distinguish cell state. Specifically, this method will deprioritize ubiquitously highly expressed housekeeping genes by normalizing them to a lower rank. Conversely, genes such as transcription factors that may be lowly expressed when they are expressed but highly distinguish cell state will move to a higher rank within the encoding. Furthermore, this rank-based approach may be more robust against technical artifacts that may systematically bias the absolute transcript counts, while the overall relative ranking of genes within each cell remains more stable.
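The rank value encoding described above can be illustrated with a minimal sketch. This is not the repository's tokenizer: the function name `rank_value_encode`, the per-gene `corpus_scale` factors, and the toy numbers are all assumptions made for illustration. The sketch only shows the idea of dividing each gene's expression in a cell by a corpus-wide factor for that gene and then ordering genes by the scaled value.

```python
import numpy as np

def rank_value_encode(counts, gene_ids, corpus_scale):
    """Return gene IDs ordered by corpus-normalized expression (highest first).

    counts       : expression of each gene in one cell
    gene_ids     : gene identifiers, same order as counts
    corpus_scale : hypothetical per-gene scaling factor summarizing that
                   gene's expression across the whole corpus
    """
    counts = np.asarray(counts, dtype=float)
    corpus_scale = np.asarray(corpus_scale, dtype=float)

    # Normalize each gene by its corpus-wide factor, so genes that are
    # highly expressed everywhere are pushed down the ranking.
    scaled = counts / corpus_scale

    # Sort in descending order; a stable sort keeps ties in input order.
    order = np.argsort(-scaled, kind="stable")

    # Genes that are not detected in this cell are left out of the encoding.
    return [gene_ids[i] for i in order if scaled[i] > 0]


# Toy example: a housekeeping-like gene with high counts everywhere ends up
# ranked below a lowly expressed but more distinguishing gene.
genes = ["ACTB", "GATA4", "TTN"]
encoding = rank_value_encode([100, 5, 0], genes, corpus_scale=[200, 2, 10])
print(encoding)
```

Because only the ordering is kept, a uniform rescaling of all counts in a cell leaves the encoding unchanged, which is the robustness property the paragraph above refers to.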

The rank value encoding of each single cell's transcriptome then proceeds through six transformer encoder units. Pretraining was accomplished using a masked learning objective where 15% of the genes within each transcriptome were masked and the model was trained to predict which gene should be within each masked position in that specific cell state using the context of the remaining unmasked genes. A major strength of this approach is that it is entirely self-supervised and can be accomplished on completely unlabeled data, which allows the inclusion of large amounts of training data without being restricted to samples with accompanying labels.
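As a rough illustration of the masked learning objective, the sketch below masks 15% of the positions in a tokenized transcriptome and builds the matching prediction targets. It is a simplified sketch under stated assumptions, not the repository's pretraining collator: the function name `mask_tokens`, the mask token ID, and the `-100` ignore value (a common convention for positions that should not contribute to the loss) are assumptions.

```python
import numpy as np

def mask_tokens(token_ids, mask_token_id, mask_fraction=0.15,
                ignore_index=-100, seed=0):
    """Mask a fraction of positions and return (model inputs, labels).

    Labels hold the original token at masked positions and `ignore_index`
    everywhere else, so only masked positions are scored.
    """
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)

    # Choose which positions to mask, without replacement.
    n_mask = max(1, int(round(mask_fraction * token_ids.size)))
    positions = rng.choice(token_ids.size, size=n_mask, replace=False)

    inputs = token_ids.copy()
    labels = np.full_like(token_ids, ignore_index)

    labels[positions] = token_ids[positions]  # what the model must predict
    inputs[positions] = mask_token_id         # what the model actually sees
    return inputs, labels


# Toy transcriptome of 20 gene tokens; token ID 1 is assumed to be the mask.
tokens = np.arange(10, 30)
inputs, labels = mask_tokens(tokens, mask_token_id=1)
print(inputs)
print(labels)
```

No labels beyond the transcriptome itself are needed to build these targets, which is why the objective can be applied to completely unlabeled data.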

We detail applications and results in [our manuscript](https://rdcu.be/ddrx0).

During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model's attention weights in a completely self-supervised manner. With both zero-shot learning and fine-tuning with limited task-specific data, Geneformer consistently boosted predictive accuracy in a diverse panel of downstream tasks relevant to chromatin and network dynamics. In silico perturbation with zero-shot learning identified a novel transcription factor in cardiomyocytes that we experimentally validated to be critical to their ability to generate contractile force. In silico treatment with limited patient data revealed candidate therapeutic targets for cardiomyopathy that we experimentally validated to significantly improve the ability of cardiomyocytes to generate contractile force in an iPSC model of the disease. Overall, Geneformer represents a foundational deep learning model pretrained on ~30 million human single cell transcriptomes to gain a fundamental understanding of gene network dynamics that can now be democratized to a vast array of downstream tasks to accelerate discovery of key network regulators and candidate therapeutic targets.

In [our manuscript](https://rdcu.be/ddrx0), we report results for the 6-layer Geneformer model pretrained on Genecorpus-30M. We additionally provide within this repository a 12-layer Geneformer model, scaled up with retained width:depth aspect ratio, also pretrained on Genecorpus-30M.

Both the 6-layer and 12-layer Geneformer models were pretrained in June 2021.

Application
===========

The pretrained Geneformer model can be used directly for zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning towards the relevant downstream task, such as gene or cell state classification.

Example applications demonstrated in [our manuscript](https://rdcu.be/ddrx0) include:

_Fine-tuning_:

*   transcription factor dosage sensitivity
*   chromatin dynamics (bivalently marked promoters)
*   transcription factor regulatory range
*   gene network centrality
*   transcription factor targets
*   cell type annotation
*   batch integration
*   cell state classification across differentiation
*   disease classification
*   in silico perturbation to determine disease-driving genes
*   in silico treatment to determine candidate therapeutic targets

_Zero-shot learning_:

*   batch integration
*   gene context specificity
*   in silico reprogramming
*   in silico differentiation
*   in silico perturbation to determine impact on cell state
*   in silico perturbation to determine transcription factor targets
*   in silico perturbation to determine transcription factor cooperativity
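
To give a feel for what an in silico perturbation can look like at the level of the model input, the sketch below edits a rank value encoding directly. How perturbations are represented here is an assumption made for illustration (a deletion drops the gene from the ranked list, an overexpression moves it to the top rank), and the function names are hypothetical. The actual procedure, including how the effect on cell state is measured from the model's embeddings, is documented in the manuscript and the in silico perturbation example.

```python
from typing import List

def delete_gene(rank_encoding: List[str], gene: str) -> List[str]:
    """Assumed deletion: remove the gene from the ranked gene list."""
    return [g for g in rank_encoding if g != gene]

def overexpress_gene(rank_encoding: List[str], gene: str) -> List[str]:
    """Assumed overexpression: move (or add) the gene to the top rank."""
    return [gene] + [g for g in rank_encoding if g != gene]


# Toy ranked transcriptome; the perturbed lists would then be passed to the
# model and compared against the unperturbed cell (not shown here).
cell = ["GATA4", "TTN", "ACTB"]
print(delete_gene(cell, "TTN"))
print(overexpress_gene(cell, "ACTB"))
```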

Installation
============

In addition to the pretrained model, this repository contains functions for tokenizing and collating data specific to single cell transcriptomics, pretraining the model, fine-tuning the model, extracting and plotting cell embeddings, and performing in silico perturbation with either the pretrained or fine-tuned models. To install:

    # Make sure you have git-lfs installed (https://git-lfs.com)
    git lfs install
    git clone https://huggingface.co/ctheodoris/Geneformer
    cd Geneformer
    pip install .
    

For usage, see [examples](https://huggingface.co/ctheodoris/Geneformer/tree/main/examples) for:

*   tokenizing transcriptomes
*   pretraining
*   hyperparameter tuning
*   fine-tuning
*   extracting and plotting cell embeddings
*   in silico perturbation

Please note that the fine-tuning examples are meant to be generally applicable, and the input datasets and labels will vary depending on the downstream task. Example input files for a few of the downstream tasks demonstrated in the manuscript are located within the [example\_input\_files directory](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files) in the dataset repository, but these represent only a few example fine-tuning applications.

Please note that GPU resources are required for efficient usage of Geneformer. Additionally, we strongly recommend tuning hyperparameters for each downstream fine-tuning application as this can significantly boost predictive potential in the downstream task (e.g. max learning rate, learning schedule, number of layers to freeze, etc.).
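
To make two of the hyperparameters mentioned above concrete (max learning rate and learning schedule), here is a minimal sketch of one common schedule: a linear warmup to the max learning rate followed by a linear decay to zero. The schedule shape, the function name `linear_warmup_decay`, and the example values are assumptions for illustration and are not recommended settings for Geneformer.

```python
def linear_warmup_decay(step, max_lr, warmup_steps, total_steps):
    """Learning rate at `step` for a linear warmup then linear decay."""
    if warmup_steps <= 0 or total_steps <= warmup_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")

    if step < warmup_steps:
        # Ramp up from 0 to max_lr over the warmup period.
        return max_lr * (step / warmup_steps)

    # Decay from max_lr back to 0 over the remaining steps.
    remaining = max(total_steps - step, 0)
    return max_lr * (remaining / (total_steps - warmup_steps))


# Example values only: inspect the schedule at a few training steps.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, max_lr=1e-3,
                                    warmup_steps=100, total_steps=1000))
```

In a tuning run, `max_lr` and `warmup_steps` would be among the values searched over, alongside choices such as the number of layers to freeze.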

## Model overview

`Geneformer` is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes. The model was developed by [ctheodoris](https://aimodels.fyi/creators/huggingFace/ctheodoris) to enable context-aware predictions in settings with limited data in network biology. `Geneformer` uses a rank value encoding to represent each cell's transcriptome, which deprioritizes ubiquitously highly expressed genes and prioritizes genes that distinguish cell state. Because pretraining is entirely self-supervised, the model gains a fundamental understanding of network dynamics from unlabeled data.

## Model inputs and outputs

`Geneformer` takes as input the rank value encoding of a single cell's transcriptome, and outputs predictions for masked genes within that cell state, using the context of the remaining unmasked genes. This allows the model to learn the relationships between genes and their expression patterns across different cell types and states.

### Inputs
- Rank value encoding of a single cell's transcriptome

### Outputs
- Predicted gene identities for masked positions in the input transcriptome

## Capabilities

`Geneformer` has gained a deep understanding of biological network dynamics through its self-supervised pretraining on a large corpus of single cell data. This allows the model to make context-aware predictions that can be useful for a variety of network biology applications, even in settings with limited labeled data.

## What can I use it for?

The [Geneformer](https://geneformer.readthedocs.io) model can be fine-tuned for various tasks in network biology, such as gene function prediction, cell type classification, and drug target identification. By leveraging the model's inherent understanding of gene expression patterns and their relationships, researchers can develop powerful predictive models even when working with limited labeled data. Additionally, the [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M) pretraining dataset could be a valuable resource for other researchers working on similar problems in the field of single cell biology.

## Things to try

One interesting aspect of `Geneformer` is its use of a rank value encoding to represent each cell's transcriptome. This nonparametric approach may be more robust to technical artifacts that can bias the absolute transcript counts, while still preserving the relative ranking of genes that distinguish cell state. Researchers could explore how this rank-based representation affects the model's performance and interpretability compared to more traditional approaches.