Geneformer
ctheodoris
Geneformer is a foundation transformer model pretrained on Genecorpus-30M, a large-scale corpus of ~30 million single-cell transcriptomes. The model was developed by ctheodoris to enable context-aware predictions in network biology settings with limited data. Geneformer uses a rank value encoding to represent each cell's transcriptome, which deprioritizes ubiquitously highly expressed genes and prioritizes genes that distinguish cell state. Because pretraining is entirely self-supervised, the model learns network dynamics from unlabeled data.
Model inputs and outputs
Geneformer takes as input the rank value encoding of a single cell's transcriptome, and outputs predictions for masked genes within that cell state, using the context of the remaining unmasked genes. This allows the model to learn the relationships between genes and their expression patterns across different cell types and states.
Inputs
Rank value encoding of a single cell's transcriptome
Outputs
Predicted gene identities for masked positions in the input transcriptome
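The masked-gene objective described above can be illustrated with a small sketch. It is not the Geneformer implementation: the function name `mask_rank_encoding`, the `<mask>` token string, the placeholder gene names, and the 15% default masking fraction are all assumptions made for illustration. It only shows how an input with hidden positions and the corresponding prediction targets could be built from a rank-encoded cell.

```python
import numpy as np

MASK_TOKEN = "<mask>"  # hypothetical placeholder token, not taken from the model's vocabulary


def mask_rank_encoding(tokens, mask_fraction=0.15, seed=0):
    """Hide a random subset of positions in a rank-encoded cell.

    Returns the masked sequence and a dict mapping each masked
    position to the gene the model would be asked to predict there.
    """
    rng = np.random.default_rng(seed)
    # Always mask at least one position so there is something to predict.
    n_mask = max(1, round(mask_fraction * len(tokens)))
    positions = rng.choice(len(tokens), size=n_mask, replace=False)

    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[int(pos)] = tokens[int(pos)]
        masked[int(pos)] = MASK_TOKEN
    return masked, targets


# Placeholder gene names standing in for one cell's rank value encoding.
tokens = [f"GENE{i}" for i in range(20)]
masked, targets = mask_rank_encoding(tokens)
print(masked)
print(targets)
```

The unmasked genes stay in their ranked positions, so they form the context from which the masked gene identities are predicted.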
Capabilities
Through self-supervised pretraining on a large corpus of single-cell data, Geneformer learns representations of biological network dynamics without requiring labels. These representations support context-aware predictions across a variety of network biology applications, even in settings with limited labeled data.
What can I use it for?
The Geneformer model can be fine-tuned for various tasks in network biology, such as gene function prediction, cell type classification, and drug target identification. Because the pretrained model already encodes gene expression patterns and their relationships, researchers can build predictive models even when labeled data is limited. Additionally, the Genecorpus-30M pretraining dataset could be a valuable resource for other researchers working on similar problems in single-cell biology.
Things to try
One interesting aspect of Geneformer is its use of a rank value encoding to represent each cell's transcriptome. This nonparametric approach may be more robust to technical artifacts that can bias the absolute transcript counts, while still preserving the relative ranking of genes that distinguish cell state. Researchers could explore how this rank-based representation affects the model's performance and interpretability compared to more traditional approaches.
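As a starting point for that kind of exploration, here is a minimal sketch of a rank-based encoding. It makes several assumptions: counts are scaled by the cell's total, then divided by a per-gene normalization factor computed across a corpus, and undetected genes are dropped. The function name `rank_value_encode` and the factor values are hypothetical and chosen only to make the effect visible; this is not the model's actual tokenizer.

```python
import numpy as np


def rank_value_encode(counts, gene_ids, gene_norm_factors):
    """Order a cell's detected genes from highest to lowest normalized expression."""
    counts = np.asarray(counts, dtype=float)
    factors = np.asarray(gene_norm_factors, dtype=float)

    # Scale by the cell's total counts so sequencing depth cancels out.
    per_cell = counts / counts.sum()
    # Divide by a corpus-wide per-gene factor (an assumption of this sketch)
    # so genes that are highly expressed everywhere are pushed down the ranking.
    scaled = per_cell / factors

    # Keep only detected genes, then sort them in descending order.
    detected = np.flatnonzero(counts > 0)
    order = detected[np.argsort(-scaled[detected], kind="stable")]
    return [gene_ids[i] for i in order]


gene_ids = ["ACTB", "GAPDH", "GATA4", "TNNT2", "MYH6"]
counts = [900, 500, 12, 40, 0]            # made-up raw counts for one cell
factors = [800.0, 450.0, 2.0, 10.0, 5.0]  # hypothetical per-gene factors

encoding = rank_value_encode(counts, gene_ids, factors)
print(encoding)

# Tripling every count (a deeper-sequenced copy of the same cell)
# leaves the ranking unchanged.
deeper = rank_value_encode([c * 3 for c in counts], gene_ids, factors)
print(deeper == encoding)
```

In this toy cell the housekeeping genes ACTB and GAPDH have the largest raw counts but fall to the end of the encoding, while GATA4 and TNNT2 move to the front. The ranking also does not change when all counts are multiplied by a constant, which is one way a rank-based representation can be less sensitive to technical variation in absolute counts.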