Geneformer

Maintainer: ctheodoris

Total Score

158

Last updated 5/28/2024


Property       Value
Model Link     View on HuggingFace
API Spec       View on HuggingFace
Github Link    No Github link provided
Paper Link     No paper link provided

Model overview

Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes. The model was developed by ctheodoris to enable context-aware predictions in network biology settings with limited data. Geneformer uses a rank value encoding to represent each cell's transcriptome, which deprioritizes ubiquitously highly-expressed genes and prioritizes genes that distinguish cell state. Pretraining on this representation with a self-supervised objective allows the model to build a fundamental understanding of network dynamics without any labeled data.
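The rank value encoding itself is simple to illustrate. Below is a minimal NumPy sketch of the idea; the gene names, counts, and corpus-wide normalization factors are invented for illustration, and the real encoding is produced by the tokenizer distributed with the model rather than by this code.

```python
import numpy as np

# Toy data: raw transcript counts for one cell (hypothetical genes and values).
genes = np.array(["GAPDH", "ACTB", "PAX6", "SOX2", "NEUROD1"])
counts = np.array([900.0, 850.0, 40.0, 35.0, 12.0])

# Hypothetical corpus-wide normalization factors (e.g. each gene's nonzero
# median expression across the pretraining corpus). Ubiquitously high genes
# like GAPDH get large factors, so dividing by them deprioritizes those genes.
corpus_norm = np.array([1000.0, 950.0, 10.0, 8.0, 5.0])

# Normalize within-cell expression by the corpus factor, then order genes
# from highest to lowest normalized value: this is the rank value encoding.
normalized = counts / corpus_norm
order = np.argsort(-normalized)
rank_encoding = genes[order]

print(rank_encoding)  # ['SOX2' 'PAX6' 'NEUROD1' 'GAPDH' 'ACTB']
```

In this toy example the cell-state-specific genes outrank the housekeeping genes even though their raw counts are much lower, which is the behavior the encoding is designed to produce.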

Model inputs and outputs

Geneformer takes as input the rank value encoding of a single cell's transcriptome, and outputs predictions for masked genes within that cell state, using the context of the remaining unmasked genes. This allows the model to learn the relationships between genes and their expression patterns across different cell types and states.

Inputs

  • Rank value encoding of a single cell's transcriptome

Outputs

  • Predicted gene identities for masked positions in the input transcriptome
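To make the masked-prediction setup concrete, here is a hedged sketch that treats the published checkpoint as a standard BERT masked language model loaded through the Transformers library; the token IDs and the mask-token ID below are placeholders, since real inputs come from the model's own transcriptome tokenizer.

```python
import torch
from transformers import AutoModelForMaskedLM

# Assumption: the published checkpoint loads as a standard BERT masked-LM
# (depending on the repository layout you may need to point at a subfolder).
model = AutoModelForMaskedLM.from_pretrained("ctheodoris/Geneformer")
model.eval()

# Placeholder input: in practice the token IDs come from the model's own
# transcriptome tokenizer, where each ID corresponds to a gene and position
# reflects that gene's rank in the cell's rank value encoding.
input_ids = torch.tensor([[5, 17, 42, 103, 7, 256]])
masked = input_ids.clone()
masked[0, 2] = model.config.vocab_size - 1  # hypothetical mask-token ID

with torch.no_grad():
    logits = model(input_ids=masked).logits  # (batch, seq_len, vocab_size)

# The model's prediction at the masked position is a distribution over genes.
print(logits[0, 2].argmax().item())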

Capabilities

Geneformer has gained a deep understanding of biological network dynamics through its self-supervised pretraining on a large corpus of single cell data. This allows the model to make context-aware predictions that can be useful for a variety of network biology applications, even in settings with limited labeled data.

What can I use it for?

The Geneformer model can be fine-tuned for various tasks in network biology, such as gene function prediction, cell type classification, and drug target identification. By leveraging the model's inherent understanding of gene expression patterns and their relationships, researchers can develop powerful predictive models even when working with limited labeled data. Additionally, the Genecorpus-30M pretraining dataset could be a valuable resource for other researchers working on similar problems in the field of single cell biology.
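As a sketch of what fine-tuning might look like, the snippet below assumes the pretrained weights can initialize a standard BERT encoder with a fresh classification head, for example for cell type classification; the number of labels and the token IDs are hypothetical. From here, the usual Hugging Face Trainer loop on a labeled, tokenized dataset would complete the fine-tuning.

```python
import torch
from transformers import BertForSequenceClassification

# Assumption: the pretrained Geneformer weights initialize a standard BERT
# encoder, with a fresh classification head added for a downstream task such
# as cell type classification. The label count here is hypothetical.
model = BertForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer", num_labels=8
)

# Placeholder batch of rank-value encoded cells; real token IDs come from the
# model's transcriptome tokenizer, and labels from annotated cell metadata.
input_ids = torch.tensor([[5, 17, 42, 103, 7, 256],
                          [9, 12, 88, 301, 4, 110]])
labels = torch.tensor([0, 3])

outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss, outputs.logits.shape)  # cross-entropy loss and (2, 8) logits
```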

Things to try

One interesting aspect of Geneformer is its use of a rank value encoding to represent each cell's transcriptome. This nonparametric approach may be more robust to technical artifacts that can bias the absolute transcript counts, while still preserving the relative ranking of genes that distinguish cell state. Researchers could explore how this rank-based representation affects the model's performance and interpretability compared to more traditional approaches.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


ProtGPT2

nferruz

Total Score

83

ProtGPT2 is a large language model trained on protein sequences, enabling it to "speak the protein language" and generate novel protein sequences that conserve the critical features of natural proteins. It is based on the GPT-2 transformer architecture and contains 36 layers with a model dimension of 1280, totaling 738 million parameters. ProtGPT2 was pretrained on the UniRef50 protein sequence database in a self-supervised fashion, learning to predict the next amino acid in a sequence. This allows the model to capture the underlying "grammar" of protein structures and sequences. Similar models like DistilGPT2 and GPT-2B-001 also use transformer architectures, but they are trained on different datasets and for different purposes; ProtGPT2 is uniquely focused on the protein domain, while the others are general-purpose language models.

Model inputs and outputs

Inputs

  • Protein sequences: ProtGPT2 takes protein sequences as input, represented as sequences of amino acid tokens. The model can accept sequences of varying lengths.

Outputs

  • Protein sequences: Given a starting token or sequence, ProtGPT2 can generate novel protein sequences that maintain the statistical properties of natural proteins, such as amino acid propensities, secondary structure content, and globularity.

Capabilities

ProtGPT2 excels at generating de novo protein sequences that conserve the key features of natural proteins. By learning the underlying "grammar" of the protein language, the model can explore unseen regions of protein sequence space in a principled way. This makes ProtGPT2 a powerful tool for protein design and engineering, as the generated sequences can serve as starting points for further optimization and testing.

What can I use it for?

ProtGPT2 can be used for a variety of protein-related tasks, such as:

  • De novo protein design: generate novel protein sequences with desired properties for applications in biotechnology, medicine, and materials science.
  • Protein engineering: explore sequence space and identify starting points for further optimization of existing proteins.
  • Protein feature extraction: leverage the model's learned representations to extract useful features of protein sequences for downstream tasks like structure prediction or function annotation.

The maintainer, nferruz, provides detailed instructions on how to use ProtGPT2 with the Hugging Face Transformers library.

Things to try

One interesting aspect of ProtGPT2 is its ability to generate sequences that maintain the statistical properties of natural proteins while exploring previously unseen regions of protein sequence space. Researchers can use the model to generate diverse sets of candidate proteins for various applications, then analyze the generated sequences to gain insight into the "language of life" encoded in protein structures. The model's performance on downstream tasks like structure prediction and function annotation can also be explored further, as the learned representations may capture meaningful biophysical and structural features of proteins.
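As a quick illustration of the usage pattern the maintainer describes, the following sketch generates candidate sequences with the standard Transformers text-generation pipeline; the sampling parameters shown are illustrative rather than recommended settings.

```python
from transformers import pipeline

# Load ProtGPT2 through the standard text-generation pipeline.
protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

# Generate candidate protein sequences. Sampling parameters are illustrative,
# not the maintainer's recommended settings.
sequences = protgpt2(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)
for seq in sequences:
    print(seq["generated_text"])
```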



gpt2

openai-community

Total Score

2.0K

gpt2 is a transformer-based language model created and released by OpenAI. It is the smallest version of the GPT-2 family, with 124 million parameters. Like the other GPT-2 models, gpt2 is a causal language model pretrained on a large corpus of English text with a self-supervised objective: predicting the next token in a sequence. This gives the model a general understanding of English that can be leveraged for a variety of downstream tasks. gpt2 is related to the larger GPT-2 variants GPT2-Medium, GPT2-Large, and GPT2-XL, which have 355 million, 774 million, and 1.5 billion parameters respectively and were also released by the OpenAI community.

Model inputs and outputs

Inputs

  • Text sequence: The model takes a sequence of text as input, which it uses to generate additional text.

Outputs

  • Generated text: The model outputs a continuation of the input text sequence, generating new text one token at a time in an autoregressive fashion.

Capabilities

The gpt2 model can generate fluent, coherent English text on a wide variety of topics. It can be used for tasks like creative writing, text summarization, and language modeling. However, as the OpenAI team notes, the model does not distinguish fact from fiction, so it should not be used for applications that require the generated text to be truthful.

What can I use it for?

The gpt2 model can be used for a variety of text generation tasks. Researchers may use it to better understand the behaviors, capabilities, and biases of large-scale language models. The model can also be fine-tuned for applications like grammar assistance, auto-completion, creative writing, and chatbots. Users should be aware of the model's limitations and potential for biased or harmful output, as discussed in the OpenAI model card.

Things to try

One interesting aspect of gpt2 is its ability to generate diverse and creative text from a given prompt. You can experiment with different kinds of starting prompts, such as the beginning of a story, a description of a scene, or even a single word, and see what coherent and imaginative text the model generates in response. You can also try fine-tuning the model on a specific domain or task and compare its performance and output to the base model.
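A minimal generation example with the standard Transformers pipeline (the prompt and sampling settings are arbitrary):

```python
from transformers import pipeline, set_seed

# Standard text-generation pipeline with the 124M-parameter gpt2 checkpoint.
generator = pipeline("text-generation", model="gpt2")
set_seed(42)

# Generate a few continuations of an arbitrary prompt.
results = generator("Hello, I'm a language model,",
                    max_length=30, num_return_sequences=3)
for r in results:
    print(r["generated_text"])
```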



graphormer-base-pcqm4mv2

clefourrier

Total Score

54

The graphormer-base-pcqm4mv2 model is a graph classification model developed by Microsoft. It is a Graphormer, a type of graph Transformer, pretrained on the PCQM4M-LSCv2 dataset. Graphormer offers an alternative to traditional graph models and large language models, providing a practical solution for graph-related tasks. Similar models include Geneformer, a foundation transformer model pretrained on ~30 million single cell transcriptomes to enable context-aware predictions in network biology tasks.

Model inputs and outputs

Inputs

  • Graph data: The model takes graph-structured data as input, such as molecular graphs or other relational data.

Outputs

  • Graph classification: The primary output is a classification of the input graph, such as a predicted property of a molecule.

Capabilities

The graphormer-base-pcqm4mv2 model can be used for a variety of graph classification tasks, particularly those related to molecule modeling. It can handle large graphs without running into memory issues, making it a practical solution for real-world applications.

What can I use it for?

The graphormer-base-pcqm4mv2 model can be used directly for graph classification or fine-tuned on downstream tasks. Potential use cases include:

  • Molecular property prediction
  • Chemical reaction prediction
  • Drug discovery
  • Material design
  • Social network analysis
  • Knowledge graph reasoning

Things to try

One key aspect of the graphormer-base-pcqm4mv2 model is its ability to handle large graphs efficiently. Developers can experiment with the model on various graph-structured datasets to see how it performs compared to traditional graph models or large language models. Fine-tuning the model on specific domains or tasks can also unlock new capabilities and insights.
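If you want to experiment with the checkpoint programmatically, the sketch below assumes a Transformers release that includes Graphormer support; preparing real graph inputs requires Graphormer's preprocessing and collation utilities, which are omitted here.

```python
from transformers import GraphormerForGraphClassification

# Assumption: this import is available in Transformers releases that ship
# Graphormer support. Real inputs must first be preprocessed into Graphormer's
# node, edge, and spatial-position tensors, which is not shown here.
model = GraphormerForGraphClassification.from_pretrained(
    "clefourrier/graphormer-base-pcqm4mv2"
)
print(sum(p.numel() for p in model.parameters()))  # total parameter count
```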



biogpt

microsoft

Total Score

205

biogpt is a domain-specific generative transformer language model pretrained on large-scale biomedical literature by researchers at Microsoft. It was developed to address the lack of generation ability in other popular biomedical language models like BioBERT and PubMedBERT, which are constrained to discriminative downstream tasks. In contrast, biogpt demonstrates strong performance on a variety of biomedical natural language processing tasks, including relation extraction and question answering.

Model inputs and outputs

Inputs

  • Raw text data in the biomedical domain, such as research abstracts or papers

Outputs

  • Automatically generated text in the biomedical domain, such as descriptions of biomedical terms or concepts
  • Embeddings and representations of biomedical text that can be used for downstream tasks

Capabilities

biogpt can generate fluent, coherent text in the biomedical domain. For example, when prompted with "COVID-19 is", the model can produce relevant and informative continuations such as "COVID-19 is a disease that spreads worldwide and is currently found in a growing proportion of the population" or "COVID-19 is transmitted via droplets, air-borne, or airborne transmission."

What can I use it for?

biogpt can be used for a variety of biomedical NLP applications, such as:

  • Biomedical text generation: automatically generating descriptions, summaries, or explanations of biomedical concepts and findings.
  • Downstream biomedical tasks: fine-tuning biogpt on specific tasks like relation extraction, question answering, or biomedical text classification.
  • Biomedical text understanding: using biogpt embeddings as input features for downstream biomedical ML models.

Things to try

One interesting aspect of biogpt is its strong performance on biomedical relation extraction, achieving over 40% F1 on benchmarks like BC5CDR and KD-DTI. Researchers could use biogpt as a starting point for building more advanced biomedical information extraction systems, leveraging its ability to understand complex biomedical relationships and entities.
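A minimal sketch of the generation example described above, using the BioGPT classes available in the Transformers library (the sampling settings are illustrative):

```python
from transformers import BioGptForCausalLM, BioGptTokenizer, pipeline, set_seed

# Load BioGPT and run the prompt from the description above.
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

set_seed(42)
results = generator("COVID-19 is", max_length=20,
                    num_return_sequences=3, do_sample=True)
for r in results:
    print(r["generated_text"])
```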
