[](#protbert-model)ProtBert model
=================================

Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in [this paper](https://doi.org/10.1101/2020.07.12.199554) and first released in [this repository](https://github.com/agemagician/ProtTrans). This model is trained on uppercase amino acids: it only works with capital letter amino acids.

[](#model-description)Model description
---------------------------------------

ProtBert is based on Bert model which pretrained on a large corpus of protein sequences in a self-supervised fashion. This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those protein sequences.

One important difference between our Bert model and the original Bert version is the way of dealing with sequences as separate documents. This means the Next sentence prediction is not used, as each sequence is treated as a complete document. The masking follows the original Bert training with randomly masks 15% of the amino acids in the input.

At the end, the feature extracted from this model revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences.

[](#intended-uses--limitations)Intended uses & limitations
----------------------------------------------------------

The model could be used for protein feature extraction or to be fine-tuned on downstream tasks. We have noticed in some tasks you could gain more accuracy by fine-tuning the model rather than using it as a feature extractor.

### [](#how-to-use)How to use

You can use this model directly with a pipeline for masked language modeling:

    >>> from transformers import BertForMaskedLM, BertTokenizer, pipeline
    >>> tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False )
    >>> model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")
    >>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
    >>> unmasker('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T')
    
    [{'score': 0.11088453233242035,
      'sequence': '[CLS] D L I P T S S K L V V L D T S L Q V K K A F F A L V T [SEP]',
      'token': 5,
      'token_str': 'L'},
     {'score': 0.08402521163225174,
      'sequence': '[CLS] D L I P T S S K L V V S D T S L Q V K K A F F A L V T [SEP]',
      'token': 10,
      'token_str': 'S'},
     {'score': 0.07328339666128159,
      'sequence': '[CLS] D L I P T S S K L V V V D T S L Q V K K A F F A L V T [SEP]',
      'token': 8,
      'token_str': 'V'},
     {'score': 0.06921856850385666,
      'sequence': '[CLS] D L I P T S S K L V V K D T S L Q V K K A F F A L V T [SEP]',
      'token': 12,
      'token_str': 'K'},
     {'score': 0.06382402777671814,
      'sequence': '[CLS] D L I P T S S K L V V I D T S L Q V K K A F F A L V T [SEP]',
      'token': 11,
      'token_str': 'I'}]
    

Here is how to use this model to get the features of a given protein sequence in PyTorch:

    from transformers import BertModel, BertTokenizer
    import re
    tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False )
    model = BertModel.from_pretrained("Rostlab/prot_bert")
    sequence_Example = "A E T C Z A O"
    sequence_Example = re.sub(r"[UZOB]", "X", sequence_Example)
    encoded_input = tokenizer(sequence_Example, return_tensors='pt')
    output = model(**encoded_input)
    

[](#training-data)Training data
-------------------------------

The ProtBert model was pretrained on [Uniref100](https://www.uniprot.org/downloads), a dataset consisting of 217 million protein sequences.

[](#training-procedure)Training procedure
-----------------------------------------

### [](#preprocessing)Preprocessing

The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 21. The rare amino acids "U,Z,O,B" were mapped to "X". The inputs of the model are then of the form:

    [CLS] Protein Sequence A [SEP] Protein Sequence B [SEP]
    

Furthermore, each protein sequence was treated as a separate document. The preprocessing step was performed twice, once for a combined length (2 sequences) of less than 512 amino acids, and another time using a combined length (2 sequences) of less than 2048 amino acids.

The details of the masking procedure for each sequence followed the original Bert model as following:

*   15% of the amino acids are masked.
*   In 80% of the cases, the masked amino acids are replaced by `[MASK]`.
*   In 10% of the cases, the masked amino acids are replaced by a random amino acid (different) from the one they replace.
*   In the 10% remaining cases, the masked amino acids are left as is.

### [](#pretraining)Pretraining

The model was trained on a single TPU Pod V3-512 for 400k steps in total. 300K steps using sequence length 512 (batch size 15k), and 100K steps using sequence length 2048 (batch size 2.5k). The optimizer used is Lamb with a learning rate of 0.002, a weight decay of 0.01, learning rate warmup for 40k steps and linear decay of the learning rate after.

[](#evaluation-results)Evaluation results
-----------------------------------------

When fine-tuned on downstream tasks, this model achieves the following results:

Test results :

Task/Dataset

secondary structure (3-states)

secondary structure (8-states)

Localization

Membrane

CASP12

75

63

TS115

83

72

CB513

81

66

DeepLoc

79

91

### [](#bibtex-entry-and-citation-info)BibTeX entry and citation info

    @article {Elnaggar2020.07.12.199554,
        author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
        title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
        elocation-id = {2020.07.12.199554},
        year = {2020},
        doi = {10.1101/2020.07.12.199554},
        publisher = {Cold Spring Harbor Laboratory},
        abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: \&lt;a href="https://github.com/agemagician/ProtTrans"\&gt;https://github.com/agemagician/ProtTrans\&lt;/a\&gt;Competing Interest StatementThe authors have declared no competing interest.},
        URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
        eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
        journal = {bioRxiv}
    }
    

> Created by [Ahmed Elnaggar/@Elnaggar\_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)

## Model overview

The `prot_bert` model is a masked language model (MLM) trained on a large corpus of protein sequences. It was developed by the Rostlab team and is based on the BERT architecture, which is known for its strong performance on a variety of natural language processing tasks. Unlike the original BERT model, which was trained on general text data, `prot_bert` was specifically trained on protein sequences, allowing it to capture the unique language and patterns inherent in biological data.

One key difference between `prot_bert` and the standard BERT models is how it handles sequences. Rather than treating each protein sequence as a separate document, `prot_bert` considers the entire sequence as a complete unit, foregoing the next sentence prediction task used in the original BERT. Instead, it focuses solely on the masked language modeling objective, where the model must predict masked amino acids based on the surrounding context.

The [BERT base model (uncased)](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert) and [RoBERTa large model](https://aimodels.fyi/models/huggingFace/roberta-large-facebookai) are two similar transformer-based models that have been pretrained on general text data. While these models can be fine-tuned for various NLP tasks, `prot_bert` is specifically tailored for working with protein sequences and may provide advantages in bioinformatics and computational biology applications.

## Model inputs and outputs

### Inputs
- **Protein sequences**: The `prot_bert` model takes as input protein sequences consisting of uppercase amino acid characters. The model can handle sequences of up to 512 amino acids.

### Outputs
- **Predicted masked amino acids**: Given a protein sequence with 15% of the amino acids masked, the `prot_bert` model outputs the predicted masked amino acids, along with their corresponding scores.

## Capabilities

The `prot_bert` model has demonstrated its ability to capture important biophysical properties of proteins, such as their shape and structure, simply by being trained on unlabeled protein sequences. This suggests that the model has learned some of the underlying "grammar" of the language of life, as realized in protein sequences.

The model can be used for a variety of tasks in computational biology and bioinformatics, such as protein feature extraction or fine-tuning on downstream tasks like protein structure prediction or function annotation. The maintainers have found that in some cases, fine-tuning the model can lead to better performance than using it solely as a feature extractor.

## What can I use it for?

The `prot_bert` model can be a valuable tool for researchers and developers working in the field of computational biology and bioinformatics. By leveraging the model's ability to extract useful features from protein sequences, you can build more accurate and efficient models for tasks like:

- **Protein structure prediction**: Use the model's embeddings as input features to predict the three-dimensional structure of a protein.
- **Protein function annotation**: Fine-tune the model on labeled data to predict the function of a given protein sequence.
- **Protein engineering**: Explore how changes to a protein sequence affect its properties by analyzing the model's predictions.

The [Rostlab team](https://aimodels.fyi/creators/huggingFace/Rostlab) has made the `prot_bert` model available through the Hugging Face model hub, making it easily accessible for researchers and developers to experiment with and integrate into their own projects.

## Things to try

One interesting aspect of the `prot_bert` model is its ability to capture the "grammar" of protein sequences, even without any explicit human labeling. This suggests that the model may be able to uncover novel insights about protein structure and function that are not immediately obvious from the raw sequence data.

Researchers could try fine-tuning the `prot_bert` model on specific protein-related tasks, such as predicting the stability or solubility of a protein, and analyze the model's intermediate representations to gain a better understanding of the underlying biological principles at play. Additionally, the model could be used to generate synthetic protein sequences with desired properties, opening up new possibilities for protein engineering and design.