SciBERT
===================

This is the pretrained model presented in [SciBERT: A Pretrained Language Model for Scientific Text](https://www.aclweb.org/anthology/D19-1371/), which is a BERT model trained on scientific text.

The training corpus consists of papers taken from [Semantic Scholar](https://www.semanticscholar.org). It contains 1.14M papers and 3.1B tokens. We use the full text of the papers in training, not just the abstracts.

SciBERT has its own wordpiece vocabulary (scivocab) that's built to best match the training corpus. We trained cased and uncased versions.

Available models include:

*   `scibert_scivocab_cased`
*   `scibert_scivocab_uncased`
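As a minimal sketch of how one of these models might be loaded, the snippet below uses the Hugging Face `transformers` library. It assumes the checkpoints are published on the Hugging Face Hub under the `allenai` organization; that prefix, and the helper names `hub_id` and `load_scibert`, are illustrative assumptions and not something this card states.

```python
# Assumption: the checkpoints are hosted on the Hugging Face Hub under the
# "allenai" organization. This card itself only gives the bare model names.
HUB_ORG = "allenai"
MODEL_NAMES = ("scibert_scivocab_cased", "scibert_scivocab_uncased")


def hub_id(model_name: str) -> str:
    """Map a model name from the list above to its assumed Hub identifier."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown SciBERT model: {model_name!r}")
    return f"{HUB_ORG}/{model_name}"


def load_scibert(model_name: str = "scibert_scivocab_uncased"):
    """Download (or read from the local cache) the tokenizer and encoder."""
    # Imported lazily so that hub_id() works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    repo = hub_id(model_name)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModel.from_pretrained(repo)
    return tokenizer, model
```

Calling `tokenizer, model = load_scibert()` would then fetch the uncased variant; pass `"scibert_scivocab_cased"` for the cased one.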

The original repo can be found [here](https://github.com/allenai/scibert).

If using these models, please cite the following paper:

    @inproceedings{beltagy-etal-2019-scibert,
        title = "SciBERT: A Pretrained Language Model for Scientific Text",
        author = "Beltagy, Iz  and Lo, Kyle  and Cohan, Arman",
        booktitle = "EMNLP",
        year = "2019",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/D19-1371"
    }

## Model overview

The `scibert_scivocab_uncased` model is a BERT model trained on scientific text, as presented in the paper [SciBERT: A Pretrained Language Model for Scientific Text](https://www.aclweb.org/anthology/D19-1371/). This model was trained on a large corpus of 1.14M scientific papers from [Semantic Scholar](https://www.semanticscholar.org), using the full text of the papers, not just abstracts. Unlike the general-purpose [BERT base models](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert), `scibert_scivocab_uncased` has a specialized vocabulary that is optimized for scientific text.

## Model inputs and outputs

### Inputs
- Text sequences, which are lowercased during tokenization (this is the uncased model)

### Outputs
- Contextual token-level representations
- Sequence-level representations
- Predictions for masked tokens in the input
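The three outputs listed above can be read off a single forward pass. The following is a sketch under stated assumptions, not a reference implementation: it assumes the `transformers` and `torch` packages, the Hub identifier `allenai/scibert_scivocab_uncased`, and that the hosted checkpoint includes masked-language-model head weights (if it does not, `transformers` initializes that head randomly and warns). Using the `[CLS]` vector as the sequence-level representation is one common choice among several. The helpers `mask_word` and `inspect_outputs` are hypothetical names.

```python
MASK_TOKEN = "[MASK]"  # the mask token used by BERT-style tokenizers


def mask_word(text: str, word: str) -> str:
    """Replace the first occurrence of the substring `word` with the mask token."""
    if word not in text:
        raise ValueError(f"{word!r} does not occur in the text")
    return text.replace(word, MASK_TOKEN, 1)


def inspect_outputs(
    masked_text: str,
    repo: str = "allenai/scibert_scivocab_uncased",  # assumed Hub identifier
    top_k: int = 5,
):
    """Return token-level vectors, a sequence-level vector, and mask predictions."""
    # Lazy imports: torch and transformers are only needed when the model runs.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForMaskedLM.from_pretrained(repo)
    model.eval()

    inputs = tokenizer(masked_text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Token-level: final encoder layer, one vector per wordpiece.
    token_reps = outputs.hidden_states[-1][0]  # shape (seq_len, hidden_size)
    # Sequence-level: the vector at the [CLS] position (index 0).
    sequence_rep = token_reps[0]
    # Masked-token predictions: top-k vocabulary entries at each [MASK] position.
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    top_ids = outputs.logits[0][is_mask].topk(top_k, dim=-1).indices
    predictions = [tokenizer.convert_ids_to_tokens(row.tolist()) for row in top_ids]
    return token_reps, sequence_rep, predictions
```

For example, `inspect_outputs(mask_word("the cell membrane is permeable", "membrane"))` would return one list of candidate wordpieces for the single masked position.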

## Capabilities

The `scibert_scivocab_uncased` model is suited to natural language understanding tasks on scientific text, such as text classification, named entity recognition, and question answering, typically after fine-tuning on task data. The SciBERT paper reports improvements over BERT-Base on a range of scientific NLP tasks.

## What can I use it for?

You can use `scibert_scivocab_uncased` as the text encoder in a wide range of applications that involve processing scientific text, such as:

- Automating literature review and paper summarization
- Improving search and recommendation systems for scientific publications
- Enhancing scientific knowledge extraction and hypothesis generation
- Powering chatbots and virtual assistants for researchers and scientists

The specialized vocabulary and training data of this model make it particularly well-suited for tasks that require in-depth understanding of scientific concepts and terminology.

## Things to try

One interesting aspect of `scibert_scivocab_uncased` is its ability to handle domain-specific terminology and jargon. You could try using it for tasks like:

- Extracting key technical concepts and entities from research papers
- Classifying papers into different scientific disciplines based on their content
- Building extractive summaries of complex scientific documents by scoring and selecting sentences (the model is an encoder and does not generate text on its own)
- Answering questions about the methods, findings, or implications of a research study

By leveraging the model's deep understanding of scientific language, you can develop novel applications that augment the work of researchers, clinicians, and other domain experts.
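As one illustration of the paper-classification idea above, the sketch below attaches a classification head to the encoder with `transformers`. Several things here are assumptions: the Hub identifier `allenai/scibert_scivocab_uncased`, the hypothetical label set `DISCIPLINES`, and the helper names `label_maps` and `build_classifier`. The classification head is newly initialized, so the model would need fine-tuning on labeled papers before its predictions mean anything.

```python
# Hypothetical label set, for illustration only.
DISCIPLINES = ["biology", "computer science", "physics"]


def label_maps(labels):
    """Build the id-to-label and label-to-id mappings a classifier config expects."""
    id2label = dict(enumerate(labels))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def build_classifier(
    labels=DISCIPLINES,
    repo: str = "allenai/scibert_scivocab_uncased",  # assumed Hub identifier
):
    """Load the pretrained encoder with an untrained classification head on top."""
    # Lazy import so label_maps() works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForSequenceClassification.from_pretrained(
        repo,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

The same pattern applies to entity extraction by swapping in `AutoModelForTokenClassification` with a tag set in place of the discipline labels.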