
Distilbert

Models by this creator

🧪

distilbert-base-uncased-finetuned-sst-2-english

distilbert

Total Score

477

The distilbert-base-uncased-finetuned-sst-2-english model is a fine-tuned version of DistilBERT-base-uncased, a smaller and faster variant of the original BERT base model. It was fine-tuned on the Stanford Sentiment Treebank (SST-2) dataset, a popular text classification benchmark. Compared to the original BERT base model, this DistilBERT model has 40% fewer parameters and runs 60% faster, while still preserving over 95% of BERT's performance on the GLUE language understanding benchmark. DistilBERT models like this one belong to a family of compressed models developed by the Hugging Face team; the distilroberta-base model, a distilled version of the RoBERTa base model, is another example. These compressed models are designed to be more efficient and practical for real-world applications while maintaining high performance on common NLP tasks.

Model inputs and outputs

Inputs

- Text: a single text sequence, which can be a sentence, paragraph, or longer passage.

Outputs

- Label: a single classification label indicating whether the input text has positive or negative sentiment.
- Probability: a confidence score for the predicted label.

Capabilities

The distilbert-base-uncased-finetuned-sst-2-english model performs sentiment analysis: predicting whether a given text has positive or negative sentiment. This is useful for applications like customer feedback analysis, social media monitoring, or review aggregation.

What can I use it for?

You can use this model to classify the sentiment of any English text, such as product reviews, social media posts, or customer support conversations. This could help you gain insight into customer sentiment, identify areas for improvement, or automate sentiment-based filtering and routing. For example, you could integrate the model into a customer support chatbot to detect frustrated or angry customers and route them to a human agent, or analyze social media mentions of your brand to gauge overall sentiment over time.

Things to try

One interesting exercise is to probe the model's biases and limitations. As the model card mentions, language models like this one can propagate harmful stereotypes and biases. Try carefully crafted inputs to see how the model responds, and keep these issues in mind when using it in production. You could also fine-tune the model further on your own dataset, or combine it with other NLP models and techniques to build more sophisticated sentiment analysis pipelines.
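As a minimal sketch of that usage (assuming the Hugging Face transformers library is installed; the example reviews are illustrative), the checkpoint can be loaded through the pipeline API:

```python
from transformers import pipeline

# Load the fine-tuned sentiment model from the Hugging Face Hub.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

reviews = [
    "I love this product, it exceeded my expectations!",
    "The support experience was slow and frustrating.",
]

for review, result in zip(reviews, classifier(reviews)):
    # Each result is a dict with a "label" (POSITIVE/NEGATIVE) and a "score".
    print(f"{result['label']} ({result['score']:.3f}): {review}")
```

In a routing scenario, the returned label and score could be thresholded to decide when to hand a conversation to a human agent.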


Updated 5/15/2024

👀

distilbert-base-uncased

distilbert

Total Score

427

The distilbert-base-uncased model is a distilled version of the BERT base model, developed by Hugging Face. It is smaller, faster, and more efficient than the original BERT model, while preserving over 95% of BERT's performance on the GLUE language understanding benchmark. The model was trained using knowledge distillation: it learned to mimic the outputs of the BERT base model on a large corpus of text data. Compared to BERT base, distilbert-base-uncased has 40% fewer parameters and runs 60% faster, making it a more lightweight and efficient option. The DistilBERT base cased distilled SQuAD model is another DistilBERT variant, fine-tuned specifically for question answering on the SQuAD dataset.

Model inputs and outputs

Inputs

- Uncased text sequences, where capitalization and accent markers are ignored.

Outputs

- Contextual word embeddings for each input token.
- Probability distributions over the vocabulary for masked tokens, when used for masked language modeling.
- Logits for downstream tasks like sequence classification, token classification, or question answering, when fine-tuned.

Capabilities

The distilbert-base-uncased model can be used for a variety of natural language processing tasks, including text classification, named entity recognition, and question answering. Its smaller size and faster inference make it well suited for deployment in resource-constrained environments. For example, the model can be fine-tuned for sentiment analysis, taking in a piece of text and outputting the predicted sentiment (positive, negative, or neutral). It can also be fine-tuned for named entity recognition, identifying and classifying named entities like people, organizations, and locations within a given text.

What can I use it for?

The distilbert-base-uncased model can be used for a wide range of natural language processing tasks, particularly those that benefit from a smaller, more efficient model. Some potential use cases include:

- Content moderation: fine-tuning the model on user-generated content to detect harmful or abusive language.
- Chatbots and virtual assistants: incorporating the model into a conversational AI system to understand and respond to user queries.
- Sentiment analysis: fine-tuning the model to classify the sentiment of customer reviews or social media posts.
- Named entity recognition: using the model to extract entities like people, organizations, and locations from text.

The model's smaller size and faster inference make it a good choice for deploying NLP capabilities on resource-constrained devices or in low-latency applications.

Things to try

One interesting aspect of distilbert-base-uncased is its ability to generate reasonable predictions even when the input text is partially masked. You could experiment with different masking strategies to see how the model performs on fill-in-the-blank or cloze-style questions. Another avenue is fine-tuning the model on domain-specific datasets, such as medical literature or legal documents, and evaluating its performance on tasks like information extraction or document classification. Finally, you could compare distilbert-base-uncased against the original BERT base model or other lightweight transformer variants to better understand the trade-offs between model size, speed, and accuracy for your particular use case.
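The masked-prediction behavior described above can be sketched with the fill-mask pipeline (a minimal example assuming the transformers library is installed; the masked sentence is illustrative):

```python
from transformers import pipeline

# DistilBERT was pre-trained with masked language modeling, so it can
# predict words hidden behind the [MASK] token.
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")

predictions = unmasker("The goal of a distilled model is to be smaller and [MASK].")

for p in predictions:
    # Each prediction pairs a candidate token with its probability.
    print(f"{p['token_str']!r}: {p['score']:.3f}")
```

By default the pipeline returns the five most likely tokens; varying which word you mask is an easy way to explore the cloze-style behavior mentioned above.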


Updated 5/15/2024

๐Ÿ‹๏ธ

distilgpt2

distilbert

Total Score

365

DistilGPT2 is a smaller, faster, and lighter version of the GPT-2 language model, developed using knowledge distillation from the larger GPT-2 model. Like GPT-2, DistilGPT2 can be used to generate text, but it has 82 million parameters compared to the 124 million of the smallest version of GPT-2. The DistilBERT model is another Hugging Face model developed using a similar distillation approach to compress the BERT base model; DistilBERT retains over 95% of BERT's performance while being 40% smaller and 60% faster.

Model inputs and outputs

Inputs

- Text: a prompt, which can be a single sentence or a sequence of sentences.

Outputs

- Generated text: a sequence of text continuing the input in a coherent and fluent manner.

Capabilities

DistilGPT2 can be used for a variety of language generation tasks, such as:

- Story generation: given a prompt, DistilGPT2 can continue the story with additional relevant text.
- Dialogue generation: DistilGPT2 can generate responses in a conversational setting.
- Summarization: DistilGPT2 can be fine-tuned to generate concise summaries of longer text.

However, like its parent model GPT-2, DistilGPT2 may also produce biased or harmful content, as it reflects the biases present in its training data.

What can I use it for?

DistilGPT2 can be a useful tool for businesses and developers who want language generation capabilities without the computational cost of running the full GPT-2 model. Some potential use cases include:

- Chatbots and virtual assistants: fine-tuning DistilGPT2 to hold more natural and coherent conversations.
- Content generation: producing product descriptions, social media posts, or other text content.
- Language learning: generating sample sentences or dialogues for language learners to practice with.

Users should remain cautious about biased or inappropriate outputs and carefully evaluate the model's performance for their specific use case.

Things to try

One interesting aspect of DistilGPT2 is its ability to generate text that is both coherent and concise, thanks to the knowledge distillation process. Try prompting the model with open-ended questions or topics and compare the output to what a larger model like GPT-2 would generate. You can also experiment with different decoding strategies, such as adjusting the temperature or top-k/top-p sampling, to control the creativity and diversity of the generated text.
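Those decoding strategies can be sketched with the text-generation pipeline (a minimal example assuming the transformers library is installed; the prompt and sampling settings are illustrative, not recommendations):

```python
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="distilgpt2")
set_seed(42)  # make the sampled continuations reproducible

outputs = generator(
    "Once upon a time, a small language model",
    max_new_tokens=30,
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,     # < 1.0 sharpens the token distribution
    top_k=50,            # keep only the 50 most likely tokens
    top_p=0.95,          # nucleus sampling over 95% of probability mass
    num_return_sequences=2,
)

for out in outputs:
    print(out["generated_text"])
```

Raising the temperature or loosening top-k/top-p makes the continuations more diverse; lowering them makes the output more conservative.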


Updated 5/15/2024

🛠️

distilbert-base-cased-distilled-squad

distilbert

Total Score

172

The distilbert-base-cased-distilled-squad model is a smaller and faster version of the BERT base model that has been fine-tuned on the SQuAD question answering dataset. Developed by the Hugging Face team, it is based on the DistilBERT architecture, which has 40% fewer parameters than the original BERT base model and runs 60% faster while preserving over 95% of BERT's performance on language understanding benchmarks. It is similar to the distilbert-base-uncased-distilled-squad model, which is the uncased DistilBERT base model fine-tuned on SQuAD. Both models are designed for extractive question answering, where the goal is to pull an answer out of a given context text in response to a question.

Model inputs and outputs

Inputs

- Question: a natural language question that the model should answer.
- Context: the text containing the information needed to answer the question.

Outputs

- Answer: the text span from the provided context that answers the question.
- Start and end indices: the starting and ending character indices of the answer text within the context.
- Confidence score: a value between 0 and 1 indicating the model's confidence in the predicted answer.

Capabilities

The distilbert-base-cased-distilled-squad model performs question answering on English text: it understands the context and extracts the most relevant answer to a given question. Because it was fine-tuned on SQuAD, which covers a wide range of question types and topics, it is useful for a variety of question answering applications.

What can I use it for?

This model suits any application that needs to extract answers from text in response to natural language questions, such as:

- Building conversational AI assistants that can answer questions about a given topic or document
- Enhancing search engines to provide direct answers to user queries
- Automating the search for relevant information in large text corpora, such as legal documents or technical manuals

Things to try

Some interesting things to try with the distilbert-base-cased-distilled-squad model include:

- Evaluating its performance on a specific domain or dataset to see how it generalizes beyond SQuAD
- Experimenting with different question types or phrasings to understand the model's strengths and limitations
- Comparing its performance to other question answering models, or to human experts, on the same task
- Further fine-tuning or adapting the model for your use case, for example by incorporating domain-specific knowledge or training on additional data

Remember to carefully evaluate the model's outputs and consider potential biases or limitations before deploying it in a real-world application.
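The question/context interface described above can be sketched with the question-answering pipeline (a minimal example assuming the transformers library is installed; the question and context are illustrative and drawn from facts stated in this page):

```python
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="Who developed DistilBERT?",
    context=(
        "DistilBERT was developed by the Hugging Face team. It has 40% fewer "
        "parameters than the BERT base model and runs 60% faster."
    ),
)

# The pipeline returns the answer span, its character offsets in the
# context, and a confidence score.
print(result["answer"], result["start"], result["end"], result["score"])
```

The start/end indices let an application highlight the answer span directly in the source document.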


Updated 5/15/2024

👀

distilroberta-base

distilbert

Total Score

120

The distilroberta-base model is a distilled version of the RoBERTa-base model, developed by the Hugging Face team. It follows the same training procedure as DistilBERT, using knowledge distillation to create a smaller and faster model while preserving over 95% of RoBERTa-base's performance. The model has 6 layers, 768 dimensions, and 12 heads, totaling 82 million parameters compared to 125 million for the full RoBERTa-base model.

Model inputs and outputs

The distilroberta-base model is a transformer-based language model suited to a variety of natural language processing tasks. It takes text as input and can be used for masked language modeling, where the model predicts missing words in a sentence, or for downstream tasks like sequence classification, token classification, or question answering.

Inputs

- Text: a single sentence, a paragraph, or a longer document.

Outputs

- Predicted tokens: for masked language modeling, a probability distribution over the vocabulary for each masked token in the input.
- Classification labels: when fine-tuned on a downstream task like sequence classification, a label for the entire input sequence.
- Answer spans: when fine-tuned on a question answering task, the start and end indices of the answer span within the input context.

Capabilities

The distilroberta-base model is a versatile language model that performs well on tasks like sentiment analysis, natural language inference, and question answering, often approaching the performance of the full RoBERTa-base model while being more efficient and faster to run.

What can I use it for?

The distilroberta-base model is primarily intended to be fine-tuned on downstream tasks, as it is smaller and faster than the full RoBERTa-base model while maintaining similar performance. You can use it for tasks like:

- Sequence classification: fine-tune the model on a dataset like GLUE for sentiment analysis or natural language inference.
- Token classification: fine-tune the model on a dataset like CoNLL-2003 for named entity recognition.
- Question answering: fine-tune the model on a dataset like SQuAD to answer questions based on a given context.

Things to try

One interesting exercise is to compare the model's performance to the full RoBERTa-base model across a range of tasks. Since the model is smaller and faster, it may be a good choice for deployment in resource-constrained environments or for applications that require quick inference. You can also explore the model's limitations and biases by examining its behavior on prompts that might trigger harmful stereotypes, as noted in the DistilBERT model card.
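Before fine-tuning, the pre-trained checkpoint can be probed directly with its masked language modeling head (a minimal sketch assuming the transformers library is installed; note that RoBERTa-style models use the `<mask>` token rather than BERT's `[MASK]`):

```python
from transformers import pipeline

# RoBERTa-style tokenizers expect "<mask>" as the mask token.
unmasker = pipeline("fill-mask", model="distilroberta-base")

predictions = unmasker("The capital of France is <mask>.")

for p in predictions:
    # token_str may carry a leading space from the byte-level BPE tokenizer.
    print(f"{p['token_str'].strip()!r}: {p['score']:.3f}")
```

The same checkpoint would then be passed to a fine-tuning script for any of the downstream tasks listed above.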


Updated 5/15/2024

📶

distilbert-base-multilingual-cased

distilbert

Total Score

115

The distilbert-base-multilingual-cased model is a distilled version of the BERT base multilingual model. Developed by the Hugging Face team, it is a smaller, faster, and lighter version of the original: 6 layers, 768 dimensions, and 12 heads, totaling 134M parameters versus 177M for the original BERT multilingual model. On average, this DistilBERT model is twice as fast as the original. Similar models include distilbert-base-uncased, a distilled version of the BERT base uncased model, and the bert-base-cased and bert-base-uncased BERT base models.

Model inputs and outputs

Inputs

- Text: input text in any of the 104 languages supported by the model.

Outputs

- Token-level predictions: for example, for masked language modeling tasks.
- Sequence-level predictions: for example, for next sentence prediction tasks.

Capabilities

The distilbert-base-multilingual-cased model can perform a variety of natural language processing tasks, including text classification, named entity recognition, and question answering. It performs well on multilingual tasks, making it useful for applications that need to handle text in multiple languages.

What can I use it for?

The distilbert-base-multilingual-cased model can be fine-tuned for a variety of downstream tasks, such as:

- Text classification: sentiment analysis, topic classification, or intent detection on a labeled dataset.
- Named entity recognition: identifying and extracting named entities (e.g., people, organizations, locations) from text.
- Question answering: answering questions based on a given context after fine-tuning on a question answering dataset.

Additionally, the model's smaller size and faster inference make it a good choice for resource-constrained environments, such as mobile or edge devices.

Things to try

One interesting direction is to explore the model's multilingual capabilities. Since it was trained on 104 languages, you can feed it text in various languages and compare how it performs, or fine-tune it on a multilingual dataset to test whether that improves performance on cross-lingual tasks. Another worthwhile experiment is comparing distilbert-base-multilingual-cased to the original BERT base multilingual model in both accuracy and inference speed, to understand the trade-offs between model size, speed, and performance for your specific use case.
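The cross-language behavior described above can be probed with the fill-mask pipeline (a minimal sketch assuming the transformers library is installed; the example sentences are illustrative):

```python
from transformers import pipeline

# One checkpoint covers masked-token prediction across its 104 languages.
unmasker = pipeline("fill-mask", model="distilbert-base-multilingual-cased")

sentences = [
    "Paris is the capital of [MASK].",       # English
    "Paris est la capitale de la [MASK].",   # French
]

# Keep only the top prediction for each sentence.
top_predictions = [unmasker(sentence)[0] for sentence in sentences]

for sentence, top in zip(sentences, top_predictions):
    print(f"{sentence} -> {top['token_str']} ({top['score']:.3f})")
```

Trying parallel sentences like these in several languages is a quick way to get a feel for where the model's multilingual coverage is strong or weak.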


Updated 5/15/2024

📊

distilbert-base-uncased-distilled-squad

distilbert

Total Score

83

The distilbert-base-uncased-distilled-squad model is a smaller, faster version of the BERT base model that was trained using knowledge distillation. It was introduced in the blog post "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT" and the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". This checkpoint was fine-tuned on the SQuAD v1.1 dataset using a second step of knowledge distillation. It has 40% fewer parameters than the original BERT base model and runs 60% faster, while preserving over 95% of BERT's performance on the GLUE language understanding benchmark.

Model inputs and outputs

Inputs

- Question: a natural language question about a given context passage.
- Context: a passage of text that contains the answer to the question.

Outputs

- Answer: the span of text from the context that answers the question.
- Score: the confidence score of the predicted answer.
- Start/end indices: the starting and ending character indices of the answer span within the context.

Capabilities

The distilbert-base-uncased-distilled-squad model answers questions about a given text passage by extracting the most relevant span of text. For example, given the context:

"Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task."

and the question "What is a good example of a question answering dataset?", the model correctly predicts the answer "SQuAD dataset".

What can I use it for?

This model can power question answering systems in which users ask natural language questions about a given text and the model extracts the most relevant answer. That makes it useful for chatbots, search engines, and other information retrieval applications. The reduced size and increased speed of this DistilBERT model compared to the original BERT make it more practical for deployment in production environments with constrained compute resources.

Things to try

One interesting experiment is evaluating the model's performance on question types and text domains beyond the SQuAD dataset it was fine-tuned on. It is likely to handle factual, extractive questions well, but its performance may degrade on open-ended, complex questions that require deeper reasoning. Testing the model on a diverse set of question answering benchmarks would provide a more holistic understanding of its strengths and limitations.
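The worked example above can be reproduced directly with the question-answering pipeline (a minimal sketch assuming the transformers library is installed; the context and question are the ones given in this entry):

```python
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

context = (
    "Extractive Question Answering is the task of extracting an answer from a "
    "text given a question. An example of a question answering dataset is the "
    "SQuAD dataset, which is entirely based on that task."
)

result = qa(
    question="What is a good example of a question answering dataset?",
    context=context,
)

print(result["answer"])  # expected: "SQuAD dataset"
```

Swapping in contexts from your own domain is the quickest way to run the out-of-domain evaluation suggested above.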


Updated 5/15/2024