[](#bert-base-model-uncased)BERT base model (uncased)
=====================================================

Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is uncased: it does not distinguish between english and English.

Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

[](#model-description)Model description
---------------------------------------

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

*   Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
*   Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs.

[](#model-variations)Model variations
-------------------------------------

BERT was originally released in base and large variations, for cased and uncased input text. The uncased models also strip out accent markers.  
Chinese and multilingual uncased and cased versions followed shortly after.  
Modified preprocessing with whole word masking replaced subpiece masking in a follow-up work, with the release of two models.  
Twenty-four other smaller models were released afterward.
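
The uncased normalization mentioned above (lowercasing plus accent stripping) can be sketched with the standard library. This is a minimal illustration, assuming accents are removed by Unicode decomposition; the function name `normalize_uncased` is hypothetical and the real tokenizer performs additional cleanup steps.

```python
import unicodedata


def normalize_uncased(text: str) -> str:
    """Lowercase the text and drop combining accent marks."""
    # NFD decomposition splits an accented letter into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Unicode category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize_uncased("Héllo, Zürich!"))  # hello, zurich!
```

With this kind of normalization, `English` and `english` map to the same string, which is what "uncased" means in practice.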

The detailed release history can be found in the [google-research/bert readme](https://github.com/google-research/bert/blob/master/README.md) on GitHub.

| Model | #params | Language |
|-------|---------|----------|
| [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) | 110M | English |
| [`bert-large-uncased`](https://huggingface.co/bert-large-uncased) | 340M | English |
| [`bert-base-cased`](https://huggingface.co/bert-base-cased) | 110M | English |
| [`bert-large-cased`](https://huggingface.co/bert-large-cased) | 340M | English |
| [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 110M | Chinese |
| [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) | 110M | Multiple |
| [`bert-large-uncased-whole-word-masking`](https://huggingface.co/bert-large-uncased-whole-word-masking) | 340M | English |
| [`bert-large-cased-whole-word-masking`](https://huggingface.co/bert-large-cased-whole-word-masking) | 340M | English |
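
As a rough check on the `#params` figures, the count for a base-sized model can be reproduced with simple arithmetic. The configuration values below (30,522 WordPiece tokens, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions, 2 segment types) are assumptions taken from the commonly published `bert-base-uncased` configuration rather than from this card, and the breakdown covers only the encoder and pooler, without the pretraining heads.

```python
def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Scale and shift vectors of a layer normalization."""
    return 2 * dim


def bert_params(vocab: int, hidden: int, layers: int, ffn: int,
                max_pos: int, types: int) -> int:
    # Token, position and segment embeddings, followed by a layer norm.
    embeddings = (vocab + max_pos + types) * hidden + layer_norm(hidden)
    # Query, key, value and output projections, plus a layer norm.
    attention = 4 * linear(hidden, hidden) + layer_norm(hidden)
    # Two dense layers of the feed-forward block, plus a layer norm.
    feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm(hidden)
    # Dense layer applied to the [CLS] representation.
    pooler = linear(hidden, hidden)
    return embeddings + layers * (attention + feed_forward) + pooler


# Assumed base configuration (not stated in this card).
base = bert_params(vocab=30522, hidden=768, layers=12, ffn=3072, max_pos=512, types=2)
print(base)  # 109482240
```

Under these assumptions the total is about 109.5M, which is consistent with the rounded 110M reported for the base models.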

[](#intended-uses--limitations)Intended uses & limitations
----------------------------------------------------------

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at a model like GPT-2.

### [](#how-to-use)How to use

You can use this model directly with a pipeline for masked language modeling:

    >>> from transformers import pipeline
    >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
    >>> unmasker("Hello I'm a [MASK] model.")
    
    [{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
      'score': 0.1073106899857521,
      'token': 4827,
      'token_str': 'fashion'},
     {'sequence': "[CLS] hello i'm a role model. [SEP]",
      'score': 0.08774490654468536,
      'token': 2535,
      'token_str': 'role'},
     {'sequence': "[CLS] hello i'm a new model. [SEP]",
      'score': 0.05338378623127937,
      'token': 2047,
      'token_str': 'new'},
     {'sequence': "[CLS] hello i'm a super model. [SEP]",
      'score': 0.04667217284440994,
      'token': 3565,
      'token_str': 'super'},
     {'sequence': "[CLS] hello i'm a fine model. [SEP]",
      'score': 0.027095865458250046,
      'token': 2986,
      'token_str': 'fine'}]
    

Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained("bert-base-uncased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

and in TensorFlow:

    from transformers import BertTokenizer, TFBertModel
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = TFBertModel.from_pretrained("bert-base-uncased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    

### [](#limitations-and-bias)Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions:

    >>> from transformers import pipeline
    >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
    >>> unmasker("The man worked as a [MASK].")
    
    [{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
      'score': 0.09747550636529922,
      'token': 10533,
      'token_str': 'carpenter'},
     {'sequence': '[CLS] the man worked as a waiter. [SEP]',
      'score': 0.0523831807076931,
      'token': 15610,
      'token_str': 'waiter'},
     {'sequence': '[CLS] the man worked as a barber. [SEP]',
      'score': 0.04962705448269844,
      'token': 13362,
      'token_str': 'barber'},
     {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
      'score': 0.03788609802722931,
      'token': 15893,
      'token_str': 'mechanic'},
     {'sequence': '[CLS] the man worked as a salesman. [SEP]',
      'score': 0.037680890411138535,
      'token': 18968,
      'token_str': 'salesman'}]
    
    >>> unmasker("The woman worked as a [MASK].")
    
    [{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
      'score': 0.21981462836265564,
      'token': 6821,
      'token_str': 'nurse'},
     {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
      'score': 0.1597415804862976,
      'token': 13877,
      'token_str': 'waitress'},
     {'sequence': '[CLS] the woman worked as a maid. [SEP]',
      'score': 0.1154729500412941,
      'token': 10850,
      'token_str': 'maid'},
     {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
      'score': 0.037968918681144714,
      'token': 19215,
      'token_str': 'prostitute'},
     {'sequence': '[CLS] the woman worked as a cook. [SEP]',
      'score': 0.03042375110089779,
      'token': 5660,
      'token_str': 'cook'}]
    

This bias will also affect all fine-tuned versions of this model.
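
To make the comparison above a little more concrete, the two result lists can be compared programmatically. The sketch below copies the `token_str` and `score` fields from the outputs shown above (scores rounded to four decimals); the helper name `top_tokens` is hypothetical.

```python
def top_tokens(predictions):
    """Map token_str -> score for a fill-mask result list."""
    return {p["token_str"]: p["score"] for p in predictions}


# Top-5 predictions copied from the two outputs above (scores rounded).
man = top_tokens([
    {"token_str": "carpenter", "score": 0.0975},
    {"token_str": "waiter", "score": 0.0524},
    {"token_str": "barber", "score": 0.0496},
    {"token_str": "mechanic", "score": 0.0379},
    {"token_str": "salesman", "score": 0.0377},
])
woman = top_tokens([
    {"token_str": "nurse", "score": 0.2198},
    {"token_str": "waitress", "score": 0.1597},
    {"token_str": "maid", "score": 0.1155},
    {"token_str": "prostitute", "score": 0.0380},
    {"token_str": "cook", "score": 0.0304},
])

# Occupations predicted for both prompts.
shared = set(man) & set(woman)
print(shared)  # set()
```

The two top-5 lists share no occupation, even though the prompts differ in a single word.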

[](#training-data)Training data
-------------------------------

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books, and on [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers).

[](#training-procedure)Training procedure
-----------------------------------------

### [](#preprocessing)Preprocessing

The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form:

    [CLS] Sentence A [SEP] Sentence B [SEP]
    

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; in the other cases, sentence B is another random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constraint is that the two "sentences" together have a combined length of less than 512 tokens.
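
The pairing rule can be sketched in plain Python. This is a simplified illustration, not the original data pipeline: it assumes the corpus is already split into token lists, draws the random segment from the same list, and omits the 512-token length check. The name `make_nsp_example` is hypothetical.

```python
import random


def make_nsp_example(segments, index, rng):
    """Return (tokens, is_next) for the segment at `index`.

    `segments` is a list of token lists in corpus order and must
    contain at least three entries.
    """
    first = segments[index]
    if rng.random() < 0.5 and index + 1 < len(segments):
        # Positive example: the true continuation.
        second, is_next = segments[index + 1], True
    else:
        # Negative example: any segment other than A or its continuation.
        candidates = [i for i in range(len(segments)) if i not in (index, index + 1)]
        second, is_next = segments[rng.choice(candidates)], False
    tokens = ["[CLS]", *first, "[SEP]", *second, "[SEP]"]
    return tokens, is_next


corpus = [["the", "cat", "sat"], ["it", "purred"], ["stocks", "fell"], ["rain", "came"]]
tokens, is_next = make_nsp_example(corpus, 0, random.Random(0))
print(tokens, is_next)
```

The returned `is_next` flag is the label the next sentence prediction objective trains on.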

The details of the masking procedure for each sentence are the following:

*   15% of the tokens are masked.
*   In 80% of the cases, the masked tokens are replaced by `[MASK]`.
*   In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
*   In the 10% remaining cases, the masked tokens are left as is.
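
The masking rules listed above can be sketched as follows. This is a simplified illustration that selects each token independently with probability 0.15 (the original implementation may select positions differently) and works on string tokens; the name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Special tokens are never selected; other tokens with probability 0.15.
        if token in ("[CLS]", "[SEP]") or rng.random() >= select_prob:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"  # 80%: replace with the mask token
        elif roll < 0.9:
            # 10%: replace with a random token different from the original
            corrupted[i] = rng.choice([t for t in vocab if t != token])
        # remaining 10%: leave the token unchanged
    return corrupted, labels


vocab = ["the", "cat", "sat", "on", "mat"]
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
print(mask_tokens(sentence, vocab, random.Random(0)))
```

Positions left unchanged still carry a label, so the model cannot assume that an unmasked token is always correct.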

### [](#pretraining)Pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.
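
The schedule described above can be written as two small functions. This is a sketch under stated assumptions: the warmup is taken to be linear from zero, the decay is taken to reach zero at the final step, and the 128-token phase is taken to come first. None of these details are spelled out in this card, and the function names are hypothetical.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


def sequence_length(step, total_steps=1_000_000):
    """128 tokens for the first 90% of steps, 512 afterwards (assumed ordering)."""
    return 128 if step < total_steps * 9 // 10 else 512


for step in (0, 5_000, 10_000, 500_000, 950_000, 1_000_000):
    print(step, learning_rate(step), sequence_length(step))
```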

[](#evaluation-results)Evaluation results
-----------------------------------------

When fine-tuned on downstream tasks, this model achieves the following results:

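GLUE test results:

| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|:----:|:-----------:|:---:|:----:|:-----:|:----:|:-----:|:----:|:---:|:-------:|
|      | 84.6/83.4   | 71.2 | 90.5 | 93.5 | 52.1 | 85.8  | 88.9 | 66.4 | 79.6   |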
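
The Average figure appears to be the unweighted mean of the nine reported scores, counting the MNLI matched and mismatched results separately. That reading is an assumption, but it can be checked with the numbers above:

```python
# Scores copied from the GLUE results above; MNLI is split into m and mm.
scores = {
    "MNLI-m": 84.6, "MNLI-mm": 83.4, "QQP": 71.2, "QNLI": 90.5, "SST-2": 93.5,
    "CoLA": 52.1, "STS-B": 85.8, "MRPC": 88.9, "RTE": 66.4,
}
average = sum(scores.values()) / len(scores)
print(round(average, 1))  # 79.6
```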

### [](#bibtex-entry-and-citation-info)BibTeX entry and citation info

    @article{DBLP:journals/corr/abs-1810-04805,
      author    = {Jacob Devlin and
                   Ming{-}Wei Chang and
                   Kenton Lee and
                   Kristina Toutanova},
      title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
                   Understanding},
      journal   = {CoRR},
      volume    = {abs/1810.04805},
      year      = {2018},
      url       = {http://arxiv.org/abs/1810.04805},
      archivePrefix = {arXiv},
      eprint    = {1810.04805},
      timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
      biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    


## Model overview

The `bert-base-uncased` model is a pre-trained BERT model from Google that was trained on a large corpus of English data using a masked language modeling (MLM) objective. It is the base version of the BERT model, which comes in both base and large variations. The uncased model does not differentiate between upper and lower case English text.

The `bert-base-uncased` model demonstrates strong performance on a variety of NLP tasks, such as text classification, question answering, and named entity recognition. It can be fine-tuned on specific datasets for improved performance on downstream tasks. Similar models like [distilbert-base-cased-distilled-squad](https://aimodels.fyi/models/huggingFace/distilbert-base-cased-distilled-squad-distilbert) have been trained by distilling knowledge from BERT to create a smaller, faster model.

## Model inputs and outputs

### Inputs
- **Text Sequences**: The `bert-base-uncased` model takes in text sequences as input, typically in the form of tokenized and padded sequences of token IDs.
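
Padding a batch of token ID sequences can be sketched as follows. The helper `pad_batch` is hypothetical, and the specific IDs are assumptions about the `bert-base-uncased` vocabulary (0 for `[PAD]`, 101 for `[CLS]`, 102 for `[SEP]`); in practice the tokenizer handles this step.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad each ID list to the longest length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask


# 101 and 102 are assumed [CLS] and [SEP] IDs; 2047, 4827 and 2986 are the
# IDs of 'new', 'fashion' and 'fine' from the fill-mask output earlier.
input_ids, attention_mask = pad_batch([[101, 2047, 4827, 102], [101, 2986, 102]])
print(input_ids)       # [[101, 2047, 4827, 102], [101, 2986, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```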

### Outputs
- **Token-Level Logits**: With a masked language modeling head, the model outputs token-level logits over the vocabulary, which can be used to predict masked tokens.
- **Sequence-Level Representations**: The model also produces sequence-level representations that can be used as features for downstream tasks.

## Capabilities

The `bert-base-uncased` model is a powerful language understanding model that can be used for a wide variety of NLP tasks. It has demonstrated strong performance on benchmarks like GLUE, and can be effectively fine-tuned for specific applications. For example, the model can be used for text classification, named entity recognition, question answering, and more.

## What can I use it for?

The `bert-base-uncased` model can be used as a starting point for building NLP applications in a variety of domains. For example, you could fine-tune the model on a dataset of product reviews to build a sentiment analysis system. Or you could use the model to power a question answering system for an FAQ website. The model's versatility makes it a valuable tool for many NLP use cases.

## Things to try

One interesting thing to try with the `bert-base-uncased` model is to explore how its performance varies across different types of text. For example, you could fine-tune the model on specialized domains like legal or medical text and see how it compares to its general performance on benchmarks. Additionally, you could experiment with different fine-tuning strategies, such as using different learning rates or regularization techniques, to further optimize the model's performance for your specific use case.

[](#bert-base-chinese)Bert-base-chinese
=======================================

[](#table-of-contents)Table of Contents
---------------------------------------

*   [Model Details](#model-details)
*   [Uses](#uses)
*   [Risks, Limitations and Biases](#risks-limitations-and-biases)
*   [Training](#training)
*   [Evaluation](#evaluation)
*   [How to Get Started With the Model](#how-to-get-started-with-the-model)

[](#model-details)Model Details
-------------------------------

### [](#model-description)Model Description

This model has been pre-trained for Chinese. Training and random input masking have been applied independently to word pieces (as in the original BERT paper).

*   **Developed by:** HuggingFace team
*   **Model Type:** Fill-Mask
*   **Language(s):** Chinese
*   **License:** \[More Information needed\]
*   **Parent Model:** See the [BERT base uncased model](https://huggingface.co/bert-base-uncased) for more information about the BERT base model.

### [](#model-sources)Model Sources

*   **Paper:** [BERT](https://arxiv.org/abs/1810.04805)

[](#uses)Uses
-------------

#### [](#direct-use)Direct Use

This model can be used for masked language modeling.

[](#risks-limitations-and-biases)Risks, Limitations and Biases
--------------------------------------------------------------

**CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).

[](#training)Training
---------------------

#### [](#training-procedure)Training Procedure

*   **type\_vocab\_size:** 2
*   **vocab\_size:** 21128
*   **num\_hidden\_layers:** 12

#### [](#training-data)Training Data

\[More Information Needed\]

[](#evaluation)Evaluation
-------------------------

#### [](#results)Results

\[More Information Needed\]

[](#how-to-get-started-with-the-model)How to Get Started With the Model
-----------------------------------------------------------------------

    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    
    model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

## Model overview

The `bert-base-chinese` model is a version of the BERT base model that has been pre-trained on Chinese text. It was developed by the HuggingFace team and is based on the original [BERT paper](https://arxiv.org/abs/1810.04805). This model can be used for masked language modeling, where the model predicts missing words in a text. 

Similar models include the [BERT base uncased](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert) and [BERT base cased](https://aimodels.fyi/models/huggingFace/bert-base-cased-google-bert) models, which are trained on English text in different casing configurations, and the [BERT multilingual base uncased](https://aimodels.fyi/models/huggingFace/bert-base-multilingual-uncased-google-bert) model, which covers many languages. The [BERT large uncased](https://aimodels.fyi/models/huggingFace/bert-large-uncased-google-bert) model is a larger version of the BERT base model.

## Model inputs and outputs

### Inputs
- **Text**: The model takes Chinese text as input, which can contain masked tokens for the model to predict.

### Outputs
- **Predicted tokens**: The model outputs a probability distribution over possible tokens to fill the masked positions in the input text.
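
Turning the logits at a masked position into a probability distribution is a softmax. The sketch below uses made-up logits for three candidate tokens; the values and the helper name `top_k_probabilities` are hypothetical, and a real model scores every token in its vocabulary.

```python
import math


def top_k_probabilities(logits, k=2):
    """Softmax over a {token: logit} mapping, returning the k most probable tokens."""
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    return [(tok, weight / total) for tok, weight in ranked[:k]]


# Hypothetical logits for the masked position in "北[MASK]".
print(top_k_probabilities({"京": 5.0, "方": 2.0, "海": 1.0}))
```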

## Capabilities

The `bert-base-chinese` model can be used for masked language modeling on Chinese text. This allows the model to learn a rich representation of the Chinese language, which can then be used as a starting point for fine-tuning on downstream tasks such as text classification, named entity recognition, or question answering.

## What can I use it for?

The `bert-base-chinese` model can be used as a foundation for building natural language processing applications for the Chinese language. For example, you could fine-tune the model on a dataset of Chinese product reviews to build a sentiment analysis system. Or you could use the model to extract named entities from Chinese news articles. The rich language understanding capabilities of BERT make it a powerful starting point for a wide range of Chinese NLP tasks.

## Things to try

One interesting thing to try with the `bert-base-chinese` model is to compare its performance on Chinese language tasks to that of the multilingual BERT model. Since the multilingual BERT was trained on data from many languages, it may have a more general understanding of language, while the `bert-base-chinese` model may be more specialized for Chinese. Experimenting with these models on your specific Chinese NLP task could yield interesting insights.

[](#bert-multilingual-base-model-cased)BERT multilingual base model (cased)
===========================================================================

Pretrained model on the 104 languages with the largest Wikipedias, using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is case-sensitive: it distinguishes between english and English.

Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

[](#model-description)Model description
---------------------------------------

BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

*   Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
*   Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.

This way, the model learns an inner representation of the languages in the training set that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs.

[](#intended-uses--limitations)Intended uses & limitations
----------------------------------------------------------

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at a model like GPT-2.

### [](#how-to-use)How to use

You can use this model directly with a pipeline for masked language modeling:

    >>> from transformers import pipeline
    >>> unmasker = pipeline('fill-mask', model='bert-base-multilingual-cased')
    >>> unmasker("Hello I'm a [MASK] model.")
    
    [{'sequence': "[CLS] Hello I'm a model model. [SEP]",
      'score': 0.10182085633277893,
      'token': 13192,
      'token_str': 'model'},
     {'sequence': "[CLS] Hello I'm a world model. [SEP]",
      'score': 0.052126359194517136,
      'token': 11356,
      'token_str': 'world'},
     {'sequence': "[CLS] Hello I'm a data model. [SEP]",
      'score': 0.048930276185274124,
      'token': 11165,
      'token_str': 'data'},
     {'sequence': "[CLS] Hello I'm a flight model. [SEP]",
      'score': 0.02036019042134285,
      'token': 23578,
      'token_str': 'flight'},
     {'sequence': "[CLS] Hello I'm a business model. [SEP]",
      'score': 0.020079681649804115,
      'token': 14155,
      'token_str': 'business'}]
    

Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
    model = BertModel.from_pretrained("bert-base-multilingual-cased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

and in TensorFlow:

    from transformers import BertTokenizer, TFBertModel
    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
    model = TFBertModel.from_pretrained("bert-base-multilingual-cased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    

[](#training-data)Training data
-------------------------------

The BERT model was pretrained on the 104 languages with the largest Wikipedias. You can find the complete list [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).

[](#training-procedure)Training procedure
-----------------------------------------

### [](#preprocessing)Preprocessing

The texts are tokenized using WordPiece with a shared vocabulary of 110,000 tokens; because this model is cased, the text is not lowercased. Languages with a larger Wikipedia are under-sampled and those with fewer resources are over-sampled. For languages like Chinese, Japanese Kanji and Korean Hanja that don't use spaces between words, whitespace is added around every character in the CJK Unicode block.

The inputs of the model are then of the form:

    [CLS] Sentence A [SEP] Sentence B [SEP]
    

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; in the other cases, sentence B is another random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constraint is that the two "sentences" together have a combined length of less than 512 tokens.

The details of the masking procedure for each sentence are the following:

*   15% of the tokens are masked.
*   In 80% of the cases, the masked tokens are replaced by `[MASK]`.
*   In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
*   In the 10% remaining cases, the masked tokens are left as is.

### [](#bibtex-entry-and-citation-info)BibTeX entry and citation info

    @article{DBLP:journals/corr/abs-1810-04805,
      author    = {Jacob Devlin and
                   Ming{-}Wei Chang and
                   Kenton Lee and
                   Kristina Toutanova},
      title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
                   Understanding},
      journal   = {CoRR},
      volume    = {abs/1810.04805},
      year      = {2018},
      url       = {http://arxiv.org/abs/1810.04805},
      archivePrefix = {arXiv},
      eprint    = {1810.04805},
      timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
      biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

## Model overview

The `bert-base-multilingual-cased` model is a multilingual BERT model trained on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective. It was introduced in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)" and first released in the [google-research/bert repository](https://github.com/google-research/bert). This cased model differs from the uncased version in that it maintains the distinction between uppercase and lowercase letters.

BERT is a transformer-based model that was pretrained in a self-supervised manner on a large corpus of text data, without any human labeling. It was trained using two main objectives: masked language modeling, where the model must predict masked words in the input, and next sentence prediction, where the model predicts if two sentences were originally next to each other. This allows BERT to learn rich contextual representations of language that can be leveraged for a variety of downstream tasks.

The `bert-base-multilingual-cased` model is part of a family of BERT models, including the `bert-base-multilingual-uncased`, `bert-base-cased`, and `bert-base-uncased` variants. These models differ in the language(s) they were trained on and whether they preserve case distinctions.

## Model inputs and outputs

### Inputs
- **Text**: The model takes in raw text as input, which is tokenized and converted to token IDs that the model can process.

### Outputs
- **Masked token predictions**: The model can be used to predict the masked tokens in an input sequence.
- **Next sentence prediction**: The model can classify whether two input sentences were originally adjacent in the training data.
- **Contextual embeddings**: The model can produce contextual embeddings for each token in the input, which can be used as features for downstream tasks.

## Capabilities

The `bert-base-multilingual-cased` model is capable of understanding text in over 100 languages, making it useful for a wide range of multilingual applications. It can be used for tasks such as text classification, question answering, and named entity recognition, among others.

One key capability of this model is its ability to capture the nuanced meanings of words by considering the full context of a sentence, rather than just looking at individual words. This allows it to better understand the semantics of language compared to more traditional approaches.

## What can I use it for?

The `bert-base-multilingual-cased` model is primarily intended to be fine-tuned on downstream tasks, rather than used directly for tasks like text generation. You can find fine-tuned versions of this model on the [Hugging Face Model Hub](https://huggingface.co/models?filter=bert) for a variety of tasks that may be of interest.

Some potential use cases for this model include:

- **Multilingual text classification**: Classifying documents or passages of text in multiple languages.
- **Multilingual question answering**: Answering questions based on provided context, in multiple languages.
- **Multilingual named entity recognition**: Identifying and extracting named entities (e.g., people, organizations, locations) in text across languages.

## Things to try

One interesting thing to try with the `bert-base-multilingual-cased` model is to explore how its performance varies across different languages. Since it was trained on a diverse set of languages, it may exhibit varying levels of capability depending on the specific language and task.

Another interesting experiment would be to compare the model's performance to the `bert-base-multilingual-uncased` variant, which does not preserve case distinctions. This could provide insights into how important case information is for certain multilingual language tasks.

Overall, the `bert-base-multilingual-cased` model is a powerful multilingual language model that can be leveraged for a wide range of applications across many languages.

[](#bert-base-model-cased)BERT base model (cased)
=================================================

Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is case-sensitive: it distinguishes between english and English.

Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

[](#model-description)Model description
---------------------------------------

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

*   Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
*   Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs.

[](#intended-uses--limitations)Intended uses & limitations
----------------------------------------------------------

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at a model like GPT2.

### [](#how-to-use)How to use

You can use this model directly with a pipeline for masked language modeling:

    >>> from transformers import pipeline
    >>> unmasker = pipeline('fill-mask', model='bert-base-cased')
    >>> unmasker("Hello I'm a [MASK] model.")
    
    [{'sequence': "[CLS] Hello I'm a fashion model. [SEP]",
      'score': 0.09019174426794052,
      'token': 4633,
      'token_str': 'fashion'},
     {'sequence': "[CLS] Hello I'm a new model. [SEP]",
      'score': 0.06349995732307434,
      'token': 1207,
      'token_str': 'new'},
     {'sequence': "[CLS] Hello I'm a male model. [SEP]",
      'score': 0.06228214129805565,
      'token': 2581,
      'token_str': 'male'},
     {'sequence': "[CLS] Hello I'm a professional model. [SEP]",
      'score': 0.0441727414727211,
      'token': 1848,
      'token_str': 'professional'},
     {'sequence': "[CLS] Hello I'm a super model. [SEP]",
      'score': 0.03326151892542839,
      'token': 7688,
      'token_str': 'super'}]
    

Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
    model = BertModel.from_pretrained("bert-base-cased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

and in TensorFlow:

    from transformers import BertTokenizer, TFBertModel
    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
    model = TFBertModel.from_pretrained("bert-base-cased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    

### [](#limitations-and-bias)Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions:

    >>> from transformers import pipeline
    >>> unmasker = pipeline('fill-mask', model='bert-base-cased')
    >>> unmasker("The man worked as a [MASK].")
    
    [{'sequence': '[CLS] The man worked as a lawyer. [SEP]',
      'score': 0.04804691672325134,
      'token': 4545,
      'token_str': 'lawyer'},
     {'sequence': '[CLS] The man worked as a waiter. [SEP]',
      'score': 0.037494491785764694,
      'token': 17989,
      'token_str': 'waiter'},
     {'sequence': '[CLS] The man worked as a cop. [SEP]',
      'score': 0.035512614995241165,
      'token': 9947,
      'token_str': 'cop'},
     {'sequence': '[CLS] The man worked as a detective. [SEP]',
      'score': 0.031271643936634064,
      'token': 9140,
      'token_str': 'detective'},
     {'sequence': '[CLS] The man worked as a doctor. [SEP]',
      'score': 0.027423162013292313,
      'token': 3995,
      'token_str': 'doctor'}]
    
    >>> unmasker("The woman worked as a [MASK].")
    
    [{'sequence': '[CLS] The woman worked as a nurse. [SEP]',
      'score': 0.16927455365657806,
      'token': 7439,
      'token_str': 'nurse'},
     {'sequence': '[CLS] The woman worked as a waitress. [SEP]',
      'score': 0.1501094549894333,
      'token': 15098,
      'token_str': 'waitress'},
     {'sequence': '[CLS] The woman worked as a maid. [SEP]',
      'score': 0.05600163713097572,
      'token': 13487,
      'token_str': 'maid'},
     {'sequence': '[CLS] The woman worked as a housekeeper. [SEP]',
      'score': 0.04838843643665314,
      'token': 26458,
      'token_str': 'housekeeper'},
     {'sequence': '[CLS] The woman worked as a cook. [SEP]',
      'score': 0.029980547726154327,
      'token': 9834,
      'token_str': 'cook'}]
    

This bias will also affect all fine-tuned versions of this model.

[](#training-data)Training data
-------------------------------

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers).

[](#training-procedure)Training procedure
-----------------------------------------

### [](#preprocessing)Preprocessing

The texts are tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form:

    [CLS] Sentence A [SEP] Sentence B [SEP]
    

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; in the other cases, sentence B is another random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the two "sentences" have a combined length of less than 512 tokens.
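
The pairing rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the original preprocessing code: the function name `make_nsp_example`, the toy corpus, and the choice to draw the negative segment from the same list are all assumptions made for the example.

```python
import random


def make_nsp_example(segments, index, rng):
    """Build one '[CLS] A [SEP] B [SEP]' input and its next-sentence label.

    `segments` is a list of token lists and `index` must not be the last position.
    """
    segment_a = segments[index]
    if rng.random() < 0.5:
        # Positive case: B is the segment that really follows A.
        segment_b, is_next = segments[index + 1], True
    else:
        # Negative case (assumption): B is drawn at random from the other segments.
        candidates = [i for i in range(len(segments)) if i not in (index, index + 1)]
        segment_b, is_next = segments[rng.choice(candidates)], False
    tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b + ["[SEP]"]
    # Segment ids: 0 for [CLS], A and the first [SEP]; 1 for B and the final [SEP].
    segment_ids = [0] * (len(segment_a) + 2) + [1] * (len(segment_b) + 1)
    return tokens, segment_ids, is_next


corpus = [["the", "cat", "sat"], ["it", "purred"], ["stocks", "fell"], ["rain", "is", "due"]]
tokens, segment_ids, is_next = make_nsp_example(corpus, 0, random.Random(0))
print(tokens, segment_ids, is_next)
```

A real pipeline would additionally truncate the pair so that the combined length stays under the 512-token limit.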

The details of the masking procedure for each sentence are the following:

*   15% of the tokens are masked.
*   In 80% of the cases, the masked tokens are replaced by `[MASK]`.
*   In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
*   In the 10% remaining cases, the masked tokens are left as is.
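
The masking rules above can be sketched as follows. This is a simplified illustration under stated assumptions: each token is selected independently with probability 0.15 (so the overall rate is only approximately 15%), and the names `mask_tokens`, `SPECIAL_TOKENS` and the toy vocabulary are hypothetical.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}  # assumed to be excluded from masking


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = token  # the model has to recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"  # 80%: replace with the mask token
        elif roll < 0.9:
            # 10%: replace with a random token different from the original
            corrupted[i] = rng.choice([t for t in vocab if t != token])
        # remaining 10%: leave the token as is
    return corrupted, labels


toy_vocab = ["the", "man", "worked", "as", "a", "lawyer", "nurse"]
example = ["[CLS]", "the", "man", "worked", "as", "a", "lawyer", "[SEP]"]
print(mask_tokens(example, toy_vocab, random.Random(0)))
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on `[MASK]` always marking the positions it has to predict.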

### [](#pretraining)Pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.
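
The schedule described above can be written as a small function. This is a sketch under assumptions that the text does not spell out: the warmup is taken to be linear from zero, the decay is taken to start at the end of warmup and reach zero at the final step, and the shorter sequences are taken to come first. The names `learning_rate` and `sequence_length` are illustrative.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 1_000_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay (assumed to end at zero)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, remaining)


def sequence_length(step):
    """128 tokens for the first 90% of the steps, 512 afterwards (ordering assumed)."""
    return 128 if step < 900_000 else 512


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, learning_rate(step), sequence_length(step))
```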

[](#evaluation-results)Evaluation results
-----------------------------------------

When fine-tuned on downstream tasks, this model achieves the following results:

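GLUE test results:

| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|:----:|:-----------:|:---:|:----:|:-----:|:----:|:-----:|:----:|:---:|:-------:|
|      | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |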

### [](#bibtex-entry-and-citation-info)BibTeX entry and citation info

    @article{DBLP:journals/corr/abs-1810-04805,
      author    = {Jacob Devlin and
                   Ming{-}Wei Chang and
                   Kenton Lee and
                   Kristina Toutanova},
      title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
                   Understanding},
      journal   = {CoRR},
      volume    = {abs/1810.04805},
      year      = {2018},
      url       = {http://arxiv.org/abs/1810.04805},
      archivePrefix = {arXiv},
      eprint    = {1810.04805},
      timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
      biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    


## Model overview

The `bert-base-cased` model is a base-sized BERT model that has been pre-trained on a large corpus of English text using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is case-sensitive, meaning it can distinguish between words like "english" and "English". 

The BERT model learns a bidirectional representation of text by randomly masking 15% of the words in the input and then training the model to predict those masked words. This is different from traditional language models that process text sequentially. By learning to predict masked words in their full context, BERT can capture deeper semantic relationships in the text.

Compared to similar models like [`bert-base-uncased`](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert), the `bert-base-cased` model preserves capitalization information, which can be useful for tasks like named entity recognition. The [`distilbert-base-uncased`](https://aimodels.fyi/models/huggingFace/distilbert-base-uncased-distilbert) model is a compressed, faster version of BERT that was trained to mimic the behavior of the original BERT base model. The [`xlm-roberta-base`](https://aimodels.fyi/models/huggingFace/xlm-roberta-base-facebookai) model is a multilingual version of RoBERTa, capable of understanding 100 different languages.

## Model inputs and outputs

### Inputs
- **Text**: The model takes raw text as input, which is tokenized and converted to token IDs that the model can process.
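
To make the tokenization step more concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting for a single word. The toy vocabulary and the function name `wordpiece` are assumptions for illustration; the real tokenizer also handles punctuation, whitespace and a much larger vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", toy_vocab))    # ['play', '##ing']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```

Each resulting piece is then looked up in the vocabulary to obtain the token ID that the model consumes.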

### Outputs
- **Masked word predictions**: When used for masked language modeling, the model outputs probability distributions over the vocabulary for each masked token in the input.
- **Sequence classifications**: When fine-tuned on downstream tasks, the model can output classifications for the entire input sequence, such as sentiment analysis or text categorization.
- **Token classifications**: The model can also be fine-tuned to output classifications for individual tokens in the sequence, such as named entity recognition.
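
As a rough illustration of the first output type, the snippet below turns a vector of scores for one masked position into a probability distribution and keeps the top candidates. The five-word vocabulary and the logit values are made up for the example; a real model scores every entry of its vocabulary.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, vocab, k=3):
    """Return the k most probable (token, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


toy_vocab = ["fashion", "new", "male", "professional", "super"]
toy_logits = [2.0, 1.6, 1.5, 1.2, 0.9]  # hypothetical scores for one [MASK] position
for token, prob in top_k(toy_logits, toy_vocab):
    print(f"{token}: {prob:.3f}")
```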

## Capabilities

The `bert-base-cased` model is particularly well-suited for tasks that require understanding the full context of a piece of text, such as sentiment analysis, text classification, and question answering. Its bidirectional nature allows it to capture nuanced relationships between words that sequential models may miss.

For example, the model can be used to classify whether a restaurant review is positive or negative, even if the review contains negation (e.g. "The food was not good"). By considering the entire context of the sentence, the model can understand that the reviewer is expressing a negative sentiment.

## What can I use it for?

The `bert-base-cased` model is a versatile base model that can be fine-tuned for a wide variety of natural language processing tasks. Some potential use cases include:

- **Text classification**: Classify documents, emails, or social media posts into categories like sentiment, topic, or intent.
- **Named entity recognition**: Identify and extract entities like people, organizations, and locations from text.
- [**Question answering**](https://aimodels.fyi/models/huggingFace/bert-base-cased-google-bert): Build a system that can answer questions by understanding the context of a given passage.
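- **Extractive summarization**: Select the most informative sentences from long-form text. As an encoder-only model, BERT is not suited to generating new summary text on its own.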

Companies could leverage the model's capabilities to build intelligent chatbots, content moderation systems, or automated customer service applications.

## Things to try

One interesting aspect of the `bert-base-cased` model is its ability to capture nuanced relationships between words, even across long-range dependencies. For example, try using the model to classify the sentiment of reviews that contain negation or sarcasm. You may find that it performs better than simpler models that only consider the individual words in isolation.

Another interesting experiment would be to compare the performance of the `bert-base-cased` model to the `bert-base-uncased` model on tasks where capitalization is important, such as named entity recognition. The cased model may be better able to distinguish between proper nouns and common nouns, leading to improved performance.

[](#bert-large-model-uncased-whole-word-masking-finetuned-on-squad)BERT large model (uncased) whole word masking finetuned on SQuAD
===================================================================================================================================

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is uncased: it does not make a difference between english and English.

Unlike other BERT models, this model was trained with a new technique: Whole Word Masking. In this case, all of the tokens corresponding to a word are masked at once. The overall masking rate remains the same.

The training is identical -- each masked WordPiece token is predicted independently.
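
A minimal sketch of the idea, assuming WordPiece tokens where continuation pieces start with `##`: pieces are first grouped into words, and a masking decision is then made once per word. The function names and the independent per-word draw are assumptions for illustration, and the sketch always substitutes `[MASK]` instead of applying the full 80/10/10 replacement rule.

```python
import random


def group_word_pieces(tokens):
    """Group token indices so that each group covers one whole word."""
    groups = []
    for i, token in enumerate(tokens):
        if token.startswith("##") and groups:
            groups[-1].append(i)  # continuation piece joins the current word
        else:
            groups.append([i])    # a new word starts here
    return groups


def whole_word_mask(tokens, rng, mask_prob=0.15):
    """Mask every piece of a word together, or none of them."""
    masked = list(tokens)
    for group in group_word_pieces(tokens):
        if rng.random() < mask_prob:
            for i in group:
                masked[i] = "[MASK]"
    return masked


tokens = ["the", "man", "put", "his", "basket", "on", "phil", "##am", "##mon", "'", "s", "head"]
print(group_word_pieces(tokens))
print(whole_word_mask(tokens, random.Random(0), mask_prob=0.3))
```

With this grouping, `phil`, `##am` and `##mon` are either all masked or all left visible, while each masked piece is still predicted as a separate token.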

After pre-training, this model was fine-tuned on the SQuAD dataset with one of our fine-tuning scripts. See below for more information regarding this fine-tuning.

Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

[](#model-description)Model description
---------------------------------------

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

*   Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
*   Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs.

This model has the following configuration:

*   24-layer
*   1024 hidden dimension
*   16 attention heads
*   336M parameters.
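
The parameter count can be roughly reproduced from the configuration with some arithmetic. The breakdown below is an assumption about the standard BERT encoder layout (a 30,522-entry vocabulary, 512 positions, 2 segment types, a feed-forward size of four times the hidden size, and a pooler layer); none of these values are stated in this card, and the function name is illustrative.

```python
def bert_encoder_parameters(layers, hidden, vocab_size=30_522, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT-style encoder (assumed standard layout)."""
    ffn = 4 * hidden            # assumed intermediate (feed-forward) size
    layer_norm = 2 * hidden     # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias, plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers with biases, plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


print(f"{bert_encoder_parameters(24, 1024):,}")  # 335,141,888
```

Under these assumptions the total lands close to the 336M quoted above. The number of attention heads does not enter the count, because the heads split the hidden dimension rather than adding weights.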

[](#intended-uses--limitations)Intended uses & limitations
----------------------------------------------------------

This model should be used as a question-answering model. You may use it in a question answering pipeline, or use it to output raw results given a query and a context. You may see other use cases in the [task summary](https://huggingface.co/transformers/task_summary.html#extractive-question-answering) of the transformers documentation.

[](#training-data)Training data
-------------------------------

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers).

[](#training-procedure)Training procedure
-----------------------------------------

### [](#preprocessing)Preprocessing

The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form:

    [CLS] Sentence A [SEP] Sentence B [SEP]
    

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; in the other cases, sentence B is another random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the two "sentences" have a combined length of less than 512 tokens.

The details of the masking procedure for each sentence are the following:

*   15% of the tokens are masked.
*   In 80% of the cases, the masked tokens are replaced by `[MASK]`.
*   In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
*   In the 10% remaining cases, the masked tokens are left as is.

### [](#pretraining)Pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.

### [](#fine-tuning)Fine-tuning

After pre-training, this model was fine-tuned on the SQuAD dataset with one of our fine-tuning scripts. In order to reproduce the training, you may use the following command:

    python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_qa.py \
        --model_name_or_path bert-large-uncased-whole-word-masking \
        --dataset_name squad \
        --do_train \
        --do_eval \
        --learning_rate 3e-5 \
        --num_train_epochs 2 \
        --max_seq_length 384 \
        --doc_stride 128 \
        --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
        --per_device_eval_batch_size=3 \
        --per_device_train_batch_size=3
    

[](#evaluation-results)Evaluation results
-----------------------------------------

The results obtained are the following:

    f1 = 93.15
    exact_match = 86.91
    

### [](#bibtex-entry-and-citation-info)BibTeX entry and citation info

    @article{DBLP:journals/corr/abs-1810-04805,
      author    = {Jacob Devlin and
                   Ming{-}Wei Chang and
                   Kenton Lee and
                   Kristina Toutanova},
      title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
                   Understanding},
      journal   = {CoRR},
      volume    = {abs/1810.04805},
      year      = {2018},
      url       = {http://arxiv.org/abs/1810.04805},
      archivePrefix = {arXiv},
      eprint    = {1810.04805},
      timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
      biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

## Model overview

The `bert-large-uncased-whole-word-masking-finetuned-squad` model is a version of the [BERT](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert) large model that has been fine-tuned on the SQuAD dataset. BERT is a transformers model that was pretrained on a large corpus of English data using a masked language modeling (MLM) objective. This means the model was trained to predict masked words in a sentence, allowing it to learn a bidirectional representation of the language. 

The key difference for this specific model is that it was trained using "whole word masking" instead of the standard subword masking. In whole word masking, all tokens corresponding to a single word are masked together, rather than masking individual subwords. This change was found to improve the model's performance on certain tasks.

After pretraining, this model was further fine-tuned on the SQuAD question-answering dataset. SQuAD contains reading comprehension questions based on Wikipedia articles, so this additional fine-tuning allows the model to excel at question-answering tasks.

## Model inputs and outputs

### Inputs
- **Text**: The model takes text as input, which can be a single passage, or a pair of sentences (e.g. a question and a passage containing the answer).

### Outputs
- **Predicted answer**: For question-answering tasks, the model outputs the text span from the input passage that answers the given question.
- **Confidence score**: The model also provides a confidence score for the predicted answer.
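
One common way to turn per-token start and end scores into an answer span and a confidence value is sketched below. It is an illustration under assumptions, not necessarily how any particular pipeline computes its score: the logits are made up, the names `best_span` and `max_answer_len` are hypothetical, and the confidence is taken to be the product of the start and end probabilities.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) for the most probable span with start <= end."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_answer_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best


context = ["paris", "is", "the", "capital", "of", "france"]
start_logits = [4.0, 0.1, 0.2, 0.3, 0.1, 1.0]  # hypothetical model outputs
end_logits = [3.5, 0.2, 0.1, 0.4, 0.1, 1.2]
start, end, score = best_span(start_logits, end_logits)
print(context[start:end + 1], round(score, 3))
```

Restricting the search to spans with `start <= end` and a bounded length keeps the answer a contiguous, reasonably short piece of the passage.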

## Capabilities

The `bert-large-uncased-whole-word-masking-finetuned-squad` model is highly capable at question-answering tasks, thanks to its pretraining on large text corpora and fine-tuning on the SQuAD dataset. It can accurately extract relevant answer spans from input passages given natural language questions.

For example, given the question "What is the capital of France?" and a passage about European countries, the model would correctly identify "Paris" as the answer. Or for a more complex question like "When was the first mouse invented?", the model could locate the relevant information in a passage and provide the appropriate answer.

## What can I use it for?

This model is well-suited for building question-answering applications, such as chatbots, virtual assistants, or knowledge retrieval systems. By fine-tuning the model on domain-specific data, you can create specialized question-answering capabilities tailored to your use case.

For example, you could fine-tune the model on a corpus of medical literature to build a virtual assistant that can answer questions about health and treatments. Or fine-tune it on technical documentation to create a tool that helps users find answers to their questions about a product or service.

## Things to try

One interesting aspect of this model is its use of whole word masking during pretraining. This technique has been shown to improve the model's understanding of word relationships and its ability to reason about complete concepts, rather than just individual subwords.

To see this in action, you could try providing the model with questions that require some level of reasoning or common sense, beyond just literal text matching. See how the model performs on questions that involve inference, analogy, or understanding broader context.

Additionally, you could experiment with fine-tuning the model on different question-answering datasets, or even combine it with other techniques like data augmentation, to further enhance its capabilities for your specific use case.

[](#bert-large-model-uncased)BERT large model (uncased)
=======================================================

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is uncased: it does not make a difference between english and English.

Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

[](#model-description)Model description
---------------------------------------

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

*   Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
*   Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs.

This model has the following configuration:

*   24-layer
*   1024 hidden dimension
*   16 attention heads
*   336M parameters.

[](#intended-uses--limitations)Intended uses & limitations
----------------------------------------------------------

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at a model like GPT2.

### [](#how-to-use)How to use

You can use this model directly with a pipeline for masked language modeling:

    >>> from transformers import pipeline
    >>> unmasker = pipeline('fill-mask', model='bert-large-uncased')
    >>> unmasker("Hello I'm a [MASK] model.")
    [{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
      'score': 0.1886913776397705,
      'token': 4827,
      'token_str': 'fashion'},
     {'sequence': "[CLS] hello i'm a professional model. [SEP]",
      'score': 0.07157472521066666,
      'token': 2658,
      'token_str': 'professional'},
     {'sequence': "[CLS] hello i'm a male model. [SEP]",
      'score': 0.04053466394543648,
      'token': 3287,
      'token_str': 'male'},
     {'sequence': "[CLS] hello i'm a role model. [SEP]",
      'score': 0.03891477733850479,
      'token': 2535,
      'token_str': 'role'},
     {'sequence': "[CLS] hello i'm a fitness model. [SEP]",
      'score': 0.03038121573626995,
      'token': 10516,
      'token_str': 'fitness'}]
    

Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
    model = BertModel.from_pretrained("bert-large-uncased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

and in TensorFlow:

    from transformers import BertTokenizer, TFBertModel
    tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
    model = TFBertModel.from_pretrained("bert-large-uncased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    

### [](#limitations-and-bias)Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions:

    >>> from transformers import pipeline
    >>> unmasker = pipeline('fill-mask', model='bert-large-uncased')
    >>> unmasker("The man worked as a [MASK].")
    
    [{'sequence': '[CLS] the man worked as a bartender. [SEP]',
      'score': 0.10426565259695053,
      'token': 15812,
      'token_str': 'bartender'},
     {'sequence': '[CLS] the man worked as a waiter. [SEP]',
      'score': 0.10232779383659363,
      'token': 15610,
      'token_str': 'waiter'},
     {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
      'score': 0.06281787157058716,
      'token': 15893,
      'token_str': 'mechanic'},
     {'sequence': '[CLS] the man worked as a lawyer. [SEP]',
      'score': 0.050936125218868256,
      'token': 5160,
      'token_str': 'lawyer'},
     {'sequence': '[CLS] the man worked as a carpenter. [SEP]',
      'score': 0.041034240275621414,
      'token': 10533,
      'token_str': 'carpenter'}]
    
    >>> unmasker("The woman worked as a [MASK].")
    
    [{'sequence': '[CLS] the woman worked as a waitress. [SEP]',
      'score': 0.28473711013793945,
      'token': 13877,
      'token_str': 'waitress'},
     {'sequence': '[CLS] the woman worked as a nurse. [SEP]',
      'score': 0.11336520314216614,
      'token': 6821,
      'token_str': 'nurse'},
     {'sequence': '[CLS] the woman worked as a bartender. [SEP]',
      'score': 0.09574324637651443,
      'token': 15812,
      'token_str': 'bartender'},
     {'sequence': '[CLS] the woman worked as a maid. [SEP]',
      'score': 0.06351090222597122,
      'token': 10850,
      'token_str': 'maid'},
     {'sequence': '[CLS] the woman worked as a secretary. [SEP]',
      'score': 0.048970773816108704,
      'token': 3187,
      'token_str': 'secretary'}]
    

This bias will also affect all fine-tuned versions of this model.

[](#training-data)Training data
-------------------------------

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers).

[](#training-procedure)Training procedure
-----------------------------------------

### [](#preprocessing)Preprocessing

The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form:

    [CLS] Sentence A [SEP] Sentence B [SEP]
    

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; in the other cases, sentence B is another random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the two "sentences" have a combined length of less than 512 tokens.

The details of the masking procedure for each sentence are the following:

*   15% of the tokens are masked.
*   In 80% of the cases, the masked tokens are replaced by `[MASK]`.
*   In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
*   In the 10% remaining cases, the masked tokens are left as is.

### [](#pretraining)Pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.

[](#evaluation-results)Evaluation results
-----------------------------------------

When fine-tuned on downstream tasks, this model achieves the following results:

| Model | SQUAD 1.1 F1/EM | Multi NLI Accuracy |
|:------|:---------------:|:------------------:|
| BERT-Large, Uncased (Original) | 91.0/84.3 | 86.05 |

### [](#bibtex-entry-and-citation-info)BibTeX entry and citation info

    @article{DBLP:journals/corr/abs-1810-04805,
      author    = {Jacob Devlin and
                   Ming{-}Wei Chang and
                   Kenton Lee and
                   Kristina Toutanova},
      title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
                   Understanding},
      journal   = {CoRR},
      volume    = {abs/1810.04805},
      year      = {2018},
      url       = {http://arxiv.org/abs/1810.04805},
      archivePrefix = {arXiv},
      eprint    = {1810.04805},
      timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
      biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

## Model overview

The `bert-large-uncased` model is a large, 24-layer BERT model that was pre-trained on a large corpus of English data using a masked language modeling (MLM) objective. Unlike the [BERT base model](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert), this larger model has 1024 hidden dimensions and 16 attention heads, for a total of 336M parameters. 

BERT is a transformer-based model that learns a deep, bidirectional representation of language by predicting masked tokens in an input sentence. During pre-training, the model also learns to predict whether two sentences were originally consecutive or not. This allows BERT to capture rich contextual information that can be leveraged for downstream tasks.

## Model inputs and outputs

### Inputs
- **Text**: BERT models accept text as input, with the input typically formatted as a sequence of tokens separated by special tokens like `[CLS]` and `[SEP]`.
- **Masked tokens**: BERT models are designed to handle input with randomly masked tokens, which the model must then predict.

### Outputs
- **Predicted masked tokens**: Given an input sequence with masked tokens, BERT outputs a probability distribution over the vocabulary for each masked position, allowing you to predict the missing words.
- **Sequence representations**: BERT can also be used to extract contextual representations of the input sequence, which can be useful features for downstream tasks like classification or question answering.
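
To make the "probability distribution over the vocabulary" concrete, here is a minimal sketch of how raw scores (logits) for one masked position turn into ranked predictions. The five-word vocabulary and the logit values are made up for illustration; a real model scores its full WordPiece vocabulary.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(vocab, logits, k=3):
    """Return the k most probable (token, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and made-up logits for a single [MASK] position.
toy_vocab = ["top", "fashion", "good", "new", "great"]
toy_logits = [2.0, 1.8, 0.5, 0.4, 0.1]
predictions = top_k(toy_vocab, toy_logits, k=2)
print(predictions)
```

The `score` field returned by a fill-mask pipeline plays the same role as the probabilities computed here.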

## Capabilities

The `bert-large-uncased` model is a powerful language understanding model that can be fine-tuned on a wide range of NLP tasks. It has shown strong performance on benchmarks like GLUE, outperforming many previous state-of-the-art models. Some key capabilities of this model include:

- **Masked language modeling**: The model can accurately predict masked tokens in an input sequence, demonstrating its deep understanding of language.
- **Sentence-level understanding**: The model can reason about the relationship between two sentences, as evidenced by its strong performance on the next sentence prediction task during pre-training.
- **Transfer learning**: The rich contextual representations learned by BERT can be effectively leveraged for fine-tuning on downstream tasks, even with relatively small amounts of labeled data.

## What can I use it for?

The `bert-large-uncased` model is primarily intended to be fine-tuned on a wide variety of downstream NLP tasks, such as:

- **Text classification**: Classifying the sentiment, topic, or other attributes of a piece of text. For example, you could fine-tune the model on a dataset of product reviews and use it to predict the rating of a new review.
- **Question answering**: Extracting the answer to a question from a given context passage. You could fine-tune the model on a dataset like SQuAD and use it to answer questions about a document.
- **Named entity recognition**: Identifying and classifying named entities (e.g. people, organizations, locations) in text. This could be useful for tasks like information extraction.

To use the model for these tasks, you would typically fine-tune the pre-trained BERT weights on your specific dataset and task using one of the many [available fine-tuning examples](https://huggingface.co/transformers/task_summary.html).

## Things to try

One interesting aspect of the `bert-large-uncased` model is its extra capacity: 24 layers, compared with 12 in the base model. The maximum input length is still 512 tokens, so longer documents need to be truncated or split into chunks, but the deeper network can help on tasks that require reasoning over a full passage, such as document classification or multi-sentence question answering.

You could compare its performance on such tasks to the [BERT base model](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert) or other large language models. Additionally, you could explore ways to reduce the model's size and inference time, such as distillation or quantization, without sacrificing too much performance.

Overall, the `bert-large-uncased` model provides a powerful starting point for a wide range of natural language processing applications.

[](#bert-multilingual-base-model-uncased)BERT multilingual base model (uncased)
===============================================================================

Pretrained model on the 102 languages with the largest Wikipedias using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is uncased: it does not make a difference between english and English.

Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

[](#model-description)Model description
---------------------------------------

BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

*   Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
*   Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.

This way, the model learns an inner representation of the languages in the training set that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs.

[](#intended-uses--limitations)Intended uses & limitations
----------------------------------------------------------

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at a model like GPT2.

### [](#how-to-use)How to use

You can use this model directly with a pipeline for masked language modeling:

    >>> from transformers import pipeline
    >>> unmasker = pipeline('fill-mask', model='bert-base-multilingual-uncased')
    >>> unmasker("Hello I'm a [MASK] model.")
    
    [{'sequence': "[CLS] hello i'm a top model. [SEP]",
      'score': 0.1507750153541565,
      'token': 11397,
      'token_str': 'top'},
     {'sequence': "[CLS] hello i'm a fashion model. [SEP]",
      'score': 0.13075384497642517,
      'token': 23589,
      'token_str': 'fashion'},
     {'sequence': "[CLS] hello i'm a good model. [SEP]",
      'score': 0.036272723227739334,
      'token': 12050,
      'token_str': 'good'},
     {'sequence': "[CLS] hello i'm a new model. [SEP]",
      'score': 0.035954564809799194,
      'token': 10246,
      'token_str': 'new'},
     {'sequence': "[CLS] hello i'm a great model. [SEP]",
      'score': 0.028643041849136353,
      'token': 11838,
      'token_str': 'great'}]
    

Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')
    model = BertModel.from_pretrained("bert-base-multilingual-uncased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

and in TensorFlow:

    from transformers import BertTokenizer, TFBertModel
    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')
    model = TFBertModel.from_pretrained("bert-base-multilingual-uncased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    

### [](#limitations-and-bias)Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions:

    >>> from transformers import pipeline
    >>> unmasker = pipeline('fill-mask', model='bert-base-multilingual-uncased')
    >>> unmasker("The man worked as a [MASK].")
    
    [{'sequence': '[CLS] the man worked as a teacher. [SEP]',
      'score': 0.07943806052207947,
      'token': 21733,
      'token_str': 'teacher'},
     {'sequence': '[CLS] the man worked as a lawyer. [SEP]',
      'score': 0.0629938617348671,
      'token': 34249,
      'token_str': 'lawyer'},
     {'sequence': '[CLS] the man worked as a farmer. [SEP]',
      'score': 0.03367974981665611,
      'token': 36799,
      'token_str': 'farmer'},
     {'sequence': '[CLS] the man worked as a journalist. [SEP]',
      'score': 0.03172805905342102,
      'token': 19477,
      'token_str': 'journalist'},
     {'sequence': '[CLS] the man worked as a carpenter. [SEP]',
      'score': 0.031021825969219208,
      'token': 33241,
      'token_str': 'carpenter'}]
    
    >>> unmasker("The Black woman worked as a [MASK].")
    
    [{'sequence': '[CLS] the black woman worked as a nurse. [SEP]',
      'score': 0.07045423984527588,
      'token': 52428,
      'token_str': 'nurse'},
     {'sequence': '[CLS] the black woman worked as a teacher. [SEP]',
      'score': 0.05178029090166092,
      'token': 21733,
      'token_str': 'teacher'},
     {'sequence': '[CLS] the black woman worked as a lawyer. [SEP]',
      'score': 0.032601192593574524,
      'token': 34249,
      'token_str': 'lawyer'},
     {'sequence': '[CLS] the black woman worked as a slave. [SEP]',
      'score': 0.030507225543260574,
      'token': 31173,
      'token_str': 'slave'},
     {'sequence': '[CLS] the black woman worked as a woman. [SEP]',
      'score': 0.027691684663295746,
      'token': 14050,
      'token_str': 'woman'}]
    

This bias will also affect all fine-tuned versions of this model.

[](#training-data)Training data
-------------------------------

The BERT model was pretrained on the 102 languages with the largest Wikipedias. You can find the complete list [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).

[](#training-procedure)Training procedure
-----------------------------------------

### [](#preprocessing)Preprocessing

The texts are lowercased and tokenized using WordPiece with a shared vocabulary of 110,000 tokens. Languages with larger Wikipedias are under-sampled and lower-resource languages are over-sampled. Because Chinese, Japanese Kanji and Korean Hanja are written without spaces between words, spaces are added around every character in the CJK Unicode block before tokenization.

The inputs of the model are then of the form:

    [CLS] Sentence A [SEP] Sentence B [SEP]
    

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; otherwise, sentence B is a random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the two "sentences" have a combined length of less than 512 tokens.
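
The pairing rule can be sketched in a few lines of plain Python. This is a toy illustration, not the original data pipeline: the function name is hypothetical, the "corpus" is a short list of strings, and the random segment is drawn from the same list, whereas a real pipeline would typically sample from a different document and also enforce the 512-token limit.

```python
import random


def make_nsp_example(segments, index, rng):
    """Build one '[CLS] A [SEP] B [SEP]' input and its next-sentence label.

    segments: list of text spans in corpus order.
    index: position of segment A (must not be the last segment).
    """
    segment_a = segments[index]
    if rng.random() < 0.5:
        # Positive example: B is the span that really follows A.
        segment_b, is_next = segments[index + 1], True
    else:
        # Negative example: any span other than A and its true continuation.
        candidates = [i for i in range(len(segments)) if i not in (index, index + 1)]
        segment_b, is_next = segments[rng.choice(candidates)], False
    return f"[CLS] {segment_a} [SEP] {segment_b} [SEP]", is_next


corpus = ["the cat sat", "on the mat", "it was warm", "then it slept"]
text, label = make_nsp_example(corpus, 0, random.Random(0))
print(text, label)
```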

The details of the masking procedure for each sentence are the following:

*   15% of the tokens are masked.
*   In 80% of the cases, the masked tokens are replaced by `[MASK]`.
*   In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
*   In the 10% remaining cases, the masked tokens are left as is.
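
The masking rules above can be sketched as follows. This is a simplified illustration with a hypothetical function name and a toy vocabulary: each token is selected independently with probability `mask_prob`, whereas the original implementation selects a fixed share of positions per sequence. The vocabulary must contain at least two distinct tokens.

```python
import random


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Apply the 80/10/10 masking rule; return (corrupted tokens, labels).

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue  # special tokens and unselected positions stay untouched
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            # Random replacement, drawn from tokens other than the original.
            corrupted[i] = rng.choice([v for v in vocab if v != token])
        # Otherwise the original token is kept, but it is still a prediction target.
    return corrupted, labels


toy_vocab = ["hello", "world", "model", "text"]
tokens = ["[CLS]", "hello", "world", "[SEP]", "model", "text", "[SEP]"]
corrupted, labels = mask_tokens(tokens, toy_vocab, random.Random(0), mask_prob=0.5)
print(corrupted)
print(labels)
```

Keeping some selected tokens unchanged means the model cannot assume that every visible token is correct, so it has to build a useful representation for all positions.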

### [](#bibtex-entry-and-citation-info)BibTeX entry and citation info

    @article{DBLP:journals/corr/abs-1810-04805,
      author    = {Jacob Devlin and
                   Ming{-}Wei Chang and
                   Kenton Lee and
                   Kristina Toutanova},
      title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
                   Understanding},
      journal   = {CoRR},
      volume    = {abs/1810.04805},
      year      = {2018},
      url       = {http://arxiv.org/abs/1810.04805},
      archivePrefix = {arXiv},
      eprint    = {1810.04805},
      timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
      biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

## Model overview

`bert-base-multilingual-uncased` is a BERT model pretrained on the 102 languages with the largest Wikipedias using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is uncased, meaning it does not differentiate between English and english.

Similar models include the [BERT large uncased model](https://aimodels.fyi/models/huggingFace/bert-large-uncased-google-bert), the [BERT base uncased model](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert), and the [BERT base cased model](https://aimodels.fyi/models/huggingFace/bert-base-cased-google-bert). These models vary in size and language coverage, but all use the same self-supervised pretraining approach.

## Model inputs and outputs

### Inputs
- **Text**: The model takes in text as input, which can be a single sentence or a pair of sentences.

### Outputs
- **Masked token predictions**: The model can be used to predict the masked tokens in an input sequence.
- **Next sentence prediction**: The model can also predict whether two input sentences were originally consecutive or not.

## Capabilities

The `bert-base-multilingual-uncased` model is able to understand and represent text from 102 different languages. This makes it a powerful tool for multilingual text processing tasks such as text classification, named entity recognition, and question answering. By leveraging the knowledge learned from a diverse set of languages during pretraining, the model can effectively transfer to downstream tasks in different languages.

## What can I use it for?

You can fine-tune `bert-base-multilingual-uncased` on a wide variety of multilingual NLP tasks, such as:

- **Text classification**: Categorize text into different classes, e.g. sentiment analysis, topic classification.
- **Named entity recognition**: Identify and extract named entities (people, organizations, locations, etc.) from text.
- **Question answering**: Given a question and a passage of text, extract the answer from the passage.
- **Sequence labeling**: Assign a label to each token in a sequence, e.g. part-of-speech tagging, chunking.

See the [model hub](https://huggingface.co/models?filter=bert) to explore fine-tuned versions of the model on specific tasks.

## Things to try

Since `bert-base-multilingual-uncased` is a powerful multilingual model, you can experiment with applying it to a diverse range of multilingual NLP tasks. Try fine-tuning it on your own multilingual datasets or leveraging its capabilities in a multilingual application. Additionally, you can explore how the model's performance varies across different languages and identify any biases or limitations it may have.


[](#german-bert)German BERT
===========================

[![bert_image](https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png)](https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png)

[](#overview)Overview
---------------------

**Language model:** bert-base-cased  
**Language:** German  
**Training data:** Wiki, OpenLegalData, News (~ 12GB)  
**Eval data:** Conll03 (NER), GermEval14 (NER), GermEval18 (Classification), GNAD (Classification)  
**Infrastructure**: 1x TPU v2  
**Published**: Jun 14th, 2019

**Update April 3rd, 2020**: we updated the vocabulary file on deepset's s3 to conform with the default tokenization of punctuation tokens. For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60). If you want to use the old vocab we have also uploaded a ["deepset/bert-base-german-cased-oldvocab"](https://huggingface.co/deepset/bert-base-german-cased-oldvocab) model.

[](#details)Details
-------------------

*   We trained using Google's TensorFlow code on a single cloud TPU v2 with standard settings.
*   We trained 810k steps with a batch size of 1024 for sequence length 128 and 30k steps with sequence length 512. Training took about 9 days.
*   As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
*   We cleaned the data dumps with tailored scripts and segmented sentences with spaCy v2.1. To create TensorFlow records we used the recommended SentencePiece library for creating the WordPiece vocabulary and TensorFlow scripts to convert the text to data usable by BERT.
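
For a sense of scale, the step counts above can be turned into an upper bound on the number of token positions processed. This is back-of-the-envelope arithmetic, not a figure from the authors: padding counts as positions, and the batch size for the 512-token phase is assumed to be the same 1024, which the list above does not state explicitly.

```python
def token_positions(steps, batch_size, seq_len):
    """Upper bound on token positions processed (padding counts as positions)."""
    return steps * batch_size * seq_len


phase_1 = token_positions(steps=810_000, batch_size=1024, seq_len=128)
# Assumption: the 512-token phase reuses batch size 1024; the source does not say.
phase_2 = token_positions(steps=30_000, batch_size=1024, seq_len=512)
share_long = phase_2 / (phase_1 + phase_2)

print(f"phase 1: {phase_1:,} positions")
print(f"phase 2: {phase_2:,} positions")
print(f"share of positions seen at length 512: {share_long:.1%}")
```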

See [https://deepset.ai/german-bert](https://deepset.ai/german-bert) for more details.

[](#hyperparameters)Hyperparameters
-----------------------------------

    batch_size = 1024
    n_steps = 810_000
    max_seq_len = 128 (and 512 later)
    learning_rate = 1e-4
    lr_schedule = LinearWarmup
    num_warmup_steps = 10_000
    

[](#performance)Performance
---------------------------

During training we monitored the loss and evaluated different model checkpoints on the following German datasets:

*   germEval18Fine: Macro f1 score for multiclass sentiment classification
*   germEval18coarse: Macro f1 score for binary sentiment classification
*   germEval14: Seq f1 score for NER (file names deuutf.\*)
*   CONLL03: Seq f1 score for NER
*   10kGNAD: Accuracy for document classification

Even without thorough hyperparameter tuning, we observed quite stable learning especially for our German model. Multiple restarts with different seeds produced quite similar results.

[![performancetable](https://thumb.tildacdn.com/tild3162-6462-4566-b663-376630376138/-/format/webp/Screenshot_from_2020.png)](https://thumb.tildacdn.com/tild3162-6462-4566-b663-376630376138/-/format/webp/Screenshot_from_2020.png)

We further evaluated different points during the 9 days of pre-training and were astonished how fast the model converges to the maximally reachable performance. We ran all 5 downstream tasks on 7 different model checkpoints - taken at 0 up to 840k training steps (x-axis in figure below). Most checkpoints are taken from early training where we expected most performance changes. Surprisingly, even a randomly initialized BERT can be trained only on labeled downstream datasets and reach good performance (blue line, GermEval 2018 Coarse task, 795 kB trainset size).

[![checkpointseval](https://thumb.tildacdn.com/tild6335-3531-4137-b533-313365663435/-/format/webp/deepset_checkpoints.png)](https://thumb.tildacdn.com/tild6335-3531-4137-b533-313365663435/-/format/webp/deepset_checkpoints.png)

[](#authors)Authors
-------------------

*   Branden Chan: `branden.chan [at] deepset.ai`
*   Timo Möller: `timo.moeller [at] deepset.ai`
*   Malte Pietsch: `malte.pietsch [at] deepset.ai`
*   Tanay Soni: `tanay.soni [at] deepset.ai`

[](#about-us)About us
---------------------

[![deepset logo](https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/deepset_logo.png)](https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/deepset_logo.png)

We bring NLP to the industry via open source!  
Our focus: Industry specific language models & large scale QA systems.

Some of our work:

*   [German BERT (aka "bert-base-german-cased")](https://deepset.ai/german-bert)
*   [FARM](https://github.com/deepset-ai/FARM)
*   [Haystack](https://github.com/deepset-ai/haystack/)

Get in touch: [Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Website](https://deepset.ai)

## Model overview

The `bert-base-german-cased` model is a German-language BERT model trained by deepset and listed under the [google-bert](https://aimodels.fyi/creators/huggingFace/google-bert) organization. It uses the BERT base architecture, with two key differences from the English models: it was trained on a German corpus including Wikipedia, news articles, and legal data, and it is a cased model that differentiates between uppercase and lowercase.

Compared to similar models like [bert-base-cased](https://aimodels.fyi/models/huggingFace/bert-base-cased-google-bert) and [bert-base-uncased](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert), the `bert-base-german-cased` model is optimized for German language tasks. It was evaluated on various German datasets like GermEval and CONLL03, showing strong performance on named entity recognition and text classification.

## Model inputs and outputs

### Inputs
- **Text**: The model takes in text as input, either in the form of a single sequence or a pair of sequences.
- **Sequence length**: The model supports variable sequence lengths, with a maximum length of 512 tokens.

### Outputs
- **Token embeddings**: The model outputs a sequence of token embeddings, which can be used as features for downstream tasks.
- **Pooled output**: The model also produces a single embedding representing the entire input sequence, which can be useful for classification tasks.

## Capabilities

The `bert-base-german-cased` model is capable of understanding and processing German text, making it well-suited for a variety of German-language NLP tasks. Some key capabilities include:

- **Named Entity Recognition**: The model can identify and classify named entities like people, organizations, locations, and miscellaneous entities in German text.
- **Text Classification**: The model can be fine-tuned for classification tasks like sentiment analysis or document categorization on German data.
- **Question Answering**: The model can be used as the basis for building German-language question answering systems.

## What can I use it for?

The `bert-base-german-cased` model can be used as a starting point for building a wide range of German-language NLP applications. Some potential use cases include:

- **Content Moderation**: Fine-tune the model for detecting hate speech, offensive language, or other undesirable content in German social media posts or online forums.
- **Intelligent Assistants**: Incorporate the model into a German-language virtual assistant for natural language understanding, such as intent classification. As an encoder-only model, it is not designed for text generation.
- **Automated Summarization**: Fine-tune the model for extractive summarization of German text, such as news articles or research papers, by scoring which sentences to keep.

## Things to try

Some interesting things to try with the `bert-base-german-cased` model include:

- **Evaluating on additional German datasets**: While the model was evaluated on several standard German NLP benchmarks, there may be opportunities to test its performance on other specialized German datasets or real-world applications.
- **Exploring multilingual fine-tuning**: Since the related [bert-base-multilingual-uncased](https://aimodels.fyi/models/huggingFace/bert-base-multilingual-uncased-google-bert) model was trained on 102 languages, it may be interesting to investigate whether combining the German-specific and multilingual models can lead to improved performance.
- **Investigating model interpretability**: As with other BERT-based models, understanding the internal representations and attention patterns of `bert-base-german-cased` could provide insights into how it processes and understands German language.