Multilingual-E5-small
---------------------

[Multilingual E5 Text Embeddings: A Technical Report](https://arxiv.org/pdf/2402.05672). Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei, arXiv 2024

This model has 12 layers and the embedding size is 384.

Usage
-----

Below is an example that encodes queries and passages from the MS-MARCO passage ranking dataset.

    import torch.nn.functional as F
    
    from torch import Tensor
    from transformers import AutoTokenizer, AutoModel
    
    
    def average_pool(last_hidden_states: Tensor,
                     attention_mask: Tensor) -> Tensor:
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    
    
    # Each input text should start with "query: " or "passage: ", even for non-English texts.
    # For tasks other than retrieval, you can simply use the "query: " prefix.
    input_texts = ['query: how much protein should a female eat',
                   'query: ',
                   "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
                   "passage: 1. : : : 1, 2() 3, 4, 2. :1 : : 1, 28, 3,, 4,, 5, 6, 7,"]
    
    tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-small')
    model = AutoModel.from_pretrained('intfloat/multilingual-e5-small')
    
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    
    # normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    scores = (embeddings[:2] @ embeddings[2:].T) * 100
    print(scores.tolist())
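
To make the pooling step concrete, here is a dependency-free sketch of the masked mean that `average_pool` computes, applied to a single toy sequence. The vectors and mask are made-up values, and `masked_mean` is a hypothetical helper for illustration, not part of `transformers`.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (real tokens)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy sequence: two real tokens followed by one padding token.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_vectors, attention_mask)
print(pooled)  # the padding vector does not influence the result
```

The padding position is excluded from both the sum and the count, which mirrors what the `masked_fill` and `attention_mask.sum` calls do in the batched version above.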
    

Supported Languages
-------------------

This model is initialized from [microsoft/Multilingual-MiniLM-L12-H384](https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384) and continually trained on a mixture of multilingual datasets. It supports the 100 languages covered by XLM-RoBERTa, but low-resource languages may see performance degradation.

Training Details
----------------

**Initialization**: [microsoft/Multilingual-MiniLM-L12-H384](https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384)

**First stage**: contrastive pre-training with weak supervision

| Dataset | Weak supervision | # of text pairs |
| --- | --- | --- |
| Filtered [mC4](https://huggingface.co/datasets/mc4) | (title, page content) | 1B |
| [CC News](https://huggingface.co/datasets/intfloat/multilingual_cc_news) | (title, news content) | 400M |
| [NLLB](https://huggingface.co/datasets/allenai/nllb) | translation pairs | 2.4B |
| [Wikipedia](https://huggingface.co/datasets/intfloat/wikipedia) | (hierarchical section title, passage) | 150M |
| Filtered [Reddit](https://www.reddit.com/) | (comment, response) | 800M |
| [S2ORC](https://github.com/allenai/s2orc) | (title, abstract) and citation pairs | 100M |
| [Stackexchange](https://stackexchange.com/) | (question, answer) | 50M |
| [xP3](https://huggingface.co/datasets/bigscience/xP3) | (input prompt, response) | 80M |
| [Miscellaneous unsupervised SBERT data](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | - | 10M |

**Second stage**: supervised fine-tuning

| Dataset | Language | # of text pairs |
| --- | --- | --- |
| [MS MARCO](https://microsoft.github.io/msmarco/) | English | 500k |
| [NQ](https://github.com/facebookresearch/DPR) | English | 70k |
| [Trivia QA](https://github.com/facebookresearch/DPR) | English | 60k |
| [NLI from SimCSE](https://github.com/princeton-nlp/SimCSE) | English | <300k |
| [ELI5](https://huggingface.co/datasets/eli5) | English | 500k |
| [DuReader Retrieval](https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval) | Chinese | 86k |
| [KILT Fever](https://huggingface.co/datasets/kilt_tasks) | English | 70k |
| [KILT HotpotQA](https://huggingface.co/datasets/kilt_tasks) | English | 70k |
| [SQuAD](https://huggingface.co/datasets/squad) | English | 87k |
| [Quora](https://huggingface.co/datasets/quora) | English | 150k |
| [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) | 11 languages | 50k |
| [MIRACL](https://huggingface.co/datasets/miracl/miracl) | 16 languages | 40k |

For all labeled datasets, we use only their training sets for fine-tuning.

For other training details, please refer to our paper at [https://arxiv.org/pdf/2402.05672](https://arxiv.org/pdf/2402.05672).

Benchmark Results on [Mr. TyDi](https://arxiv.org/abs/2108.08787)
-----------------------------------------------------------------

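| Model | Avg MRR@10 | ar | bn | en | fi | id | ja | ko | ru | sw | te | th |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 33.3 | 36.7 | 41.3 | 15.1 | 28.8 | 38.2 | 21.7 | 28.1 | 32.9 | 39.6 | 42.4 | 41.7 |
| mDPR | 16.7 | 26.0 | 25.8 | 16.2 | 11.3 | 14.6 | 18.1 | 21.9 | 18.5 | 7.3 | 10.6 | 13.5 |
| BM25 + mDPR | 41.7 | 49.1 | 53.5 | 28.4 | 36.5 | 45.5 | 35.5 | 36.2 | 42.7 | 40.5 | 42.0 | 49.2 |
| multilingual-e5-small | 64.4 | 71.5 | 66.3 | 54.5 | 57.7 | 63.2 | 55.4 | 54.3 | 60.8 | 65.4 | 89.1 | 70.1 |
| multilingual-e5-base | 65.9 | 72.3 | 65.0 | 58.5 | 60.8 | 64.9 | 56.6 | 55.8 | 62.7 | 69.0 | 86.6 | 72.7 |
| multilingual-e5-large | **70.5** | 77.5 | 73.2 | 60.8 | 66.8 | 68.5 | 62.5 | 61.6 | 65.8 | 72.7 | 90.2 | 76.2 |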

MTEB Benchmark Evaluation
-------------------------

Check out [unilm/e5](https://github.com/microsoft/unilm/tree/master/e5) to reproduce evaluation results on the [BEIR](https://arxiv.org/abs/2104.08663) and [MTEB benchmark](https://arxiv.org/abs/2210.07316).

Support for Sentence Transformers
---------------------------------

Below is a usage example with `sentence_transformers`.

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('intfloat/multilingual-e5-small')
    input_texts = [
        'query: how much protein should a female eat',
        'query: ',
        "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
        "passage: 1. : : : 1, 2() 3, 4, 2. :1 : : 1, 28, 3,, 4,, 5, 6, 7,"
    ]
    embeddings = model.encode(input_texts, normalize_embeddings=True)
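
Because `normalize_embeddings=True` returns unit-length vectors, a plain dot product between two embeddings equals their cosine similarity. The sketch below checks this on two made-up 2-dimensional vectors; the helper names are hypothetical and the values are not real model outputs.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query_vec = [3.0, 4.0]    # toy stand-in for a query embedding
passage_vec = [4.0, 3.0]  # toy stand-in for a passage embedding

score = dot(l2_normalize(query_vec), l2_normalize(passage_vec))
print(score, cosine(query_vec, passage_vec))  # the two values agree
```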
    

Package requirements

`pip install sentence_transformers~=2.2.2`

Contributors: [michaelfeil](https://huggingface.co/michaelfeil)

FAQ
---

**1\. Do I need to add the prefix "query: " and "passage: " to input texts?**

Yes. This is how the model was trained; without the prefixes you will see a performance degradation.

Here are some rules of thumb:

*   Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
    
*   Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
    
*   Use the "query: " prefix if you want to use embeddings as features, for example in linear probing classification or clustering.
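
The rules of thumb above can be captured in a small helper. This is only a sketch: `add_prefix` is a hypothetical function, not part of `transformers` or `sentence_transformers`, and the example sentences are made up.

```python
def add_prefix(texts, role="query"):
    """Prepend the 'query: ' or 'passage: ' prefix the model was trained with."""
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    return [f"{role}: {text}" for text in texts]


# Asymmetric retrieval: queries and passages get different prefixes.
queries = add_prefix(["how much protein should a female eat"], role="query")
passages = add_prefix(["The average requirement is 46 grams per day."], role="passage")

# Symmetric tasks and feature extraction: use 'query: ' on every text.
sentences = add_prefix(["A man is eating.", "Someone eats food."])

print(queries + passages + sentences)
```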
    

**2\. Why are my reproduced results slightly different from those reported in the model card?**

Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences.

**3\. Why are the cosine similarity scores distributed around 0.7 to 1.0?**

This is known and expected behavior, as we use a low temperature of 0.01 for the InfoNCE contrastive loss.

For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of the absolute values, so this should not be an issue.
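
As a small illustration of why only the relative order matters, the sketch below ranks three made-up scores that all fall in a narrow band, then ranks them again after stretching them to the 0 to 1 range. The scores and the `rank` helper are hypothetical.

```python
def rank(scores):
    """Return candidate indices sorted from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Made-up cosine similarities, all inside a narrow high band.
compressed = [0.91, 0.78, 0.85]

# Min-max rescaling spreads the values out but keeps their order.
lo, hi = min(compressed), max(compressed)
rescaled = [(s - lo) / (hi - lo) for s in compressed]

print(rank(compressed))
print(rank(rescaled))
```

Any monotonic rescaling leaves the ranking unchanged, so retrieval results do not depend on where the scores sit in absolute terms. Fixed similarity thresholds, by contrast, would need to be tuned to this compressed range.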

Citation
--------

If you find our paper or models helpful, please consider citing as follows:

    @article{wang2024multilingual,
      title={Multilingual E5 Text Embeddings: A Technical Report},
      author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
      journal={arXiv preprint arXiv:2402.05672},
      year={2024}
    }
    

Limitations
-----------

Long texts will be truncated to at most 512 tokens.
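
One common workaround, which the model card does not prescribe, is to split a long document into overlapping token windows and embed each window separately. The sketch below operates on a plain list of token ids; `chunk_token_ids`, the window size of 500, and the stride of 250 are all assumptions chosen for illustration, with the window kept below 512 to leave room for the prefix and special tokens.

```python
def chunk_token_ids(token_ids, max_len=500, stride=250):
    """Split token ids into windows of at most max_len, stepping by stride."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if not token_ids:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the current window already reaches the end
        start += stride
    return chunks


# Stand-in for the token ids of a 1200-token document.
token_ids = list(range(1200))
chunks = chunk_token_ids(token_ids)
print([len(c) for c in chunks])
```

How to combine the per-window embeddings (for example, scoring each window and keeping the best one) depends on the application and is left open here.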

## Model overview

The `multilingual-e5-small` model is a text embedding model developed by intfloat. It is the smallest of the `multilingual-e5` models, with 12 layers and an embedding size of 384. It is initialized from [Multilingual MiniLM](https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384) and continually trained on a mixture of multilingual datasets. It supports 100 languages, although low-resource languages may see performance degradation.

The `multilingual-e5-base` and `multilingual-e5-large` models are larger versions of the `multilingual-e5-small` model, with 12 and 24 layers respectively, and embedding sizes of 768 and 1024. These larger models leverage the [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) and [XLM-RoBERTa-Large](https://huggingface.co/xlm-roberta-large) initializations and further training on a variety of multilingual datasets.

The `multilingual-e5-large-instruct` model is an even larger version with 24 layers and a 1024 embedding size. It is initialized from XLM-RoBERTa-Large and fine-tuned on various datasets, including some that provide task-specific instructions to the model.

## Model inputs and outputs

### Inputs
- **Text**: The input text should start with either "query: " or "passage: ", even for non-English text. This is how the model was trained, and using the correct prefix is important for optimal performance.

### Outputs
- **Text embeddings**: The model outputs text embeddings, which are high-dimensional vector representations of the input text. These embeddings can be used for a variety of downstream tasks, such as semantic similarity, information retrieval, and text classification.

## Capabilities

The `multilingual-e5` models perform well on multilingual text understanding and retrieval tasks. On the [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) benchmark, a multilingual passage retrieval dataset, all three model sizes reach a higher average MRR@10 than the [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) and mDPR baselines: 64.4 for `multilingual-e5-small`, compared with 33.3 for BM25, 16.7 for mDPR, and 41.7 for their combination.

The `multilingual-e5-large-instruct` model further extends the capabilities of the `multilingual-e5` models by allowing for customization through natural language instructions. This can be useful for tailoring the text embeddings to specific tasks or scenarios.

## What can I use it for?

The `multilingual-e5` models are well-suited for a variety of text-based applications that require multilingual support, such as:

- **Information retrieval**: Use the text embeddings for semantic search and ranking of web pages, documents, or passages in response to user queries.
- **Question answering**: Leverage the models for finding relevant passages that answer a given question, across multiple languages.
- **Text classification**: Use the text embeddings as features for training classification models on multilingual datasets.
- **Semantic similarity**: Calculate the similarity between text pairs, such as for paraphrase detection or bitext mining.

The `multilingual-e5-large-instruct` model can be particularly useful for applications that benefit from customized text embeddings, such as specialized search engines, personal assistants, or chatbots.

## Things to try

One interesting aspect of the `multilingual-e5` models is the use of a low temperature (0.01) for the InfoNCE contrastive loss during training. As a result, the cosine similarity scores of the text embeddings concentrate in roughly the 0.7 to 1.0 range instead of spreading across the full range of -1 to 1.

While this may seem counterintuitive at first, it's important to note that for tasks like text retrieval or semantic similarity, what matters is the relative order of the scores rather than the absolute values. The low temperature helps to amplify the differences between similar and dissimilar text pairs, which can be beneficial for these types of applications.

You can experiment with this behavior and see how it affects the performance of your specific use case.
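
To get a feel for the temperature effect, the sketch below applies a softmax to three made-up similarity scores, once with temperature 0.01 and once with temperature 1.0. InfoNCE applies a softmax of this kind over similarities divided by the temperature; the scores and the helper name are assumptions, and this is not the training code.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over similarities divided by a temperature."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up cosine similarities that differ by only 0.05 each.
similarities = [0.90, 0.85, 0.80]

sharp = softmax_with_temperature(similarities, 0.01)
flat = softmax_with_temperature(similarities, 1.0)

print([round(p, 4) for p in sharp])
print([round(p, 4) for p in flat])
```

With temperature 0.01, a gap of 0.05 in cosine similarity is enough for the top candidate to take almost all of the probability mass, so the model can separate positives from negatives without pushing scores far apart in absolute terms.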