gte-large
=======================

General Text Embeddings (GTE) model, introduced in [Towards General Text Embeddings with Multi-stage Contrastive Learning](https://arxiv.org/abs/2308.03281).

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently come in three sizes: [GTE-large](https://huggingface.co/thenlper/gte-large), [GTE-base](https://huggingface.co/thenlper/gte-base), and [GTE-small](https://huggingface.co/thenlper/gte-small). They are trained on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios, which lets them be applied to various downstream text embedding tasks, including **information retrieval**, **semantic textual similarity**, and **text reranking**.

Metrics
-------------------

We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, please refer to the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) | Classification (12) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [**gte-large**](https://huggingface.co/thenlper/gte-large) | 0.67 | 1024 | 512 | **63.13** | 46.84 | 85.00 | 59.13 | 52.22 | 83.35 | 31.66 | 73.33 |
| [**gte-base**](https://huggingface.co/thenlper/gte-base) | 0.22 | 768 | 512 | **62.39** | 46.2 | 84.57 | 58.61 | 51.14 | 82.3 | 31.17 | 73.01 |
| [e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) | 1.34 | 1024 | 512 | 62.25 | 44.49 | 86.03 | 56.61 | 50.56 | 82.05 | 30.19 | 75.24 |
| [e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) | 0.44 | 768 | 512 | 61.5 | 43.80 | 85.73 | 55.91 | 50.29 | 81.05 | 30.28 | 73.84 |
| [**gte-small**](https://huggingface.co/thenlper/gte-small) | 0.07 | 384 | 512 | **61.36** | 44.89 | 83.54 | 57.7 | 49.46 | 82.07 | 30.42 | 72.31 |
| [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings) | - | 1536 | 8192 | 60.99 | 45.9 | 84.89 | 56.32 | 49.25 | 80.97 | 30.8 | 70.93 |
| [e5-small-v2](https://huggingface.co/intfloat/e5-base-v2) | 0.13 | 384 | 512 | 59.93 | 39.92 | 84.67 | 54.32 | 49.04 | 80.39 | 31.16 | 72.94 |
| [sentence-t5-xxl](https://huggingface.co/sentence-transformers/sentence-t5-xxl) | 9.73 | 768 | 512 | 59.51 | 43.72 | 85.06 | 56.42 | 42.24 | 82.63 | 30.08 | 73.42 |
| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.44 | 768 | 514 | 57.78 | 43.69 | 83.04 | 59.36 | 43.81 | 80.28 | 27.49 | 65.07 |
| [sgpt-bloom-7b1-msmarco](https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco) | 28.27 | 4096 | 2048 | 57.59 | 38.93 | 81.9 | 55.65 | 48.22 | 77.74 | 33.6 | 66.19 |
| [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) | 0.13 | 384 | 512 | 56.53 | 41.81 | 82.41 | 58.44 | 42.69 | 79.8 | 27.9 | 63.21 |
| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 0.09 | 384 | 512 | 56.26 | 42.35 | 82.37 | 58.04 | 41.95 | 78.9 | 30.81 | 63.05 |
| [contriever-base-msmarco](https://huggingface.co/nthakur/contriever-base-msmarco) | 0.44 | 768 | 512 | 56.00 | 41.1 | 82.54 | 53.14 | 41.88 | 76.51 | 30.36 | 66.68 |
| [sentence-t5-base](https://huggingface.co/sentence-transformers/sentence-t5-base) | 0.22 | 768 | 512 | 55.27 | 40.21 | 85.18 | 53.09 | 33.63 | 81.14 | 31.39 | 69.81 |
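
The numbers in parentheses in the column headers are task counts, and the Average (56) column appears to be consistent with a task-count-weighted mean of the seven category scores. The short check below reproduces the gte-large average from its category scores. It is an observation about the reported numbers, not a description of how the leaderboard computes its average; the names `TASK_COUNTS`, `GTE_LARGE`, and `weighted_average` are made up for this sketch.

```python
# Number of MTEB tasks per category, taken from the column headers.
TASK_COUNTS = {
    "Clustering": 11,
    "Pair Classification": 3,
    "Reranking": 4,
    "Retrieval": 15,
    "STS": 10,
    "Summarization": 1,
    "Classification": 12,
}

# Category scores reported for gte-large in the comparison above.
GTE_LARGE = {
    "Clustering": 46.84,
    "Pair Classification": 85.00,
    "Reranking": 59.13,
    "Retrieval": 52.22,
    "STS": 83.35,
    "Summarization": 31.66,
    "Classification": 73.33,
}


def weighted_average(scores: dict, counts: dict) -> float:
    """Mean of category scores, weighting each category by its task count."""
    total_tasks = sum(counts.values())
    return sum(scores[name] * counts[name] for name in counts) / total_tasks


print(sum(TASK_COUNTS.values()))                          # 56
print(round(weighted_average(GTE_LARGE, TASK_COUNTS), 2))  # 63.13
```

Because retrieval and classification together account for 27 of the 56 tasks, they influence the average far more than summarization does.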

Usage
---------------

Code example

    import torch
    import torch.nn.functional as F
    from torch import Tensor
    from transformers import AutoTokenizer, AutoModel
    
    def average_pool(last_hidden_states: Tensor,
                     attention_mask: Tensor) -> Tensor:
        # Zero out padding positions, then average over the real tokens only
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    
    # The first text is scored against the remaining three
    input_texts = [
        "what is the capital of China?",
        "how to implement quick sort in python?",
        "Beijing",
        "sorting algorithms"
    ]
    
    tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
    model = AutoModel.from_pretrained("thenlper/gte-large")
    
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    
    # (Optionally) normalize embeddings so the dot product below is a cosine similarity
    embeddings = F.normalize(embeddings, p=2, dim=1)
    # Similarity of the first text to each of the other three, scaled by 100
    scores = (embeddings[:1] @ embeddings[1:].T) * 100
    print(scores.tolist())
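
To make the masked mean pooling in `average_pool` easier to follow, here is a toy version in plain Python lists. It is a sketch for illustration only, not a replacement for the tensor code above: the inputs are made-up numbers, and the function mirrors the same idea of ignoring positions where the attention mask is 0.

```python
from typing import List


def average_pool(token_vectors: List[List[List[float]]],
                 attention_mask: List[List[int]]) -> List[List[float]]:
    """Average the token vectors of each sequence, skipping padded positions."""
    pooled = []
    for vectors, mask in zip(token_vectors, attention_mask):
        total = [0.0] * len(vectors[0])
        for vector, keep in zip(vectors, mask):
            if keep:  # mask value 1 marks a real token, 0 marks padding
                for i, value in enumerate(vector):
                    total[i] += value
        n_real_tokens = sum(mask)
        pooled.append([component / n_real_tokens for component in total])
    return pooled


# One sequence with three positions; the last position is padding.
hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]
print(average_pool(hidden, mask))  # [[2.0, 3.0]]
```

The padded vector `[100.0, 100.0]` has no effect on the result, which is the reason for masking before averaging.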
    

Use with sentence-transformers:

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim
    
    sentences = ['That is a happy person', 'That is a very happy person']
    
    model = SentenceTransformer('thenlper/gte-large')
    embeddings = model.encode(sentences)
    print(cos_sim(embeddings[0], embeddings[1]))
    

### Limitation

This model supports English text only. Inputs longer than 512 tokens are truncated.
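
One possible way to work around the length limit is to split a long text into overlapping windows and embed each window separately. The sketch below shows only the windowing step. It makes several assumptions: `chunk_tokens` and the overlap of 64 are hypothetical choices, and whitespace splitting stands in for the model's real tokenizer, so a real pipeline would count tokenizer tokens and leave room for the special tokens the tokenizer adds.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int = 512,
                 overlap: int = 64) -> List[List[str]]:
    """Split tokens into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text near a
    boundary appears in both windows.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace splitting is only a stand-in for the model's tokenizer.
words = ("token " * 1000).split()
chunks = chunk_tokens(words, max_len=512, overlap=64)
print([len(chunk) for chunk in chunks])  # [512, 512, 104]
```

How to combine the per-window embeddings (for example, averaging them or keeping the best-scoring window) is a separate design choice that this model card does not prescribe.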

### Citation

If you find our paper or models helpful, please consider citing them as follows:

    @article{li2023towards,
      title={Towards general text embeddings with multi-stage contrastive learning},
      author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
      journal={arXiv preprint arXiv:2308.03281},
      year={2023}
    }

## Model overview

The `gte-large` model is a general text embedding model created by Alibaba DAMO Academy. It is based on the BERT framework and is the largest of three sizes offered, alongside `gte-base` and `gte-small`. The GTE models are trained on a large-scale corpus of relevant text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream text embedding tasks such as information retrieval, semantic textual similarity, and text reranking.

The `multilingual-e5-large` model is a large multilingual text embedding model created by Microsoft researchers. It is based on the XLM-RoBERTa architecture and supports over 100 languages. The model is pre-trained on a diverse set of datasets including Wikipedia, CCNews, and NLLB, then fine-tuned on tasks like passage retrieval, question answering, and natural language inference.

Both the GTE and E5 models aim to provide high-quality text embeddings that can be used for a variety of language tasks. The GTE models focus on general-purpose text understanding, while the E5 models specialize more in multilingual applications.

## Model inputs and outputs

### Inputs
- **Text sequences**: The model accepts text sequences as input, which can be short queries, long passages, or any other natural language text.

### Outputs
- **Text embeddings**: The primary output of the model is a dense vector representation (embedding) for each input text sequence. These embeddings capture the semantic meaning and relationships between the input texts.
- **Similarity scores**: For tasks like passage retrieval or semantic textual similarity, pairwise similarity scores between input texts can be computed from the embeddings, for example with cosine similarity. The model itself outputs only the embeddings.
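
As a minimal illustration of how a similarity score relates to two embeddings, the sketch below computes cosine similarity in plain Python. The vectors are made-up two-dimensional stand-ins (real `gte-large` embeddings have 1024 dimensions), and `cosine_similarity` is a name chosen for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # longer vector, same direction
orthogonal = [0.0, 3.0]      # no overlap with the query direction

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

Cosine similarity depends only on direction, not vector length, which is why L2-normalizing embeddings first lets a plain dot product give the same score.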

## Capabilities

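The `gte-large` model performs well across text embedding tasks on the MTEB benchmark. In the Metrics comparison above, it has the highest average score (63.13 over 56 tasks) of the listed models and the highest category scores for clustering (46.84), retrieval (52.22), and semantic textual similarity (83.35). On reranking (59.13) it is second to `all-mpnet-base-v2` (59.36).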

The `multilingual-e5-large` model is particularly adept at multilingual tasks. It demonstrates impressive performance on the Mr. TyDi benchmark, which evaluates passage retrieval across 11 diverse languages. The model's broad language support makes it a useful tool for applications that need to handle text in multiple languages.

Both models can be fine-tuned on domain-specific data to further optimize their performance for particular use cases.

## What can I use it for?

The `gte-large` and `multilingual-e5-large` models are versatile tools that can be applied to a wide range of NLP tasks. Some potential use cases include:

- **Information retrieval**: Use the models to find relevant documents or passages given a search query.
- **Semantic search**: Leverage the models' text embeddings to build semantic search engines that can understand user intent beyond just keyword matching.
- **Chatbots and virtual assistants**: Incorporate the models into conversational AI systems to improve understanding of user queries and provide more relevant responses.
- **Content recommendation**: Use the models to identify similar content or recommend relevant items to users based on their interests or browsing history.
- **Multilingual applications**: Take advantage of the `multilingual-e5-large` model's broad language support to build applications that can handle text in multiple languages.
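
The retrieval and semantic search use cases above share one core step: rank stored document embeddings by similarity to a query embedding and keep the best few. The sketch below shows that step with made-up two-dimensional vectors standing in for model embeddings. The names `l2_normalize` and `top_k` and the toy document IDs are hypothetical, and a production system would typically use a vector index rather than a linear scan.

```python
import heapq
import math
from typing import Dict, List, Tuple


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(query_vector: List[float], documents: Dict[str, List[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k document IDs most similar to the query, best first."""
    query = l2_normalize(query_vector)
    scored = []
    for doc_id, vector in documents.items():
        doc = l2_normalize(vector)
        # Dot product of unit vectors equals cosine similarity.
        score = sum(q * d for q, d in zip(query, doc))
        scored.append((score, doc_id))
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Toy stand-ins for embeddings; real ones would come from the model.
documents = {
    "beijing": [0.9, 0.1],
    "sorting": [0.1, 0.9],
    "shanghai": [0.8, 0.3],
}
results = top_k([1.0, 0.0], documents, k=2)
print([doc_id for doc_id, _ in results])  # ['beijing', 'shanghai']
```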

## Things to try

One interesting aspect of the `gte-large` and `multilingual-e5-large` models is their ability to handle short queries and long passages effectively. For tasks like passage retrieval, you can experiment with adding a simple instruction prefix to the query (e.g., "Represent this sentence for searching relevant passages:") to see if it improves the model's performance.
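
A small helper makes the prefix experiment easy to set up: prepend the instruction to queries only, leave passages unchanged, then embed both variants and compare retrieval quality on your own data. The helper name `add_query_prefix` is hypothetical, and whether the prefix helps for a given model is something to measure rather than assume.

```python
from typing import List

# Prefix quoted in the paragraph above, with a trailing space added so it
# does not run into the query text.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def add_query_prefix(queries: List[str],
                     prefix: str = QUERY_PREFIX) -> List[str]:
    """Return new query strings with the instruction prefix prepended."""
    return [prefix + query for query in queries]


queries = ["what is the capital of China?"]
passages = ["Beijing"]  # passages are embedded as-is, without a prefix

prefixed_queries = add_query_prefix(queries)
print(prefixed_queries[0])
```

Embedding `queries` and `prefixed_queries` against the same `passages` and comparing the resulting rankings gives a direct before-and-after comparison.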

Another area to explore is the models' robustness to domain-specific terminology or jargon. You can try fine-tuning the models on your own dataset to see if it enhances their ability to understand and relate specialized content.

Finally, fine-tuning techniques such as mining hard negatives (passages that score as similar to a query but are not actually relevant to it) can further improve the models' embedding quality and downstream task performance.
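
Hard-negative mining can be sketched as follows: score every candidate passage against a query, drop the known positives, and keep the highest-scoring passages that remain. This is a minimal sketch under stated assumptions, not a method taken from the GTE paper: the vectors are made-up stand-ins for embeddings, and `mine_hard_negatives` and the passage IDs are hypothetical.

```python
import math
from typing import Dict, List, Set


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def mine_hard_negatives(query_vector: List[float],
                        candidates: Dict[str, List[float]],
                        positive_ids: Set[str],
                        n: int = 1) -> List[str]:
    """Return the n non-positive candidates most similar to the query."""
    negatives = [
        (cosine_similarity(query_vector, vector), passage_id)
        for passage_id, vector in candidates.items()
        if passage_id not in positive_ids
    ]
    negatives.sort(reverse=True)  # most similar (hardest) first
    return [passage_id for _, passage_id in negatives[:n]]


# Toy stand-ins for embeddings of one query and three passages.
query = [1.0, 0.0]
candidates = {
    "labeled_positive": [0.95, 0.05],
    "near_miss": [0.8, 0.2],
    "unrelated": [0.0, 1.0],
}
print(mine_hard_negatives(query, candidates, {"labeled_positive"}))  # ['near_miss']
```

The selected passages would then serve as negatives in contrastive fine-tuning, where they give a stronger training signal than randomly sampled ones.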