mxbai-embed-large-v1

Maintainer: mixedbread-ai

Total Score: 342

Last updated: 5/28/2024

Model Link: View on HuggingFace
API Spec: View on HuggingFace
Github Link: No Github link provided
Paper Link: No paper link provided


Model overview

The mxbai-embed-large-v1 model is part of the "crispy sentence embedding family" from mixedbread-ai. It is a large-scale sentence embedding model that can be used for a variety of text-related tasks, such as semantic search, passage retrieval, and text clustering.

The model has been trained on a large and diverse dataset of sentence pairs, using a contrastive learning objective to produce embeddings that capture the semantic meaning of the input text. This approach allows the model to learn rich representations that can be effectively used for downstream applications.

Compared to similar models like mxbai-rerank-large-v1 and multi-qa-MiniLM-L6-cos-v1, the mxbai-embed-large-v1 model focuses more on general-purpose sentence embeddings rather than specifically optimizing for retrieval or question-answering tasks.

Model inputs and outputs

Inputs

  • Text: The model can take a single sentence or a list of sentences as input.

Outputs

  • Sentence embeddings: The model outputs a dense vector representation for each input sentence. The embeddings can be used for a variety of downstream tasks.

Capabilities

The mxbai-embed-large-v1 model can be used for a wide range of text-related tasks, including:

  • Semantic search: The sentence embeddings can be used to find semantically similar passages or documents for a given query.
  • Text clustering: The embeddings can be used to group similar sentences or documents together based on their semantic content.
  • Text classification: The embeddings can be used as features for training classifiers on text data.
  • Sentence similarity: The cosine similarity between two sentence embeddings can be used to measure the semantic similarity between the corresponding sentences.
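The last bullet can be made concrete in a few lines of NumPy. The toy 4-dimensional vectors below stand in for real model output (mxbai-embed-large-v1 produces 1024-dimensional vectors); the math is identical at any dimensionality.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" standing in for model output.
query = np.array([0.1, 0.8, 0.3, 0.1])
doc_a = np.array([0.2, 0.7, 0.4, 0.0])   # semantically close to the query
doc_b = np.array([0.9, 0.0, 0.1, 0.8])   # unrelated

print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```

The same comparison drives semantic search and clustering: both reduce to computing these similarities at scale.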

What can I use it for?

The mxbai-embed-large-v1 model can be a powerful tool for a variety of applications, such as:

  • Knowledge management: Use the model to efficiently organize and retrieve relevant information from large text corpora, such as research papers, product documentation, or customer support queries.
  • Recommendation systems: Leverage the semantic understanding of the model to suggest relevant content or products to users based on their search queries or browsing history.
  • Chatbots and virtual assistants: Incorporate the model's language understanding capabilities to improve the relevance and coherence of responses in conversational AI systems.
  • Content analysis: Apply the model to tasks like topic modeling, sentiment analysis, or text summarization to gain insights from large volumes of unstructured text data.

Things to try

One interesting aspect of the mxbai-embed-large-v1 model is its support for Matryoshka Representation Learning and binary quantization. These techniques let the model produce compact, low-dimensional representations of the input text, which is particularly useful for applications with constrained computational resources or memory budgets.
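A rough sketch of what these two techniques buy you, in plain NumPy with a random vector standing in for a real embedding: Matryoshka-trained models let you keep just a prefix of the vector, and binary quantization keeps one bit per dimension.

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dims: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dims` components,
    then re-normalize so cosine similarity still works."""
    truncated = emb[:dims]
    return truncated / np.linalg.norm(truncated)

def binary_quantize(emb: np.ndarray) -> np.ndarray:
    """Binary quantization: 1 bit per dimension (the sign of each
    component), packed into bytes -- a 32x reduction versus float32."""
    return np.packbits(emb > 0)

# Stand-in for a 1024-dimensional model embedding.
emb = np.random.default_rng(0).normal(size=1024).astype(np.float32)

small = truncate_embedding(emb, 256)   # 256 floats instead of 1024
bits = binary_quantize(emb)            # 128 bytes instead of 4096
```

Truncation trades a little accuracy for a smaller index; binary codes additionally allow very fast Hamming-distance comparisons.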

Another area to explore is the model's performance on domain-specific tasks. While the model is trained on a broad, general-purpose dataset, fine-tuning it on more specialized corpora may lead to improved results for certain applications, such as legal document retrieval or clinical text analysis.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

mxbai-rerank-large-v1

mixedbread-ai

Total Score: 69

The mxbai-rerank-large-v1 model is the largest in a family of powerful reranker models created by mixedbread ai. It can be used to rerank a set of documents based on a given query, and is part of a suite of three rerankers: mxbai-rerank-xsmall-v1, mxbai-rerank-base-v1, and mxbai-rerank-large-v1.

Model inputs and outputs

Inputs

  • Query: A natural language query for which you want to rerank a set of documents.
  • Documents: A list of text documents to rerank based on the given query.

Outputs

  • Relevance scores: The model outputs a relevance score for each document in the input list, indicating how well it matches the given query.

Capabilities

The mxbai-rerank-large-v1 model can improve the ranking of documents retrieved by a search engine or other text retrieval system. Given a query and a set of candidate documents, it re-orders the documents to surface the most relevant ones at the top of the list.

What can I use it for?

You can use the mxbai-rerank-large-v1 model to build robust search and retrieval systems. For example, it could power the search functionality of a content-rich website, helping users quickly find the most relevant information. It could also be integrated into chatbots or virtual assistants to improve their ability to understand user queries and surface the most helpful responses.

Things to try

One interesting thing to try with the mxbai-rerank-large-v1 model is experimenting with different types of queries. While it is designed for natural language queries, you could also feed it more structured or keyword-based queries and see how the reranking results differ. You could also vary the size of the input document set to understand how the model's performance scales with the number of items it needs to rerank.
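The reranking step itself is easy to picture: score every (query, document) pair with the model, then sort by score. The sketch below uses a hypothetical `toy_score` function (simple token overlap) in place of the actual cross-encoder, since the sorting logic is independent of how the scores are produced.

```python
def rerank(query: str, documents: list[str], score_fn) -> list[tuple[str, float]]:
    """Score each (query, document) pair and return the documents
    sorted from most to least relevant."""
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def toy_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for the cross-encoder: token overlap."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

docs = ["how to bake bread at home", "stock market update", "bread baking tips"]
ranked = rerank("baking bread", docs, toy_score)
print(ranked[0][0])  # "bread baking tips"
```

In practice you would replace `toy_score` with a call to the reranker model, typically applied only to the top candidates from a cheaper first-stage retriever.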


multilingual-e5-large

intfloat

Total Score: 594

The multilingual-e5-large model is a large-scale multilingual text embedding model developed by the researcher intfloat. It is based on XLM-RoBERTa-large and has been continually trained on a mixture of multilingual datasets. The model supports 100 languages but may see performance degradation on low-resource languages.

Model inputs and outputs

Inputs

  • Text: The input can be a query or a passage, denoted by the prefixes "query:" and "passage:" respectively. The prefixes should be used even for non-English text.

Outputs

  • Embeddings: The model outputs 1024-dimensional text embeddings that capture the semantic information of the input text. The embeddings can be used for tasks like information retrieval, clustering, and similarity search.

Capabilities

The multilingual-e5-large model can encode text in 100 different languages, producing high-quality embeddings that preserve the semantic information of the input. This makes it useful for a variety of natural language processing tasks.

What can I use it for?

The multilingual-e5-large model can be used for tasks that require understanding and comparing text in multiple languages, such as:

  • Information retrieval: The text embeddings can be used to find relevant documents or passages for a given query, even across languages.
  • Semantic search: The embeddings can be used to identify similar text, enabling applications like recommendation systems or clustering.
  • Multilingual text analysis: The model can be used to analyze and compare text in different languages, for use cases like market research or cross-cultural studies.

Things to try

One interesting aspect of the multilingual-e5-large model is its ability to handle low-resource languages. While the model supports 100 languages, it may see some performance degradation on less commonly used ones. Developers could experiment with using the model on tasks in these low-resource languages and compare its effectiveness against other multilingual models.
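The prefix convention is easy to get wrong, so a small helper can enforce it before encoding. A minimal sketch; `e5_inputs` is a hypothetical name, not part of any library.

```python
def e5_inputs(queries: list[str], passages: list[str]) -> list[str]:
    """Prepend the prefixes multilingual-e5-large expects before encoding.
    The prefixes stay in English even for non-English text."""
    return [f"query: {q}" for q in queries] + [f"passage: {p}" for p in passages]

texts = e5_inputs(
    ["wie viele Planeten gibt es?"],
    ["Das Sonnensystem hat acht Planeten."],
)
# `texts` would then be passed to the model's encode step.
```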


multi-qa-MiniLM-L6-cos-v1

sentence-transformers

Total Score: 102

The multi-qa-MiniLM-L6-cos-v1 is a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space. It was designed for semantic search and has been trained on 215M (question, answer) pairs from diverse sources. Similar models include multi-qa-mpnet-base-dot-v1, which maps sentences to a 768-dimensional space, and all-MiniLM-L12-v2, a 384-dimensional model trained on over 1 billion sentence pairs.

Model inputs and outputs

Inputs

  • Text input, such as a sentence or paragraph

Outputs

  • A 384-dimensional dense vector representation of the input text

Capabilities

The multi-qa-MiniLM-L6-cos-v1 model encodes text into a semantic vector space in which documents with similar meanings lie closer together. This allows it to be used for tasks like semantic search, where the model can find the most relevant documents for a given query.

What can I use it for?

The multi-qa-MiniLM-L6-cos-v1 model is well-suited for building semantic search applications, where users can search for relevant documents or passages based on the meaning of their queries rather than just keyword matching. For example, you could use this model to build a FAQ search system, where users can find the most relevant answers to their questions.

Things to try

One interesting thing to try with this model is using it as a feature extractor for other NLP tasks, such as text classification or clustering. The semantic vector representations it produces capture the meaning of the text and may improve the performance of downstream models.
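Because the model's vectors live in cosine space (the "-cos-" in its name), semantic search over a corpus reduces to a matrix-vector product followed by a top-k selection. A minimal NumPy sketch, with random normalized vectors standing in for encoded documents:

```python
import numpy as np

def top_k(query_emb: np.ndarray, corpus_embs: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k most similar corpus entries. Assumes all
    embeddings are L2-normalized, so dot product == cosine similarity."""
    scores = corpus_embs @ query_emb
    return list(np.argsort(-scores)[:k])

rng = np.random.default_rng(42)
corpus = rng.normal(size=(10, 384))            # stand-in for encoded FAQ answers
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = corpus[7] + 0.01 * rng.normal(size=384)  # a query close to entry 7
query /= np.linalg.norm(query)

print(top_k(query, corpus, k=3)[0])  # 7: the nearest entry is the one we perturbed
```

For corpora too large for a brute-force matrix product, the same normalized vectors can be indexed with an approximate nearest-neighbor library instead.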


bge-large-zh

BAAI

Total Score: 290

The bge-large-zh model is a state-of-the-art text embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI). It is part of the BAAI General Embedding (BGE) family of models, which have achieved top performance on both the MTEB and C-MTEB benchmarks. The bge-large-zh model is specifically designed for Chinese text processing: it maps any Chinese text into a low-dimensional dense vector that can be used for tasks like retrieval, classification, clustering, or semantic search.

Compared to similar models like BAAI/bge-large-en and BAAI/bge-small-en, the bge-large-zh model has been optimized for Chinese text and has demonstrated state-of-the-art performance on Chinese benchmarks. The BAAI/llm-embedder model is a more recent addition to the BAAI family, serving as a unified embedding model to support diverse retrieval-augmentation needs for large language models (LLMs).

Model inputs and outputs

Inputs

  • Text: The bge-large-zh model can take any Chinese text as input, from short queries to long passages.
  • Instruction (optional): For retrieval tasks that use short queries to find long related documents, it is recommended to prepend an instruction to the query to help the model better understand the intent. No instruction is needed for the passage/document text.

Outputs

  • Embeddings: The primary output of the bge-large-zh model is a dense vector embedding of the input text. These embeddings support a variety of downstream tasks: retrieval (finding related passages or documents by comparing query and document embeddings), classification (using the embeddings as features for training classifiers), clustering (grouping similar text together), and semantic search (finding semantically related text).

Capabilities

The bge-large-zh model demonstrates state-of-the-art performance on a range of Chinese text processing tasks. On the Chinese Massive Text Embedding Benchmark (C-MTEB), the bge-large-zh-v1.5 model ranked first overall, with strong results across tasks like retrieval, semantic similarity, and classification.

Additionally, the bge-large-zh model has been designed to handle long input text, with a maximum sequence length of 512 tokens. This makes it well-suited for tasks that involve processing lengthy passages or documents, such as research paper retrieval or legal document search.

What can I use it for?

The bge-large-zh model can be used for a variety of Chinese text processing tasks, including:

  • Retrieval: Find relevant passages or documents given a query. This can be helpful for building search engines, Q&A systems, or knowledge management tools.
  • Classification: Use the model's embeddings as features to train classification models for tasks like sentiment analysis, topic classification, or intent detection.
  • Clustering: Group similar Chinese text together using the model's embeddings, which can be useful for organizing large document collections or categorizing user-generated content.
  • Semantic search: Find semantically related text by computing the similarity between the model's embeddings, enabling more advanced search experiences.

Things to try

One interesting aspect of the bge-large-zh model is its ability to handle queries with or without an instruction. While adding an instruction to the query can improve retrieval performance, the v1.5 version has been enhanced to perform well even without it, which makes the model more convenient to use when crafting the perfect query instruction is impractical.

Another thing to try is fine-tuning the bge-large-zh model on your own data. The provided examples show how to prepare data and fine-tune the model to improve its performance on a specific use case. This can be particularly helpful if you have domain-specific text that the pre-trained model doesn't handle as well.
