jina-embeddings-v2-base-zh

Maintainer: jinaai

Total Score: 121

Last updated 5/27/2024


  • Model Link: View on HuggingFace
  • API Spec: View on HuggingFace
  • Github Link: No Github link provided
  • Paper Link: No paper link provided


Model Overview

The jina-embeddings-v2-base-zh model is a Chinese/English bilingual text embedding model developed by Jina AI. It is based on a BERT architecture (JinaBERT) that uses the symmetric bidirectional variant of ALiBi to support sequence lengths of up to 8192 tokens. Unlike Jina's monolingual embedding models, jina-embeddings-v2-base-zh is a 161 million parameter model trained specifically on mixed Chinese-English input, giving it strong performance in both monolingual and cross-lingual applications.

Similar Jina AI embedding models include jina-embeddings-v2-base-en, jina-embeddings-v2-small-en, and jina-embeddings-v2-base-de, as well as an upcoming jina-embeddings-v2-base-es model for Spanish-English bilingual embeddings.

Model Inputs and Outputs

Inputs

  • Text sequence: The model takes in text sequences of up to 8192 tokens, supporting both Chinese and English, as well as a mix of the two.

Outputs

  • Text embeddings: The model outputs 768-dimensional embedding vectors that capture the semantic meaning of the input text. These can be used for a variety of downstream tasks like information retrieval, text similarity, and multilingual applications.
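As a quick illustration, here is a minimal sketch of generating embeddings with the Hugging Face transformers library. It assumes the encode() convenience method exposed by the model's custom code on the Hub (hence trust_remote_code=True); the example sentences are arbitrary placeholders.

```python
# Minimal embedding sketch (assumes the encode() helper shipped with the
# model's remote code; requires transformers and numpy).
from numpy.linalg import norm
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True
)

# Chinese and English inputs can be mixed freely.
embeddings = model.encode(["How is the weather today?", "今天天气怎么样?"])

# Cosine similarity between the English sentence and its Chinese counterpart
# should be high, reflecting the model's cross-lingual alignment.
cos_sim = embeddings[0] @ embeddings[1] / (norm(embeddings[0]) * norm(embeddings[1]))
print(cos_sim)
```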

Capabilities

The jina-embeddings-v2-base-zh model has been designed to excel at both monolingual and cross-lingual tasks involving Chinese and English text. Its support for sequences of up to 8192 tokens makes it useful for applications that need to process long-form content, such as document retrieval, semantic textual similarity, and text reranking.

What Can I Use It For?

The jina-embeddings-v2-base-zh model can be used for a wide range of natural language processing tasks that require high-quality text embeddings, especially those involving a mix of Chinese and English text. Some potential use cases include:

  • Information Retrieval: Use the embeddings for semantic search and retrieval of Chinese or English documents, or documents containing a mix of both languages (a minimal retrieval sketch follows this list).
  • Text Similarity: Compute the similarity between Chinese, English, or bilingual text passages to detect paraphrases, identify related content, or perform clustering.
  • Multilingual Applications: Leverage the model's cross-lingual capabilities to build applications that seamlessly handle Chinese and English input, such as chatbots or question-answering systems.
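As a concrete illustration of the retrieval use case, here is a sketch that ranks a tiny mixed-language corpus against a Chinese query by cosine similarity. The corpus and query are made-up placeholders, and the encode() helper from the model's custom code is assumed as before.

```python
# Toy bilingual retrieval: embed a corpus and a query, rank by cosine similarity.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True
)

corpus = [
    "Jina AI develops multimodal search technology.",
    "北京是中国的首都。",  # "Beijing is the capital of China."
    "The Great Wall stretches across northern China.",
]
query = "中国的首都是哪个城市?"  # "Which city is the capital of China?"

doc_vecs = np.asarray(model.encode(corpus), dtype=np.float32)
query_vec = np.asarray(model.encode([query]), dtype=np.float32)[0]

# Normalize so the dot product equals cosine similarity.
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)
scores = doc_vecs @ query_vec

# Print documents from best to worst match; the Chinese capital sentence
# should rank first despite the query/document language mix.
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```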

Things to Try

An interesting aspect of the jina-embeddings-v2-base-zh model is its ability to handle long input sequences of up to 8192 tokens. This makes it well-suited for tasks involving lengthy documents or multi-paragraph inputs. You could experiment with using the model for tasks like:

  • Long-form text summarization, where the model's ability to capture semantic meaning in long passages could improve the quality of generated summaries.
  • Cross-lingual document retrieval, where the model's bilingual capabilities and long sequence support could help surface relevant content even when the query and target documents are in different languages.
  • Multilingual dialog systems, where the model's embeddings could be used to maintain context and coherence across language switches within a conversation.

By exploring the model's unique features, you can uncover novel applications that leverage its strengths in handling long, multilingual text inputs.
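For example, a quick way to sanity-check the long-context behavior is to embed a document far longer than the usual 512-token BERT limit. The sketch below uses a synthetic document; how the model handles inputs beyond 8192 tokens (truncation or otherwise) depends on its tokenizer settings.

```python
# Sanity-check the 8192-token context: embed one long document in a single pass.
from transformers import AutoModel, AutoTokenizer

name = "jinaai/jina-embeddings-v2-base-zh"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

# Simulate a multi-paragraph document of several thousand tokens.
long_document = "这是一段很长的文档内容,用来测试模型的长序列支持。" * 300

print(len(tokenizer(long_document)["input_ids"]))  # well beyond 512, under 8192

# One embedding for the whole document, no chunking required.
doc_embedding = model.encode([long_document])[0]
print(doc_embedding.shape)  # (768,)
```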




Related Models

jina-embeddings-v2-base-en

Maintainer: jinaai

Total Score: 625

The jina-embeddings-v2-base-en model is a text embedding model created by Jina AI. It is based on a BERT architecture called JinaBERT that supports longer sequence lengths of up to 8192 tokens using the symmetric bidirectional variant of ALiBi. The model was further trained on over 400 million sentence pairs and hard negatives from various domains, which makes it useful for a range of use cases like long document retrieval, semantic textual similarity, and text reranking. Compared to the smaller jina-embeddings-v2-small-en model, this base version has 137 million parameters, allowing for fast inference while delivering better performance.

Model inputs and outputs

Inputs

  • Text sequences up to 8192 tokens long.

Outputs

  • 768-dimensional text embeddings.

Capabilities

The jina-embeddings-v2-base-en model can generate high-quality embeddings for long text sequences, enabling applications like semantic search, text similarity, and document understanding. Its ability to handle 8192-token sequences makes it particularly useful for working with long-form content like research papers, legal contracts, or product descriptions.

What can I use it for?

The embeddings produced by this model can be used in a variety of downstream natural language processing tasks. Some potential use cases include:

  • Long document retrieval: Finding relevant documents from a large corpus based on semantic similarity to a query.
  • Semantic textual similarity: Measuring the semantic similarity between text pairs, which can be useful for applications like plagiarism detection or textual entailment.
  • Text reranking: Reordering a list of documents or passages based on their relevance to a given query.
  • Recommendation systems: Suggesting relevant content to users based on the semantic similarity of items.
  • RAG and LLM-based generative search: Enabling more powerful and flexible search experiences powered by large language models.

Things to try

One interesting aspect of the jina-embeddings-v2-base-en model is its ability to handle very long text sequences, up to 8192 tokens. This makes it well-suited for working with long-form content like research papers, legal contracts, or product descriptions. You could try using the model to perform semantic search or text similarity analysis on a corpus of long-form documents, and see how the performance compares to models with shorter sequence lengths.

Another interesting area to explore would be the model's use in recommendation systems or generative search applications. The high-quality embeddings produced by the model could be leveraged to suggest relevant content to users or to enable more flexible and powerful search experiences powered by large language models.


jina-embeddings-v2-small-en

Maintainer: jinaai

Total Score: 110

jina-embeddings-v2-small-en is an English text embedding model trained by Jina AI. It is based on a BERT architecture called JinaBERT that supports longer sequence lengths of up to 8192 tokens using the ALiBi technique. The model was further trained on over 400 million sentence pairs and hard negatives from various domains. Compared to the larger jina-embeddings-v2-base-en model, this smaller 33 million parameter version enables fast and efficient inference while still delivering impressive performance.

Model inputs and outputs

Inputs

  • Text sequences: The model can handle text inputs up to 8192 tokens in length.

Outputs

  • Sentence embeddings: The model outputs 512-dimensional dense vector representations that capture the semantic meaning of the input text.

Capabilities

jina-embeddings-v2-small-en is a highly capable text encoding model that can be used for a variety of natural language processing tasks. Its ability to handle long input sequences makes it particularly useful for applications like long document retrieval, semantic textual similarity, text reranking, recommendation, and generative search.

What can I use it for?

The jina-embeddings-v2-small-en model can be used for a wide range of applications, including:

  • Information Retrieval: Encoding long documents or queries into semantic vectors for efficient similarity-based search and ranking.
  • Recommendation Systems: Generating embeddings of items (e.g. articles, products) or user queries to enable content-based recommendation.
  • Text Classification: Using the sentence embeddings as input features for downstream classification tasks.
  • Semantic Similarity: Computing the semantic similarity between text pairs, such as for paraphrase detection or question answering.
  • Natural Language Generation: Incorporating the model into RAG (Retrieval-Augmented Generation) or other LLM-based systems to improve the coherence and relevance of generated text.

Things to try

A key advantage of the jina-embeddings-v2-small-en model is its ability to handle long input sequences. This makes it well-suited for tasks involving lengthy documents, such as legal contracts, research papers, or product manuals. You could explore using this model to build intelligent search or recommendation systems that can effectively process and understand these types of complex, information-rich text inputs.

Additionally, the model's strong performance on semantic similarity tasks suggests it could be useful for building chatbots or dialogue systems that need to understand the meaning behind user queries and provide relevant, context-aware responses.


jina-embeddings-v2-base-de

Maintainer: jinaai

Total Score: 53

The jina-embeddings-v2-base-de model is a German/English bilingual text embedding model developed by Jina AI. It supports input sequences up to 8192 tokens and is based on a BERT architecture (JinaBERT) that uses the symmetric bidirectional variant of ALiBi to handle longer sequences. Jina AI has also released several other embedding models, including jina-embeddings-v2-small-en, jina-embeddings-v2-base-en, jina-embeddings-v2-base-zh, and jina-embeddings-v2-base-code.

Model inputs and outputs

Inputs

  • Text sequences up to 8192 tokens in length, supporting mixed German-English input.

Outputs

  • A 768-dimensional embedding vector representing the semantic meaning of the input text.

Capabilities

The jina-embeddings-v2-base-de model is designed for high performance in both monolingual and cross-lingual applications. It has been trained to handle mixed German-English input without bias, making it useful for applications involving multiple languages.

What can I use it for?

The jina-embeddings-v2-base-de model can be used for a variety of NLP tasks, such as:

  • Long document retrieval
  • Semantic textual similarity
  • Text reranking
  • Recommendation systems
  • RAG (Retrieval-Augmented Generation) and LLM-based generative search

According to a recent blog post from LlamaIndex, the combination of Jina AI's base embeddings with the CohereRerank/bge-reranker-large reranker stands out for achieving peak performance in both hit rate and MRR for RAG applications.

Things to try

When using the jina-embeddings-v2-base-de model, it's important to apply mean pooling to the token embeddings to produce high-quality sentence-level embeddings. Jina AI provides an encode function that handles this automatically, but you can also implement mean pooling manually if needed.
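Since the card mentions implementing mean pooling manually, here is a sketch of what that looks like when you run the transformer directly instead of using the provided encode() helper. The pooling function itself is standard; the only assumption is that the model's forward pass returns the usual last_hidden_state.

```python
# Manual mean pooling over token embeddings, masking out padding positions.
import torch
from transformers import AutoModel, AutoTokenizer

name = "jinaai/jina-embeddings-v2-base-de"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding tokens, then average over the real tokens only.
    mask = attention_mask.unsqueeze(-1).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

batch = tokenizer(
    ["Wie ist das Wetter heute?", "How is the weather today?"],
    padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    output = model(**batch)

embeddings = mean_pool(output.last_hidden_state, batch["attention_mask"])
print(embeddings.shape)  # (2, 768)
```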


text2vec-base-chinese

Maintainer: shibing624

Total Score: 584

text2vec-base-chinese is a CoSENT (Cosine Sentence) model developed by shibing624. It maps sentences to a 768-dimensional dense vector space and can be used for tasks like sentence embeddings, text matching, or semantic search. The model is based on the hfl/chinese-macbert-base pre-trained language model. Similar models include text2vec-base-chinese-sentence and text2vec-base-chinese-paraphrase, which are also CoSENT models developed by shibing624 with different training datasets and performance characteristics.

Model inputs and outputs

Inputs

  • Text input, up to 256 word pieces.

Outputs

  • A 768-dimensional dense vector representation of the input text.

Capabilities

The text2vec-base-chinese model can generate high-quality sentence embeddings that capture the semantic meaning of the input text. These embeddings can be useful for a variety of natural language processing tasks, such as:

  • Text matching and retrieval: Finding similar texts based on their vector representations.
  • Semantic search: Retrieving relevant documents or passages based on query embeddings.
  • Text clustering: Grouping similar texts together based on their vector representations.

The model has shown strong performance on various Chinese text matching benchmarks, including the ATEC, BQ, LCQMC, PAWSX, STS-B, SOHU-dd, and SOHU-dc datasets.

What can I use it for?

The text2vec-base-chinese model can be used in a wide range of applications that require understanding the semantic meaning of Chinese text, such as:

  • Chatbots and virtual assistants: Understanding user queries and providing relevant responses.
  • Recommendation systems: Improving product or content recommendations by leveraging the semantic similarity between items.
  • Question answering systems: Matching user questions to the most relevant passages or answers.
  • Document retrieval and search: Enhancing search capabilities by understanding the meaning of queries and documents.

By starting from the model's pretrained weights, you can easily fine-tune it on your specific task or dataset to achieve better performance.

Things to try

One interesting aspect of the text2vec-base-chinese model is its ability to capture paraphrases and semantic similarities between sentences. You could try using the model to identify duplicate or similar questions in a question-answering system, or to cluster related documents in a search engine.

Another interesting use case could be to leverage the model's sentence embeddings for cross-lingual tasks, such as finding translations or parallel sentences between Chinese and other languages. The model's performance on the PAWSX cross-lingual sentence similarity task suggests it could be useful for these types of applications.

Overall, the text2vec-base-chinese model provides a strong foundation for working with Chinese text data and can be a valuable tool in a wide range of natural language processing projects.
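As an illustration, here is a minimal sketch of encoding near-paraphrase sentences with the sentence-transformers library, which can load the model directly from the Hugging Face Hub; the example pair is an arbitrary placeholder.

```python
# Encode two near-paraphrase Chinese sentences and compare them by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("shibing624/text2vec-base-chinese")

sentences = ["如何更换花呗绑定银行卡", "花呗更改绑定银行卡"]
embeddings = model.encode(sentences)  # shape: (2, 768)

# A score close to 1.0 suggests the sentences are paraphrases.
print(util.cos_sim(embeddings[0], embeddings[1]))
```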
