bge-large-zh

Maintainer: BAAI

Total Score: 290

Last updated: 5/27/2024


  • Model Link: View on HuggingFace
  • API Spec: View on HuggingFace
  • Github Link: No Github link provided
  • Paper Link: No paper link provided


Model overview

The bge-large-zh model is a state-of-the-art text embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI). It is part of the BAAI General Embedding (BGE) family of models, which have achieved top performance on both the MTEB and C-MTEB benchmarks. The bge-large-zh model is specifically designed for Chinese text processing, and it can map any Chinese text into a low-dimensional dense vector that can be used for tasks like retrieval, classification, clustering, or semantic search.

Compared to similar models like BAAI/bge-large-en and BAAI/bge-small-en, the bge-large-zh model has been optimized for Chinese text and has demonstrated state-of-the-art performance on Chinese benchmarks. The BAAI/llm-embedder model is a more recent addition to the BAAI family, serving as a unified embedding model to support diverse retrieval augmentation needs for large language models (LLMs).

Model inputs and outputs

Inputs

  • Text: The bge-large-zh model can take any Chinese text as input, ranging from short queries to long passages.
  • Instruction (optional): For retrieval tasks that use short queries to find long related documents, it is recommended to add an instruction to the query to help the model better understand the intent. The instruction should be placed at the beginning of the query text. No instruction is needed for the passage/document text.
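As a sketch, this convention amounts to prefixing queries — and only queries — with the instruction string. The Chinese instruction below is the one commonly shown in BGE usage examples, but verify it against the model card for your version:

```python
# Sketch: add a retrieval instruction to queries but not to passages.
# The instruction string is illustrative; confirm it on the model card.
INSTRUCTION = "为这个句子生成表示以用于检索相关文章："

def prepare_inputs(queries, passages):
    """Prefix each query with the instruction; leave passages untouched."""
    prefixed_queries = [INSTRUCTION + q for q in queries]
    return prefixed_queries, passages

queries, passages = prepare_inputs(
    ["什么是向量检索？"],
    ["向量检索是将文本映射为向量再按相似度查找的技术。"],
)
```

Both lists can then be passed to the model's encoding functions; only the retrieval query benefits from the prefix.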

Outputs

  • Embeddings: The primary output of the bge-large-zh model is a dense vector embedding of the input text. These embeddings can be used for a variety of downstream tasks, such as:
    • Retrieval: The embeddings can be used to find related passages or documents by computing the similarity between the query embedding and the passage/document embeddings.
    • Classification: The embeddings can be used as features for training classification models.
    • Clustering: The embeddings can be used to group similar text together.
    • Semantic search: The embeddings can be used to find semantically related text.

Capabilities

The bge-large-zh model demonstrates state-of-the-art performance on a range of Chinese text processing tasks. On the Chinese Massive Text Embedding Benchmark (C-MTEB), the bge-large-zh-v1.5 model ranked first overall, showing strong results across tasks like retrieval, semantic similarity, and classification.

Additionally, the bge-large-zh model has been designed to handle long input text, with a maximum sequence length of 512 tokens. This makes it well-suited for tasks that involve processing lengthy passages or documents, such as research paper retrieval or legal document search.
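For documents that exceed the 512-token window, a common workaround is to embed overlapping chunks and index each chunk separately. The character-based splitter below is only a sketch; a real pipeline would count tokens with the model's tokenizer:

```python
def chunk_text(text, max_len=400, overlap=50):
    """Split text into overlapping chunks.

    Character-based for illustration only; token counts from the
    model's tokenizer should drive the real limits.
    """
    if len(text) <= max_len:
        return [text]
    chunks = []
    step = max_len - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break
    return chunks

text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text, max_len=400, overlap=50)
```

Each chunk is embedded on its own, and at query time the best-scoring chunk stands in for its parent document.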

What can I use it for?

The bge-large-zh model can be used for a variety of Chinese text processing tasks, including:

  • Retrieval: Use the model to find relevant passages or documents given a query. This can be helpful for building search engines, Q&A systems, or knowledge management tools.
  • Classification: Use the model's embeddings as features to train classification models for tasks like sentiment analysis, topic classification, or intent detection.
  • Clustering: Group similar Chinese text together using the model's embeddings, which can be useful for organizing large collections of documents or categorizing user-generated content.
  • Semantic search: Find semantically related text by computing the similarity between the model's embeddings, enabling more advanced search experiences.
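The clustering use case can be sketched with a simple greedy threshold scheme over toy vectors — an illustrative algorithm, not one that any BGE tooling ships:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def greedy_cluster(embeddings, threshold=0.8):
    """Assign each vector to the first cluster whose first member is
    similar enough; otherwise start a new cluster."""
    clusters = []  # each cluster is a list of indices into `embeddings`
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(embeddings[cluster[0]], emb) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Two near-parallel toy vectors group together; the orthogonal one stands alone.
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
clusters = greedy_cluster(embs)
```

In practice you would feed real bge-large-zh embeddings into a library clustering routine (k-means, HDBSCAN, etc.); the point here is only that proximity in embedding space drives the grouping.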

Things to try

One interesting aspect of the bge-large-zh model is its ability to handle queries with or without instruction. While adding an instruction to the query can improve retrieval performance, the model's v1.5 version has been enhanced to perform well even without the instruction. This makes it more convenient to use in certain applications, as you don't need to worry about crafting the perfect query instruction.

Another thing to try is fine-tuning the bge-large-zh model on your own data. The provided examples show how you can prepare data and fine-tune the model to improve its performance on your specific use case. This can be particularly helpful if you have domain-specific text that the pre-trained model doesn't handle as well.
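The FlagEmbedding fine-tuning examples use JSON-lines training data pairing a query with positive and negative passages. A record might look like the following — the texts are invented, and the exact schema should be checked against the FlagEmbedding documentation for your version:

```python
import json

# One training record in the query / positive / negative format commonly
# used for fine-tuning embedding models. Field names and texts here are
# illustrative; verify the schema in the FlagEmbedding docs.
record = {
    "query": "如何申请退款？",
    "pos": ["退款申请流程：登录账户后在订单页面提交退款请求。"],
    "neg": ["公司年度财报显示营收同比增长。"],
}
jsonl_line = json.dumps(record, ensure_ascii=False)
```

One such line per training example, written to a `.jsonl` file, is then passed to the fine-tuning script.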


Related Models


bge-large-en

Maintainer: BAAI

Total Score: 181

The bge-large-en model is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence). It is part of the BAAI General Embedding (BGE) family of models, which can map text to low-dimensional dense vectors for tasks like retrieval, classification, and semantic search. The maintainers recommend using the newer BAAI/bge-large-en-v1.5 model, which has a more reasonable similarity distribution and the same usage method.

Model inputs and outputs

Inputs

  • Text sequences of up to 512 tokens

Outputs

  • 1024-dimensional dense vector embeddings

Capabilities

The bge-large-en model can generate high-quality text embeddings that capture semantic meaning. These embeddings can be used for a variety of downstream tasks, such as:

  • Retrieval: Finding relevant documents or passages given a query
  • Classification: Classifying text into predefined categories
  • Clustering: Grouping similar text documents together
  • Semantic search: Searching for relevant content based on meaning, not just keywords

What can I use it for?

The bge-large-en embeddings can be leveraged in various applications that require understanding the semantic meaning of text. For example, you could use them to build a powerful search engine that returns relevant results based on the query's intent, rather than just matching keywords.

Another potential use case is intelligent document retrieval and recommendation, where the model can surface the most relevant information to users based on their needs. This could be especially useful in enterprise settings or academic research, where users need to quickly find relevant information among large document collections.

Things to try

One interesting experiment would be to fine-tune the bge-large-en model on a specific domain or task, such as legal document retrieval or scientific paper recommendation. This could help the model better capture the nuances and specialized vocabulary of your particular use case.
You could also explore using the bge-large-en embeddings in combination with other techniques, such as sparse lexical matching or multi-vector retrieval, to create a hybrid search system that leverages the strengths of different approaches.
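One way to sketch such a hybrid: combine a dense similarity score with a lexical-overlap score via a weighted sum. The lexical score and the weighting below are illustrative choices, not part of any BGE release:

```python
def lexical_score(query_terms, doc_terms):
    """Toy sparse score: fraction of query terms appearing in the document."""
    if not query_terms:
        return 0.0
    return sum(t in doc_terms for t in query_terms) / len(query_terms)

def hybrid_score(dense_sim, lex_sim, alpha=0.7):
    """Weighted combination of dense and lexical scores.

    alpha is a tuning knob to balance the two signals, not a
    recommended value.
    """
    return alpha * dense_sim + (1 - alpha) * lex_sim

# A document that is semantically close (dense 0.9) and matches half
# the query terms lexically.
score = hybrid_score(0.9, lexical_score({"refund", "policy"}, {"refund", "terms"}))
```

Production systems typically use BM25 or a learned sparse model for the lexical side, but the combination principle is the same.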



bge-base-en

Maintainer: BAAI

Total Score: 53

The bge-base-en is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) that can map any text to a low-dimensional dense vector. It is part of the BAAI/bge-base-en model series, which also includes larger and smaller versions such as BAAI/bge-large-en and BAAI/bge-small-en. These models were trained using contrastive learning on a massive text corpus and demonstrate state-of-the-art performance on text embedding benchmarks like MTEB and C-MTEB. The bge-base-en model is a base-scale version that achieves similar performance to the larger bge-large-en model, making it a good option for applications with limited compute resources. The maintainers recommend using the newest v1.5 versions of all models in the BAAI/bge series, which have an improved similarity distribution compared to earlier versions.

Model inputs and outputs

Inputs

  • Text: Any text, such as a sentence, paragraph, or document.
  • Instruction (optional): For text retrieval tasks, the input text can optionally be prefixed with an instruction to improve performance, such as "Represent this sentence for searching relevant passages:".

Outputs

  • Embedding vector: A fixed-size dense vector representation of the input text, which can be used for downstream tasks like retrieval, classification, clustering, or semantic search.

Capabilities

The bge-base-en model is a powerful text embedding model that can capture the semantic meaning of input text in a compact vector representation. It has been shown to excel at a variety of NLP tasks, achieving top performance on the MTEB and C-MTEB benchmarks. Some key capabilities of the model include:

  • Retrieval: The embedding vectors can be used to efficiently search large text corpora to find relevant documents or passages for a given query.
  • Classification: The embeddings can be leveraged as features for training classifiers on text data.
  • Clustering: The vector representations allow for effective grouping of similar text items.
  • Semantic search: The model can identify semantically related texts based on the proximity of their embedding vectors.

What can I use it for?

The bge-base-en model is a highly versatile tool that can be applied to a wide range of NLP applications. Some potential use cases include:

  • Intelligent search: Integrating the model into search engines or knowledge bases to enable more accurate and semantically-aware retrieval of information.
  • Recommender systems: Using the text embeddings to identify related content or products for recommendation.
  • Content analysis: Leveraging the model's ability to capture semantic meaning for tasks like topic modeling, sentiment analysis, or text summarization.
  • Multimodal applications: Combining the text embeddings with visual or audio representations for applications like image/video captioning or multimedia search.

Things to try

One interesting aspect of the bge-base-en model is its ability to generate high-quality embeddings without requiring an instruction prefix, while still maintaining strong retrieval performance. This makes the model convenient to use in many scenarios where adding an instruction may not be practical.

Another thing to explore is fine-tuning the model on your own data using the provided examples. By incorporating domain-specific knowledge, you can further improve the model's performance on tasks relevant to your application. The FlagEmbedding library provides guidance on how to effectively fine-tune the bge models.

Finally, you can experiment with using the bge-base-en model in combination with the larger bge-large-en model or the bge-reranker models to further enhance retrieval performance. The reranker models can be used to re-rank the top results from the embedding model, providing a more accurate relevance score.
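The retrieve-then-rerank pipeline mentioned above can be sketched generically — the scoring functions below are toy stand-ins for the embedding model and a bge-reranker cross-encoder:

```python
def retrieve_then_rerank(query, docs, embed_score, rerank_score, k=3):
    """Stage 1: rank all docs by the cheap embedding score, keep top-k.
    Stage 2: re-order those k candidates with the expensive reranker score."""
    candidates = sorted(docs, key=lambda d: embed_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)

# Toy scoring functions based on word overlap, standing in for real
# bge embedding similarity and a bge-reranker cross-encoder.
docs = ["a b", "a b c", "x y", "a c"]
embed = lambda q, d: len(set(q.split()) & set(d.split()))
rerank = lambda q, d: len(set(q.split()) & set(d.split())) / len(d.split())
result = retrieve_then_rerank("a b c", docs, embed, rerank, k=2)
```

The split pays off because the embedding score is computed once per document offline, while the heavier cross-encoder only sees the handful of candidates that survive stage 1.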



bge-small-en

Maintainer: BAAI

Total Score: 65

The bge-small-en model is a small-scale English text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) as part of their FlagEmbedding project. It is one of several BGE (BAAI General Embedding) models that achieve state-of-the-art performance on text embedding benchmarks like MTEB and C-MTEB. The bge-small-en model is a smaller version of the BAAI/bge-large-en-v1.5 and BAAI/bge-base-en-v1.5 models, with 384 embedding dimensions compared to 1024 and 768 respectively. Despite its smaller size, the bge-small-en model still provides competitive performance, making it a good choice when computation resources are limited.

Model inputs and outputs

Inputs

  • Text sentences: The model takes a list of text sentences as input.

Outputs

  • Sentence embeddings: The model outputs a numpy array of sentence embeddings, where each row corresponds to the embedding of the corresponding input sentence.

Capabilities

The bge-small-en model can be used for a variety of natural language processing tasks that benefit from semantic text representations, such as:

  • Information retrieval: The embeddings can be used to find relevant passages or documents for a given query, by computing similarity scores between the query and the passages/documents.
  • Text classification: The embeddings can be used as features for training classification models on text data.
  • Clustering: The embeddings can be used to group similar text documents into clusters.
  • Semantic search: The embeddings can be used to find semantically similar text based on meaning, rather than just lexical matching.

What can I use it for?

The bge-small-en model can be a useful tool for a variety of applications that involve working with English text data. For example, you could use it to build a semantic search engine for your company's knowledge base, or to improve the text classification capabilities of your customer support chatbot.

Since the model is smaller and more efficient than the larger bge models, it may be particularly well-suited for deployment on edge devices or in resource-constrained environments. You could also fine-tune the model on your specific text data to further improve its performance for your use case.

Things to try

One interesting thing to try with the bge-small-en model is to compare its performance to the larger bge models, such as BAAI/bge-large-en-v1.5 and BAAI/bge-base-en-v1.5, on your specific tasks. You may find that the smaller model provides nearly the same performance as the larger models, while being more efficient and easier to deploy.

Another thing to try is to fine-tune the bge-small-en model on your own text data, using the techniques described in the FlagEmbedding documentation. This can help the model better capture the semantics of your domain-specific text, potentially leading to improved performance on your tasks.


llm-embedder

Maintainer: BAAI

Total Score: 92

llm-embedder is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) that can map any text to a low-dimensional dense vector. This can be used for tasks like retrieval, classification, clustering, and semantic search. It is part of the FlagEmbedding project, which also includes other models such as bge-reranker-base and bge-reranker-large, as well as the related bge embedding models in multiple sizes: bge-large-en-v1.5, bge-base-en-v1.5, and bge-small-en-v1.5. The v1.5 models have been optimized to have more reasonable similarity distributions and enhanced retrieval abilities compared to earlier versions.

Model inputs and outputs

Inputs

  • Text to be embedded

Outputs

  • A low-dimensional dense vector representation of the input text

Capabilities

The llm-embedder model can generate high-quality embeddings that capture the semantic meaning of text. These embeddings can then be used in a variety of downstream applications, such as:

  • Information retrieval: Finding relevant documents or passages for a given query
  • Text classification: Categorizing text into different classes or topics
  • Clustering: Grouping similar text together
  • Semantic search: Finding text that is semantically similar to a given query

The model has been shown to achieve state-of-the-art performance on benchmarks like MTEB and C-MTEB.

What can I use it for?

The llm-embedder model can be useful in a wide range of applications that require understanding the semantic content of text, such as:

  • Building search engines or recommendation systems that can retrieve relevant information based on user queries
  • Developing chatbots or virtual assistants that can engage in more natural conversations by understanding the context and meaning of user inputs
  • Improving the accuracy of text classification models for tasks like sentiment analysis, topic modeling, or spam detection
  • Powering knowledge management systems that can organize and retrieve information based on the conceptual relationships between documents

Additionally, the model can be fine-tuned on domain-specific data to improve its performance for specific use cases.

Things to try

One interesting aspect of the llm-embedder model is its support for retrieval augmentation for large language models (LLMs). The LLM-Embedder variant of the model is designed to provide a unified embedding solution to support diverse retrieval needs for LLMs.

Another interesting direction to explore is the use of the bge-reranker-base and bge-reranker-large models, which are cross-encoder models that can be used to re-rank the top-k documents retrieved by the embedding model. This can help improve the overall accuracy of the retrieval system.
