bge-large-zh

Maintainer: BAAI

Total Score: 290

Last updated: 5/27/2024


  • Model Link: View on HuggingFace
  • API Spec: View on HuggingFace
  • Github Link: No Github link provided
  • Paper Link: No paper link provided


Model overview

The bge-large-zh model is a state-of-the-art text embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI). It is part of the BAAI General Embedding (BGE) family of models, which have achieved top performance on both the MTEB and C-MTEB benchmarks. The bge-large-zh model is specifically designed for Chinese text processing, and it can map any Chinese text into a low-dimensional dense vector that can be used for tasks like retrieval, classification, clustering, or semantic search.

Compared to similar models like BAAI/bge-large-en and BAAI/bge-small-en, the bge-large-zh model has been optimized for Chinese text and has demonstrated state-of-the-art performance on Chinese benchmarks. The BAAI/llm-embedder model is a more recent addition to the BAAI family, serving as a unified embedding model to support diverse retrieval augmentation needs for large language models (LLMs).

Model inputs and outputs

Inputs

  • Text: The bge-large-zh model can take any Chinese text as input, ranging from short queries to long passages.
  • Instruction (optional): For retrieval tasks that use short queries to find long related documents, it is recommended to add an instruction to the query to help the model better understand the intent. The instruction should be placed at the beginning of the query text. No instruction is needed for the passage/document text.
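As a sketch, this convention amounts to prefixing queries — and only queries — with the instruction string. The Chinese instruction below is the one commonly shown in BGE usage examples, but verify it against the model card for your version:

```python
# Sketch: add a retrieval instruction to queries but not to passages.
# The instruction string is illustrative; confirm it on the model card.
INSTRUCTION = "为这个句子生成表示以用于检索相关文章："

def prepare_inputs(queries, passages):
    """Prefix each query with the instruction; leave passages untouched."""
    prefixed_queries = [INSTRUCTION + q for q in queries]
    return prefixed_queries, passages

queries, passages = prepare_inputs(
    ["什么是向量检索？"],
    ["向量检索是将文本映射为向量再按相似度查找的技术。"],
)
```

Both lists can then be passed to the model's encoding functions; only the retrieval query benefits from the prefix.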

Outputs

  • Embeddings: The primary output of the bge-large-zh model is a dense vector embedding of the input text. These embeddings can be used for a variety of downstream tasks, such as:
    • Retrieval: The embeddings can be used to find related passages or documents by computing the similarity between the query embedding and the passage/document embeddings.
    • Classification: The embeddings can be used as features for training classification models.
    • Clustering: The embeddings can be used to group similar text together.
    • Semantic search: The embeddings can be used to find semantically related text.

Capabilities

The bge-large-zh model demonstrates state-of-the-art performance on a range of Chinese text processing tasks. On the Chinese Massive Text Embedding Benchmark (C-MTEB), the bge-large-zh-v1.5 model ranked first overall, showing strong results across tasks like retrieval, semantic similarity, and classification.

Additionally, the bge-large-zh model has been designed to handle long input text, with a maximum sequence length of 512 tokens. This makes it well-suited for tasks that involve processing lengthy passages or documents, such as research paper retrieval or legal document search.
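For documents that exceed the 512-token window, a common workaround is to embed overlapping chunks and index each chunk separately. The character-based splitter below is only a sketch; a real pipeline would count tokens with the model's tokenizer:

```python
def chunk_text(text, max_len=400, overlap=50):
    """Split text into overlapping chunks.

    Character-based for illustration only; token counts from the
    model's tokenizer should drive the real limits.
    """
    if len(text) <= max_len:
        return [text]
    chunks = []
    step = max_len - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break
    return chunks

text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text, max_len=400, overlap=50)
```

Each chunk is embedded on its own, and at query time the best-scoring chunk stands in for its parent document.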

What can I use it for?

The bge-large-zh model can be used for a variety of Chinese text processing tasks, including:

  • Retrieval: Use the model to find relevant passages or documents given a query. This can be helpful for building search engines, Q&A systems, or knowledge management tools.
  • Classification: Use the model's embeddings as features to train classification models for tasks like sentiment analysis, topic classification, or intent detection.
  • Clustering: Group similar Chinese text together using the model's embeddings, which can be useful for organizing large collections of documents or categorizing user-generated content.
  • Semantic search: Find semantically related text by computing the similarity between the model's embeddings, enabling more advanced search experiences.
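The clustering use case can be sketched with a simple greedy threshold scheme over toy vectors — an illustrative algorithm, not one that any BGE tooling ships:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def greedy_cluster(embeddings, threshold=0.8):
    """Assign each vector to the first cluster whose first member is
    similar enough; otherwise start a new cluster."""
    clusters = []  # each cluster is a list of indices into `embeddings`
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(embeddings[cluster[0]], emb) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Two near-parallel toy vectors group together; the orthogonal one stands alone.
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
clusters = greedy_cluster(embs)
```

In practice you would feed real bge-large-zh embeddings into a library clustering routine (k-means, HDBSCAN, etc.); the point here is only that proximity in embedding space drives the grouping.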

Things to try

One interesting aspect of the bge-large-zh model is its ability to handle queries with or without instruction. While adding an instruction to the query can improve retrieval performance, the model's v1.5 version has been enhanced to perform well even without the instruction. This makes it more convenient to use in certain applications, as you don't need to worry about crafting the perfect query instruction.

Another thing to try is fine-tuning the bge-large-zh model on your own data. The provided examples show how you can prepare data and fine-tune the model to improve its performance on your specific use case. This can be particularly helpful if you have domain-specific text that the pre-trained model doesn't handle as well.
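The FlagEmbedding fine-tuning examples use JSON-lines training data pairing a query with positive and negative passages. A record might look like the following — the texts are invented, and the exact schema should be checked against the FlagEmbedding documentation for your version:

```python
import json

# One training record in the query / positive / negative format commonly
# used for fine-tuning embedding models. Field names and texts here are
# illustrative; verify the schema in the FlagEmbedding docs.
record = {
    "query": "如何申请退款？",
    "pos": ["退款申请流程：登录账户后在订单页面提交退款请求。"],
    "neg": ["公司年度财报显示营收同比增长。"],
}
jsonl_line = json.dumps(record, ensure_ascii=False)
```

One such line per training example, written to a `.jsonl` file, is then passed to the fine-tuning script.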


Related Models


bge-large-en

Maintainer: BAAI

Total Score: 181

The bge-large-en model is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence). It is part of the BAAI General Embedding (BGE) family of models, which can map text to low-dimensional dense vectors for tasks like retrieval, classification, and semantic search. The maintainers recommend using the newer BAAI/bge-large-en-v1.5 model, which has a more reasonable similarity distribution and the same usage method.

Model inputs and outputs

Inputs

  • Text sequences of up to 512 tokens

Outputs

  • 1024-dimensional dense vector embeddings

Capabilities

The bge-large-en model can generate high-quality text embeddings that capture semantic meaning. These embeddings can be used for a variety of downstream tasks, such as:

  • Retrieval: Finding relevant documents or passages given a query
  • Classification: Classifying text into predefined categories
  • Clustering: Grouping similar text documents together
  • Semantic search: Searching for relevant content based on meaning, not just keywords

What can I use it for?

The bge-large-en embeddings can be leveraged in various applications that require understanding the semantic meaning of text. For example, you could use them to build a powerful search engine that returns relevant results based on the query's intent, rather than just matching keywords.

Another potential use case is intelligent document retrieval and recommendation, where the model can surface the most relevant information to users based on their needs. This could be especially useful in enterprise settings or academic research, where users need to quickly find relevant information among large document collections.

Things to try

One interesting experiment would be to fine-tune the bge-large-en model on a specific domain or task, such as legal document retrieval or scientific paper recommendation. This could help the model better capture the nuances and specialized vocabulary of your particular use case.
You could also explore using the bge-large-en embeddings in combination with other techniques, such as sparse lexical matching or multi-vector retrieval, to create a hybrid search system that leverages the strengths of different approaches.
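One way to sketch such a hybrid: combine a dense similarity score with a lexical-overlap score via a weighted sum. The lexical score and the weighting below are illustrative choices, not part of any BGE release:

```python
def lexical_score(query_terms, doc_terms):
    """Toy sparse score: fraction of query terms appearing in the document."""
    if not query_terms:
        return 0.0
    return sum(t in doc_terms for t in query_terms) / len(query_terms)

def hybrid_score(dense_sim, lex_sim, alpha=0.7):
    """Weighted combination of dense and lexical scores.

    alpha is a tuning knob to balance the two signals, not a
    recommended value.
    """
    return alpha * dense_sim + (1 - alpha) * lex_sim

# A document that is semantically close (dense 0.9) and matches half
# the query terms lexically.
score = hybrid_score(0.9, lexical_score({"refund", "policy"}, {"refund", "terms"}))
```

Production systems typically use BM25 or a learned sparse model for the lexical side, but the combination principle is the same.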



bge-base-en

Maintainer: BAAI

Total Score: 53

The bge-base-en is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) that can map any text to a low-dimensional dense vector. It is part of the BAAI/bge-base-en model series, which also includes larger and smaller versions such as BAAI/bge-large-en and BAAI/bge-small-en. These models were trained using contrastive learning on a massive text corpus and demonstrate state-of-the-art performance on text embedding benchmarks like MTEB and C-MTEB. The bge-base-en model is a base-scale version that achieves similar performance to the larger bge-large-en model, making it a good option for applications with limited compute resources. The maintainers recommend using the newest v1.5 versions of all models in the BAAI/bge series, which have an improved similarity distribution compared to earlier versions.

Model inputs and outputs

Inputs

  • Text: Any text, such as a sentence, paragraph, or document.
  • Instruction (optional): For text retrieval tasks, the input text can optionally be prefixed with an instruction to improve performance, such as "Represent this sentence for searching relevant passages:".

Outputs

  • Embedding vector: A fixed-size dense vector representation of the input text, which can be used for downstream tasks like retrieval, classification, clustering, or semantic search.

Capabilities

The bge-base-en model is a powerful text embedding model that can capture the semantic meaning of input text in a compact vector representation. It has been shown to excel at a variety of NLP tasks, achieving top performance on the MTEB and C-MTEB benchmarks. Some key capabilities of the model include:

  • Retrieval: The embedding vectors can be used to efficiently search large text corpora to find relevant documents or passages for a given query.
  • Classification: The embeddings can be leveraged as features for training classifiers on text data.
  • Clustering: The vector representations allow for effective grouping of similar text items.
  • Semantic search: The model can identify semantically related texts based on the proximity of their embedding vectors.

What can I use it for?

The bge-base-en model is a highly versatile tool that can be applied to a wide range of NLP applications. Some potential use cases include:

  • Intelligent search: Integrating the model into search engines or knowledge bases to enable more accurate and semantically-aware retrieval of information.
  • Recommender systems: Using the text embeddings to identify related content or products for recommendation.
  • Content analysis: Leveraging the model's ability to capture semantic meaning for tasks like topic modeling, sentiment analysis, or text summarization.
  • Multimodal applications: Combining the text embeddings with visual or audio representations for applications like image/video captioning or multimedia search.

Things to try

One interesting aspect of the bge-base-en model is its ability to generate high-quality embeddings without requiring an instruction prefix, while still maintaining strong retrieval performance. This makes the model convenient to use in many scenarios where adding an instruction may not be practical.

Another thing to explore is fine-tuning the model on your own data using the provided examples. By incorporating domain-specific knowledge, you can further improve the model's performance on tasks relevant to your application. The FlagEmbedding library provides guidance on how to effectively fine-tune the bge models.

Finally, you can experiment with using the bge-base-en model in combination with the larger bge-large-en model or the bge-reranker models to further enhance retrieval performance. The reranker models can be used to re-rank the top results from the embedding model, providing a more accurate relevance score.
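The retrieve-then-rerank pipeline mentioned above can be sketched generically — the scoring functions below are toy stand-ins for the embedding model and a bge-reranker cross-encoder:

```python
def retrieve_then_rerank(query, docs, embed_score, rerank_score, k=3):
    """Stage 1: rank all docs by the cheap embedding score, keep top-k.
    Stage 2: re-order those k candidates with the expensive reranker score."""
    candidates = sorted(docs, key=lambda d: embed_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)

# Toy scoring functions based on word overlap, standing in for real
# bge embedding similarity and a bge-reranker cross-encoder.
docs = ["a b", "a b c", "x y", "a c"]
embed = lambda q, d: len(set(q.split()) & set(d.split()))
rerank = lambda q, d: len(set(q.split()) & set(d.split())) / len(d.split())
result = retrieve_then_rerank("a b c", docs, embed, rerank, k=2)
```

The split pays off because the embedding score is computed once per document offline, while the heavier cross-encoder only sees the handful of candidates that survive stage 1.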



bge-small-en

Maintainer: BAAI

Total Score: 65

The bge-small-en model is a small-scale English text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) as part of their FlagEmbedding project. It is one of several BGE (BAAI General Embedding) models that achieve state-of-the-art performance on text embedding benchmarks like MTEB and C-MTEB. The bge-small-en model is a smaller version of the BAAI/bge-large-en-v1.5 and BAAI/bge-base-en-v1.5 models, with 384 embedding dimensions compared to 1024 and 768 respectively. Despite its smaller size, the bge-small-en model still provides competitive performance, making it a good choice when computation resources are limited.

Model inputs and outputs

Inputs

  • Text sentences: The model takes a list of text sentences as input.

Outputs

  • Sentence embeddings: The model outputs a numpy array of sentence embeddings, where each row corresponds to the embedding of the corresponding input sentence.

Capabilities

The bge-small-en model can be used for a variety of natural language processing tasks that benefit from semantic text representations, such as:

  • Information retrieval: The embeddings can be used to find relevant passages or documents for a given query, by computing similarity scores between the query and the passages/documents.
  • Text classification: The embeddings can be used as features for training classification models on text data.
  • Clustering: The embeddings can be used to group similar text documents into clusters.
  • Semantic search: The embeddings can be used to find semantically similar text based on meaning, rather than just lexical matching.

What can I use it for?

The bge-small-en model can be a useful tool for a variety of applications that involve working with English text data. For example, you could use it to build a semantic search engine for your company's knowledge base, or to improve the text classification capabilities of your customer support chatbot.

Since the model is smaller and more efficient than the larger bge models, it may be particularly well-suited for deployment on edge devices or in resource-constrained environments. You could also fine-tune the model on your specific text data to further improve its performance for your use case.

Things to try

One interesting thing to try with the bge-small-en model is to compare its performance to the larger bge models, such as BAAI/bge-large-en-v1.5 and BAAI/bge-base-en-v1.5, on your specific tasks. You may find that the smaller model provides nearly the same performance as the larger models, while being more efficient and easier to deploy.

Another thing to try is to fine-tune the bge-small-en model on your own text data, using the techniques described in the FlagEmbedding documentation. This can help the model better capture the semantics of your domain-specific text, potentially leading to improved performance on your tasks.


llm-embedder

Maintainer: BAAI

Total Score: 92

llm-embedder is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) that can map any text to a low-dimensional dense vector. This can be used for tasks like retrieval, classification, clustering, and semantic search. It is part of the FlagEmbedding project, which also includes other models such as bge-reranker-base and bge-reranker-large, as well as the related bge embedding models in multiple sizes: bge-large-en-v1.5, bge-base-en-v1.5, and bge-small-en-v1.5. The v1.5 models have been optimized to have more reasonable similarity distributions and enhanced retrieval abilities compared to earlier versions.

Model inputs and outputs

Inputs

  • Text to be embedded

Outputs

  • A low-dimensional dense vector representation of the input text

Capabilities

The llm-embedder model can generate high-quality embeddings that capture the semantic meaning of text. These embeddings can then be used in a variety of downstream applications, such as:

  • Information retrieval: Finding relevant documents or passages for a given query
  • Text classification: Categorizing text into different classes or topics
  • Clustering: Grouping similar text together
  • Semantic search: Finding text that is semantically similar to a given query

The model has been shown to achieve state-of-the-art performance on benchmarks like MTEB and C-MTEB.

What can I use it for?

The llm-embedder model can be useful in a wide range of applications that require understanding the semantic content of text, such as:

  • Building search engines or recommendation systems that can retrieve relevant information based on user queries
  • Developing chatbots or virtual assistants that can engage in more natural conversations by understanding the context and meaning of user inputs
  • Improving the accuracy of text classification models for tasks like sentiment analysis, topic modeling, or spam detection
  • Powering knowledge management systems that can organize and retrieve information based on the conceptual relationships between documents

Additionally, the model can be fine-tuned on domain-specific data to improve its performance for specific use cases.

Things to try

One interesting aspect of the llm-embedder model is its support for retrieval augmentation for large language models (LLMs). The LLM-Embedder variant of the model is designed to provide a unified embedding solution to support diverse retrieval needs for LLMs.

Another interesting direction to explore is the use of the bge-reranker-base and bge-reranker-large models, which are cross-encoder models that can be used to re-rank the top-k documents retrieved by the embedding model. This can help improve the overall accuracy of the retrieval system.
