Maidalun1020

Models by this creator


bce-embedding-base_v1

maidalun1020

Total Score: 203

**BCEmbedding: Bilingual and Crosslingual Embedding for RAG** (bce-embedding-base_v1). The latest updates can be checked on GitHub.

Key points:

- **Bilingual and crosslingual capability** in English and Chinese.
- **RAG adaptation for many domains**, including Education, Law, Finance, Medical, Literature, FAQ, Textbook, Wikipedia, etc.
- **Easy integrations** for langchain and llamaindex in BCEmbedding.
- **Instruction-free**: the EmbeddingModel needs no query instruction.

**Best practice:**

1. Get the top 50-100 passages with bce-embedding-base_v1 for "recall".
2. Rerank those passages with bce-reranker-base_v1 and keep the top 5-10 for "precision".
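As a minimal sketch, the snippet below strings together the `EmbeddingModel` and `RerankerModel` APIs from the Quick Start section further down. The corpus, query, and cut-offs are placeholders, and it assumes `encode` returns unit-normalized vectors (as the transformers snippet below makes explicit) and `compute_score` returns one score per pair:

```python
import numpy as np
from BCEmbedding import EmbeddingModel, RerankerModel

# toy corpus and query (placeholders); in practice the corpus is your document store
corpus = [f"passage {i}" for i in range(1000)]
query = "input_query"

embed_model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
reranker = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# stage 1 (recall): embeddings are unit-normalized, so a dot product
# is cosine similarity; keep the top 50-100 candidates
corpus_emb = np.asarray(embed_model.encode(corpus))
query_emb = np.asarray(embed_model.encode([query]))[0]
candidate_ids = np.argsort(-(corpus_emb @ query_emb))[:100]
candidates = [corpus[i] for i in candidate_ids]

# stage 2 (precision): cross-encoder scoring, keep the top 5-10
scores = reranker.compute_score([[query, p] for p in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])[:10]
```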
**News:**

- BCEmbedding Technical Blog
- Related link for RerankerModel: bce-reranker-base_v1

**Third-party Examples:**

- RAG applications: QAnything, HuixiangDou, ChatPDF.
- Efficient inference frameworks: ChatLLM.cpp, Xinference, mindnlp (Huawei GPU).

Bilingual and Crosslingual Embedding (BCEmbedding), developed by NetEase Youdao, encompasses EmbeddingModel and RerankerModel. The EmbeddingModel specializes in generating semantic vectors, playing a crucial role in semantic search and question answering, while the RerankerModel excels at refining search results and ranking tasks.

BCEmbedding serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably QAnything [github], an open-source project widely integrated in Youdao products such as Youdao Speed Reading and Youdao Translation.

Distinguished for its bilingual and crosslingual proficiency, BCEmbedding bridges the gap between Chinese and English, achieving:

- high performance on Semantic Representation Evaluations in MTEB;
- a new benchmark in the realm of RAG Evaluations in LlamaIndex.

## Bilingual and Crosslingual Superiority

Existing embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly between Chinese and English. BCEmbedding, leveraging the strength of Youdao's translation engine, delivers superior performance across monolingual, bilingual, and crosslingual settings. EmbeddingModel supports Chinese (ch) and English (en), with support for more languages coming soon, while RerankerModel supports Chinese (ch), English (en), Japanese (ja) and Korean (ko).

## Key Features

- **Bilingual and Crosslingual Proficiency**: powered by Youdao's translation engine, excelling in Chinese, English and crosslingual retrieval tasks, with upcoming support for additional languages.
- **RAG-Optimized**: tailored for diverse RAG tasks including translation, summarization, and question answering, ensuring accurate query understanding. See RAG Evaluations in LlamaIndex.
- **Efficient and Precise Retrieval**: a dual-encoder (EmbeddingModel) for efficient first-stage retrieval, and a cross-encoder (RerankerModel) for enhanced precision and deeper semantic analysis in the second stage.
- **Broad Domain Adaptability**: trained on diverse datasets for superior performance across various fields.
- **User-Friendly Design**: instruction-free; no query instruction needs to be specified for each task.
- **Meaningful Reranking Scores**: RerankerModel provides relevance scores that improve result quality and help optimize large language model performance.
- **Proven in Production**: successfully implemented and validated in Youdao's products.

## Latest Updates

- **2024-01-03: Model Releases** — bce-embedding-base_v1 and bce-reranker-base_v1 are available.
- **2024-01-03: Eval Datasets [CrosslingualMultiDomainsDataset]** — evaluate the performance of RAG, using LlamaIndex.
- **2024-01-03: Eval Datasets [Details]** — evaluate the performance of crosslingual semantic representation, using MTEB.

## Model List

| Model Name | Model Type | Languages | Parameters | Weights |
|---|---|---|---|---|
| bce-embedding-base_v1 | EmbeddingModel | ch, en | 279M | download |
| bce-reranker-base_v1 | RerankerModel | ch, en, ja, ko | 279M | download |

## Manual Installation

First, create a conda environment and activate it:

```bash
conda create --name bce python=3.10 -y
conda activate bce
```

Then install BCEmbedding (minimal installation):

```bash
pip install BCEmbedding==0.1.1
```

Or install from source:

```bash
git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .
```

## Quick Start

### 1. Based on BCEmbedding

Use EmbeddingModel (the cls pooler is the default):

```python
from BCEmbedding import EmbeddingModel

# list of sentences
sentences = ['sentence_0', 'sentence_1', ...]

# init embedding model
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences)
```

Use RerankerModel to calculate relevance scores and rerank:

```python
from BCEmbedding import RerankerModel

# your query and corresponding passages
query = 'input_query'
passages = ['passage_0', 'passage_1', ...]

# construct sentence pairs
sentence_pairs = [[query, passage] for passage in passages]

# init reranker model
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# method 0: calculate scores of sentence pairs
scores = model.compute_score(sentence_pairs)

# method 1: rerank passages
rerank_results = model.rerank(query, passages)
```

NOTE: In the `RerankerModel.rerank` method, we provide an advanced preprocessing step, used in production, for constructing `sentence_pairs` when the passages are very long.
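The exact preprocessing used inside `rerank` is internal to the library. Purely as an illustration of the general idea (not the library's actual method), an over-long passage can be scored by splitting it into overlapping windows and keeping the best window score; the window sizes here are arbitrary:

```python
def score_long_passage(reranker, query, passage, window=400, stride=300):
    """Illustrative only: window an over-long passage, score each window
    against the query with the cross-encoder, and keep the best score."""
    words = passage.split()
    chunks = [" ".join(words[i:i + window])
              for i in range(0, max(len(words), 1), stride)]
    scores = reranker.compute_score([[query, c] for c in chunks])
    # compute_score is used with a list of pairs above; guard in case a
    # single pair yields a bare float
    return max(scores) if hasattr(scores, "__len__") else scores
```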
### 2. Based on transformers

For EmbeddingModel:

```python
from transformers import AutoModel, AutoTokenizer

# list of sentences
sentences = ['sentence_0', 'sentence_1', ...]

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')

device = 'cuda'  # if no GPU, set "cpu"
model.to(device)

# get inputs
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# get embeddings
outputs = model(**inputs_on_device, return_dict=True)
embeddings = outputs.last_hidden_state[:, 0]  # cls pooler
embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)  # normalize
```

For RerankerModel:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')
model = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')

device = 'cuda'  # if no GPU, set "cpu"
model.to(device)

# get inputs
inputs = tokenizer(sentence_pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# calculate scores
scores = model(**inputs_on_device, return_dict=True).logits.view(-1,).float()
scores = torch.sigmoid(scores)
```

### 3. Based on sentence_transformers

For EmbeddingModel:

```python
from sentence_transformers import SentenceTransformer

# list of sentences
sentences = ['sentence_0', 'sentence_1', ...]

# init embedding model
# New update for sentence-transformers: clean up your
# "SENTENCE_TRANSFORMERS_HOME/maidalun1020_bce-embedding-base_v1" or
# "~/.cache/torch/sentence_transformers/maidalun1020_bce-embedding-base_v1"
# first, so the new version is downloaded.
model = SentenceTransformer("maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences, normalize_embeddings=True)
```

For RerankerModel:

```python
from sentence_transformers import CrossEncoder

# init reranker model
model = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)

# calculate scores of sentence pairs
scores = model.predict(sentence_pairs)
```

## Integrations for RAG Frameworks

### 1. Used in langchain

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

query = 'apples'
passages = [
    'I like apples',
    'I like oranges',
    'Apples and oranges are fruits'
]

# init embedding model
model_name = 'maidalun1020/bce-embedding-base_v1'
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'batch_size': 64, 'normalize_embeddings': True, 'show_progress_bar': False}

embed_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

# example #1. extract embeddings
query_embedding = embed_model.embed_query(query)
passages_embeddings = embed_model.embed_documents(passages)

# example #2. langchain retriever example
faiss_vectorstore = FAISS.from_texts(passages, embed_model, distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT)
retriever = faiss_vectorstore.as_retriever(search_type="similarity", search_kwargs={"score_threshold": 0.5, "k": 3})
related_passages = retriever.get_relevant_documents(query)
```
### 2. Used in llama_index

```python
import os

from llama_index.embeddings import HuggingFaceEmbedding
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.llms import OpenAI

query = 'apples'
passages = [
    'I like apples',
    'I like oranges',
    'Apples and oranges are fruits'
]

# init embedding model
model_args = {'model_name': 'maidalun1020/bce-embedding-base_v1', 'max_length': 512, 'embed_batch_size': 64, 'device': 'cuda'}
embed_model = HuggingFaceEmbedding(**model_args)

# example #1. extract embeddings
query_embedding = embed_model.get_query_embedding(query)
passages_embeddings = embed_model.get_text_embedding_batch(passages)

# example #2. rag example
llm = OpenAI(model='gpt-3.5-turbo-0613', api_key=os.environ.get('OPENAI_API_KEY'), api_base=os.environ.get('OPENAI_BASE_URL'))
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
documents = SimpleDirectoryReader(input_files=["BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf"]).load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents[0:36])
index = VectorStoreIndex(nodes, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What is llama?")
```

## Evaluation

### Evaluate Semantic Representation by MTEB

We provide evaluation tools for embedding and reranker models, based on MTEB and C_MTEB.

#### 1. Embedding Models

Run the following command to evaluate `your_embedding_model` (e.g. maidalun1020/bce-embedding-base_v1) in bilingual and crosslingual settings (e.g. ["en", "zh", "en-zh", "zh-en"]):

```bash
python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path maidalun1020/bce-embedding-base_v1 --pooler cls
```

The total evaluation tasks contain 114 datasets across "Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering".

**NOTE:** All models are evaluated with their recommended pooling method (pooler):

- mean pooler: "jina-embeddings-v2-base-en", "m3e-base", "m3e-large", "e5-large-v2", "multilingual-e5-base", "multilingual-e5-large" and "gte-large".
- cls pooler: all other models.
- "jina-embeddings-v2-base-en" should be loaded with trust_remote_code:

```bash
python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path {moka-ai/m3e-base | moka-ai/m3e-large} --pooler mean

python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path jinaai/jina-embeddings-v2-base-en --pooler mean --trust_remote_code
```

#### 2. Reranker Models

Run the following command to evaluate `your_reranker_model` (e.g. "maidalun1020/bce-reranker-base_v1") in bilingual and crosslingual settings (e.g. ["en", "zh", "en-zh", "zh-en"]):

```bash
python BCEmbedding/tools/eval_mteb/eval_reranker_mteb.py --model_name_or_path maidalun1020/bce-reranker-base_v1
```

The evaluation tasks contain 12 datasets of "Reranking".
#### 3. Metrics Visualization Tool

We provide a one-click script to summarize the evaluation results of embedding and reranker models, as in the Embedding Models Evaluation Summary and the Reranker Models Evaluation Summary:

```bash
python BCEmbedding/evaluation/mteb/summarize_eval_results.py --results_dir {your_embedding_results_dir | your_reranker_results_dir}
```

### Evaluate RAG by LlamaIndex

LlamaIndex is a well-known data framework for LLM-based applications, particularly in RAG. Recently, the LlamaIndex Blog evaluated popular embedding and reranker models in a RAG pipeline, attracting wide attention. We follow its pipeline to evaluate our BCEmbedding.

First, install LlamaIndex:

```bash
pip install llama-index==0.9.22
```

#### 1. Metrics Definition

- **Hit Rate**: the fraction of queries for which the correct answer is found within the top-k retrieved documents. In simpler terms, how often the system gets it right within the top few guesses. The larger, the better.
- **Mean Reciprocal Rank (MRR)**: for each query, MRR looks at the rank of the highest-placed relevant document and averages the reciprocals of these ranks across all queries. If the first relevant document is the top result, the reciprocal rank is 1; if it is second, 1/2, and so on. The larger, the better.
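To make the two definitions concrete, here is a small self-contained sketch (not part of the BCEmbedding tooling) that computes both metrics from ranked retrieval results, assuming one ground-truth document per query:

```python
def hit_rate_and_mrr(ranked_ids, relevant_ids, k=10):
    """Compute Hit Rate@k and MRR over a set of queries.

    ranked_ids: per query, a list of retrieved document ids, best first.
    relevant_ids: per query, the single relevant (ground-truth) document id.
    """
    hits, reciprocal_ranks = 0, []
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        # hit rate: was the correct document in the top-k results?
        if relevant in ranking[:k]:
            hits += 1
        # reciprocal rank of the relevant document (0 if never retrieved)
        rr = 1.0 / (ranking.index(relevant) + 1) if relevant in ranking else 0.0
        reciprocal_ranks.append(rr)
    n = len(ranked_ids)
    return hits / n, sum(reciprocal_ranks) / n

# toy example: two queries, ground-truth docs "a" and "b"
hit, mrr = hit_rate_and_mrr([["a", "c", "d"], ["c", "b", "d"]], ["a", "b"], k=2)
print(hit, mrr)  # 1.0, 0.75  (MRR = (1 + 1/2) / 2)
```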
#### 2. Reproduce the LlamaIndex Blog

To compare our BCEmbedding with other embedding and reranker models fairly, we provide a one-click script to reproduce the results of the LlamaIndex Blog, including our BCEmbedding:

```bash
# There should be at least two GPUs available.
CUDA_VISIBLE_DEVICES=0,1 python BCEmbedding/tools/eval_rag/eval_llamaindex_reproduce.py
```

Then, summarize the evaluation results:

```bash
python BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir results/rag_reproduce_results
```

Results reproduced from the LlamaIndex Blog can be checked in the Reproduced Summary of RAG Evaluation, with some clear conclusions:

- In the WithoutReranker setting, our bce-embedding-base_v1 outperforms all other embedding models.
- With the embedding model fixed, our bce-reranker-base_v1 achieves the best performance.
- **The combination of bce-embedding-base_v1 and bce-reranker-base_v1 is SOTA.**

#### 3. Broad Domain Adaptability

The LlamaIndex Blog evaluation is monolingual, uses a small amount of data, and covers a single domain (only the "llama2" paper). To evaluate broad domain adaptability together with bilingual and crosslingual capability, we follow the blog's methodology to build a multi-domain evaluation dataset covering "Computer Science", "Physics", "Biology", "Economics", "Math", and "Quantitative Finance", named CrosslingualMultiDomainsDataset, generated with OpenAI gpt-4-1106-preview for high quality.

First, run the following command to evaluate the most popular and powerful embedding and reranker models:

```bash
# There should be at least two GPUs available.
CUDA_VISIBLE_DEVICES=0,1 python BCEmbedding/tools/eval_rag/eval_llamaindex_multiple_domains.py
```

Then, run the following script to summarize the evaluation results:

```bash
python BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir results/rag_results
```

The summary of the multi-domain evaluations can be seen in Multiple Domains Scenarios.

## Leaderboard

### Semantic Representation Evaluations in MTEB

#### 1. Embedding Models

| Model | Dimensions | Pooler | Instructions | Retrieval (47) | STS (19) | PairClassification (5) | Classification (21) | Reranking (12) | Clustering (15) | AVG (119) |
|---|---|---|---|---|---|---|---|---|---|---|
| bge-base-en-v1.5 | 768 | cls | Need | 37.14 | 55.06 | 75.45 | 59.73 | 43.00 | 37.74 | 47.19 |
| bge-base-zh-v1.5 | 768 | cls | Need | 47.63 | 63.72 | 77.40 | 63.38 | 54.95 | 32.56 | 53.62 |
| bge-large-en-v1.5 | 1024 | cls | Need | 37.18 | 54.09 | 75.00 | 59.24 | 42.47 | 37.32 | 46.80 |
| bge-large-zh-v1.5 | 1024 | cls | Need | 47.58 | 64.73 | 79.14 | 64.19 | 55.98 | 33.26 | 54.23 |
| e5-large-v2 | 1024 | mean | Need | 35.98 | 55.23 | 75.28 | 59.53 | 42.12 | 36.51 | 46.52 |
| gte-large | 1024 | mean | Free | 36.68 | 55.22 | 74.29 | 57.73 | 42.44 | 38.51 | 46.67 |
| gte-large-zh | 1024 | cls | Free | 41.15 | 64.62 | 77.58 | 62.04 | 55.62 | 33.03 | 51.51 |
| jina-embeddings-v2-base-en | 768 | mean | Free | 31.58 | 54.28 | 74.84 | 58.42 | 41.16 | 34.67 | 44.29 |
| m3e-base | 768 | mean | Free | 46.29 | 63.93 | 71.84 | 64.08 | 52.38 | 37.84 | 53.54 |
| m3e-large | 1024 | mean | Free | 34.85 | 59.74 | 67.69 | 60.07 | 48.99 | 31.62 | 46.78 |
| multilingual-e5-base | 768 | mean | Need | 54.73 | 65.49 | 76.97 | 69.72 | 55.01 | 38.44 | 58.34 |
| multilingual-e5-large | 1024 | mean | Need | 56.76 | 66.79 | 78.80 | 71.61 | 56.49 | 43.09 | 60.50 |
| **bce-embedding-base_v1** | 768 | cls | Free | 57.60 | 65.73 | 74.96 | 69.00 | 57.29 | 38.95 | 59.43 |

**NOTE:**

- Our bce-embedding-base_v1 outperforms other open-source embedding models of comparable size.
- Evaluated on 114 datasets of "Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering" in the ["en", "zh", "en-zh", "zh-en"] setting.
- The crosslingual evaluation datasets we released belong to the Retrieval task.
- For more evaluation details, please check the Embedding Models Evaluation Summary.

#### 2. Reranker Models

| Model | Reranking (12) | AVG (12) |
|---|---|---|
| bge-reranker-base | 59.04 | 59.04 |
| bge-reranker-large | 60.86 | 60.86 |
| **bce-reranker-base_v1** | **61.29** | **61.29** |

**NOTE:**

- Our bce-reranker-base_v1 outperforms other open-source reranker models.
- Evaluated on 12 datasets of "Reranking" in the ["en", "zh", "en-zh", "zh-en"] setting.
- For more evaluation details, please check the Reranker Models Evaluation Summary.

### RAG Evaluations in LlamaIndex

#### 1. Multiple Domains Scenarios

[Figure: RAG evaluation results across multiple domain scenarios]

**NOTE:**

- Evaluated in the ["en", "zh", "en-zh", "zh-en"] setting.
- In the WithoutReranker setting, our bce-embedding-base_v1 outperforms all other embedding models.
- With the embedding model fixed, our bce-reranker-base_v1 achieves the best performance.
- **The combination of bce-embedding-base_v1 and bce-reranker-base_v1 is SOTA.**

## Youdao's BCEmbedding API

For users who prefer a hassle-free experience without the need to download and configure the model on their own systems, BCEmbedding is readily accessible through Youdao's API. This option offers a streamlined and efficient way to integrate BCEmbedding into your projects, bypassing the complexities of manual setup and maintenance. Detailed instructions and comprehensive API documentation are available at Youdao BCEmbedding API.
There you'll find all the necessary guidance to implement BCEmbedding across a variety of use cases, ensuring smooth and effective integration.

## WeChat Group

Welcome to scan the QR code below and join the WeChat group.

[Image: WeChat group QR code]

## Citation

If you use BCEmbedding in your research or project, please feel free to cite and star it:

```bibtex
@misc{youdao_bcembedding_2023,
    title={BCEmbedding: Bilingual and Crosslingual Embedding for RAG},
    author={NetEase Youdao, Inc.},
    year={2023},
    howpublished={\url{https://github.com/netease-youdao/BCEmbedding}}
}
```

## License

BCEmbedding is licensed under the Apache 2.0 License.

## Related Links

- Netease Youdao - QAnything
- FlagEmbedding
- MTEB
- C_MTEB
- LLama Index | LlamaIndex Blog


Updated 4/28/2024


bce-reranker-base_v1

maidalun1020

Total Score: 95

**BCEmbedding: Bilingual and Crosslingual Embedding for RAG** (bce-reranker-base_v1). The latest updates can be checked on GitHub.

Key points:

- **Multilingual and crosslingual capability** in English, Chinese, Japanese and Korean.
- **RAG adaptation for many domains**, including Education, Law, Finance, Medical, Literature, FAQ, Textbook, Wikipedia, etc.
- **Long passage reranking**: handles passages longer than the 512-token limit in BCEmbedding.
- **Meaningful similarity scores**: RerankerModel provides a "smooth" score for reranking and a "meaningful" score for filtering bad passages (with a threshold of 0.35 or 0.4), which helps you figure out how relevant the query and passages are.

**Best practice:**

1. Get the top 50-100 passages with bce-embedding-base_v1 for "recall".
2. Rerank those passages with bce-reranker-base_v1 and keep the top 5-10 for "precision".

**News:**

- BCEmbedding Technical Blog
- Related link for EmbeddingModel: bce-embedding-base_v1

**Third-party Examples:**

- RAG applications: QAnything, HuixiangDou, ChatPDF.
- Efficient inference frameworks: ChatLLM.cpp, Xinference, mindnlp (Huawei GPU).
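To make the "meaningful score" concrete, here is a minimal filtering sketch using the RerankerModel API from the Quick Start above; the query, passages, and the 0.4 cut-off are illustrative, and `rerank` additionally handles over-long passages internally:

```python
from BCEmbedding import RerankerModel

# illustrative query and passages
query = 'What fruits do I like?'
passages = ['I like apples', 'The weather is nice today', 'I like oranges']

model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# scores are sigmoid outputs in [0, 1]; per the notes above, a threshold
# of roughly 0.35-0.4 can be used to drop irrelevant passages
scores = model.compute_score([[query, p] for p in passages])
relevant_passages = [p for p, s in zip(passages, scores) if s >= 0.4]
```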
The remainder of this model card (overview, key features, installation, quick start, RAG framework integrations, evaluation, leaderboard, API, citation, and license) is identical to the bce-embedding-base_v1 card above.


Updated 4/28/2024