m3e-base

Maintainer: moka-ai

Total Score: 830

Last updated 5/21/2024


  • Model Link: View on HuggingFace
  • API Spec: View on HuggingFace
  • Github Link: No Github link provided
  • Paper Link: No paper link provided


Model Overview

The m3e-base model is part of the M3E (Moka Massive Mixed Embedding) series of models developed by Moka AI. M3E models are designed to be versatile, supporting a variety of natural language processing tasks such as dense retrieval, multi-vector retrieval, and sparse retrieval. The m3e-base model has 110 million parameters and a hidden size of 768.

M3E models are trained on a corpus of over 22 million Chinese sentence pairs, making them well-suited for general-purpose language understanding. The models have demonstrated strong performance on benchmarks like MTEB-zh, outperforming models like openai-ada-002 on sentence-to-sentence (s2s) accuracy and sentence-to-passage (s2p) nDCG@10.

Similar models in the M3E series include the m3e-small and m3e-large versions, which have different parameter sizes and performance characteristics depending on the task.

Model Inputs and Outputs

Inputs

  • Text: The m3e-base model accepts text inputs of varying lengths, up to the 512-token limit of its BERT-style encoder.

Outputs

  • Embeddings: The model outputs dense vector representations of the input text, which can be used for a variety of downstream tasks such as similarity search, text classification, and retrieval.
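
Because m3e-base ships as a sentence-transformers checkpoint, producing embeddings takes only a few lines. Here is a minimal sketch, assuming the sentence-transformers package is installed; the example sentences are placeholders:

```python
# Minimal embedding sketch for m3e-base (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moka-ai/m3e-base")
sentences = ["今天天气真好", "M3E 是一系列中文文本嵌入模型"]  # placeholder inputs
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per sentence
```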

Capabilities

The m3e-base model has demonstrated strong performance on a range of natural language processing tasks, including:

  • Sentence Similarity: The model can be used to compute the semantic similarity between sentences, which is useful for applications like paraphrase detection and text summarization (a short example follows this list).
  • Text Classification: The embeddings produced by the model can be used as features for training text classification models, such as for sentiment analysis or topic classification.
  • Retrieval: The model's dense and sparse retrieval capabilities make it well-suited for building search engines and question-answering systems.
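
As a quick illustration of the sentence-similarity capability above, the following sketch scores a paraphrase against an unrelated sentence with cosine similarity; the sentences are invented examples:

```python
# Sketch: paraphrases should score markedly higher than unrelated text.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("moka-ai/m3e-base")
emb = model.encode(["我喜欢这部电影", "这部电影很好看", "今天股市下跌了"])

# Cosine similarity of the first sentence against the other two.
print(util.cos_sim(emb[0], emb[1:]))  # the paraphrase should score highest
```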

What Can I Use It For?

The versatility of the m3e-base model makes it a valuable tool for a wide range of natural language processing applications. Some potential use cases include:

  • Semantic Search: Use the model's dense embeddings to build a semantic search engine, allowing users to find relevant information based on the meaning of their queries rather than just keyword matching (a minimal sketch follows this list).
  • Personalized Recommendations: Leverage the model's strong text understanding capabilities to build personalized recommendation systems, such as for content or product recommendations.
  • Chatbots and Conversational AI: Integrate the model into chatbot or virtual assistant applications to enable more natural and contextual language understanding and generation.
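
For the semantic-search use case, a toy end-to-end sketch follows; the corpus, query, and top_k value are illustrative assumptions, and sentence_transformers.util.semantic_search performs the brute-force cosine scoring:

```python
# Sketch: a tiny semantic search index over an FAQ-style corpus.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("moka-ai/m3e-base")

corpus = ["如何重置密码", "退货政策说明", "配送需要多长时间"]  # placeholder documents
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("忘记密码了怎么办", convert_to_tensor=True)
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```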

Things to Try

One interesting aspect of the m3e-base model is its ability to perform both dense and sparse retrieval. This hybrid approach can be beneficial for building more robust and accurate retrieval systems.

To experiment with the model's retrieval capabilities, you can try integrating it with tools like chroma, guidance, and semantic-kernel. These tools provide abstractions and utilities for building search and question-answering applications using large language models like m3e-base.

Additionally, the uniem library provides a convenient interface for fine-tuning the m3e-base model on domain-specific datasets, which can further improve performance on your specific use case; a minimal sketch follows.
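
A minimal fine-tuning sketch, following the pattern in the uniem documentation; the shibing624/nli_zh (STS-B) dataset and the epoch count are stand-ins for your own domain data and training budget:

```python
# Hedged sketch of fine-tuning m3e-base with uniem (pip install uniem).
from datasets import load_dataset
from uniem.finetuner import FineTuner

dataset = load_dataset("shibing624/nli_zh", "STS-B")  # stand-in for domain data
finetuner = FineTuner.from_pretrained("moka-ai/m3e-base", dataset=dataset)
finetuner.run(epochs=3)
```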



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


m3e-large

moka-ai

Total Score: 183

The m3e-large model is part of the M3E (Moka Massive Mixed Embedding) series of text embedding models developed by the Moka AI team. The M3E models are large-scale multilingual text embedding models that can be used for a variety of natural language processing tasks. The m3e-large model is the largest in the series, with 340 million parameters and a 768-dimensional embedding size. The M3E models are designed to provide strong performance on a range of benchmarks, including the MTEB-zh Chinese language benchmark.

Compared to similar models like multilingual-e5-large, bge-large-en-v1.5, and moe-llava, the M3E models leverage a massive, mixed-domain training dataset to learn rich and generalizable text representations. The m3e-base model in this series has also shown strong performance, outperforming OpenAI's text-embedding-ada-002 model on several MTEB-zh tasks.

Model inputs and outputs

Inputs

  • Text sequences: The m3e-large model can accept single sentences or longer text passages as input.

Outputs

  • Text embeddings: The model outputs fixed-length vector representations (embeddings) of the input text. These embeddings can be used for a variety of downstream tasks, such as semantic search, text classification, and clustering.

Capabilities

The m3e-large model demonstrates strong performance on a variety of text-based tasks, especially those involving semantic understanding and retrieval. For example, it has achieved a 0.6231 accuracy score on the sentence-to-sentence (s2s) task and a 0.7974 nDCG@10 score on the sentence-to-passage (s2p) task in the MTEB-zh benchmark.

What can I use it for?

The m3e-large model can be used for a wide range of natural language processing applications, such as:

  • Semantic search: The rich text embeddings produced by the model can be used to build powerful semantic search engines, allowing users to find relevant information based on the meaning of their queries rather than just keyword matching.
  • Text classification: The model's embeddings can be used as features for training high-performance text classification models, such as those for sentiment analysis, topic categorization, or intent detection.
  • Recommendation systems: The semantic understanding of the m3e-large model can be leveraged to build advanced recommendation systems that suggest relevant content or products based on user preferences and behavior.

Things to try

One interesting aspect of the m3e-large model is its potential for domain-specific fine-tuning. By further training the model on task-specific data using tools like the uniem library, you can likely achieve even stronger performance on specialized applications. Additionally, the model's large size and diverse training data make it a promising starting point for exploring few-shot and zero-shot learning approaches, where the model can leverage its broad knowledge to quickly adapt to new tasks with limited additional training.
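
Since clustering is listed among the downstream uses of these embeddings, here is an illustrative sketch that groups short Chinese texts with m3e-large and scikit-learn's KMeans; the texts, cluster count, and the choice of scikit-learn are all assumptions made for the demo:

```python
# Sketch: clustering finance vs. sports headlines with m3e-large embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("moka-ai/m3e-large")
texts = ["央行宣布降息", "股市收盘上涨", "球队赢得总决赛", "世界杯小组赛出线"]
embeddings = model.encode(texts)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # the two finance texts and the two sports texts should separate
```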



Baichuan-13B-Base

baichuan-inc

Total Score: 185

Baichuan-13B-Base is a large language model developed by Baichuan Intelligence, following their previous model Baichuan-7B. With 13 billion parameters, it achieves state-of-the-art performance on standard Chinese and English benchmarks among models of its size. This release includes both a pre-training model (Baichuan-13B-Base) and an aligned model with dialogue capabilities (Baichuan-13B-Chat).

Key features of Baichuan-13B-Base include:

  • Larger model size and more training data: It expands the parameter count to 13 billion based on Baichuan-7B and is trained on 1.4 trillion tokens, 40% more than LLaMA-13B.
  • Open-source pre-training and alignment models: The pre-training model is suitable for developers, while the aligned model (Baichuan-13B-Chat) has strong dialogue capabilities.
  • Efficient inference: Quantized INT8 and INT4 versions are available for deployment on consumer GPUs with minimal performance loss.
  • Open-source and commercially usable: The model is free for academic research and can also be used commercially after obtaining permission.

Model inputs and outputs

Inputs

  • Text prompts

Outputs

  • Continuation of the input text, generating coherent and relevant responses.

Capabilities

Baichuan-13B-Base demonstrates impressive performance on a wide range of tasks, including open-ended text generation, question answering, and multi-task benchmarks. It particularly excels at Chinese and English language understanding and generation, making it a powerful tool for developers and researchers working on natural language processing applications.

What can I use it for?

The Baichuan-13B-Base model can be fine-tuned for a variety of downstream tasks, such as:

  • Content generation (e.g., articles, stories, product descriptions)
  • Question answering and knowledge retrieval
  • Dialogue systems and chatbots
  • Summarization and text simplification
  • Translation between Chinese and English

Developers can also use the model's pre-training as a strong starting point for building custom language models tailored to their specific needs.

Things to try

With its large scale and strong performance, Baichuan-13B-Base offers many exciting possibilities for experimentation and exploration. Some ideas to try include:

  • Prompt engineering to elicit different types of responses, such as creative writing, task-oriented dialogue, or analytical reasoning.
  • Fine-tuning the model on domain-specific datasets to create specialized language models for fields like law, medicine, or finance.
  • Exploring the model's capabilities in multilingual tasks, such as cross-lingual question answering or generation.
  • Investigating the model's reasoning abilities by designing prompts that require complex understanding or logical inference.

The open-source nature of Baichuan-13B-Base and the accompanying code library make it an accessible and flexible platform for researchers and developers to push the boundaries of large language model capabilities.
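
As a concrete starting point, this sketch continues a prompt with Baichuan-13B-Base through the Hugging Face transformers API; fp16 weights and device_map="auto" are assumptions to fit the 13B model on available GPUs, and trust_remote_code is required because the repository ships custom modeling code:

```python
# Sketch: pattern-completion with Baichuan-13B-Base (poem title -> poet).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "baichuan-inc/Baichuan-13B-Base"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

# The prompt leaves the last poet blank for the model to fill in.
inputs = tokenizer("登鹳雀楼->王之涣\n夜雨寄北->", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```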



bge-m3

BAAI

Total Score: 818

bge-m3 is a versatile AI model developed by BAAI (Beijing Academy of Artificial Intelligence) that is distinguished by its multi-functionality, multi-linguality, and multi-granularity capabilities. It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval. The model supports more than 100 working languages and can process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.

Compared to similar models like m3e-large, bge-m3 offers a unique combination of retrieval functionalities in a single model. Other related models like bge_1-5_query_embeddings, bge-large-en-v1.5, bge-reranker-base, and bge-reranker-v2-m3 provide specific functionalities like query embedding generation, text embedding, and re-ranking.

Model inputs and outputs

Inputs

  • Text sequences of varying length, up to 8192 tokens

Outputs

  • Dense embeddings for retrieval
  • Sparse token-level representations for retrieval
  • Multi-vector representations for retrieval

Capabilities

bge-m3 can effectively handle a wide range of text-related tasks, such as dense retrieval, multi-vector retrieval, and sparse retrieval. The model's multi-functionality allows it to leverage the strengths of different retrieval methods, resulting in higher accuracy and stronger generalization. For example, the model can be used in a hybrid retrieval pipeline that combines embedding-based retrieval with the BM25 algorithm, without incurring additional cost.

What can I use it for?

bge-m3 can be leveraged in various applications that require effective text retrieval, such as chatbots, search engines, question-answering systems, and content recommendation engines. By taking advantage of the model's multi-functionality, users can build robust and versatile retrieval pipelines that cater to their specific needs.

Things to try

One interesting aspect of bge-m3 is its ability to process inputs of different granularities, from short sentences to long documents. This feature can be particularly useful in applications that involve a diverse range of text sources, such as social media posts, news articles, or research papers. Experiment with inputs of varying lengths and observe how the model performs across these scenarios. Additionally, the model's support for over 100 languages makes it a valuable tool for building multilingual systems. Consider exploring the model's performance on non-English text and how it compares to language-specific models or other multilingual alternatives.
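
To see the three retrieval outputs side by side, here is a sketch along the lines of the FlagEmbedding README (the query string is a placeholder):

```python
# Sketch: one encode call yields dense, sparse, and multi-vector outputs.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
output = model.encode(
    ["What is BGE M3?"],  # placeholder query
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)
print(output["dense_vecs"].shape)   # dense embeddings
print(output["lexical_weights"])    # per-token sparse weights
print(len(output["colbert_vecs"]))  # multi-vector (ColBERT-style) representations
```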



piccolo-large-zh

sensenova

Total Score: 59

The piccolo-large-zh model is a general text embedding model for Chinese, developed by the General Model Group at SenseTime Research. Inspired by E5 and GTE, piccolo is trained with a two-stage pipeline. In the first stage, the model is trained on 400 million weakly supervised Chinese text pairs collected from the internet, using a pair (text, text_pos) softmax contrastive loss. In the second stage, the model is fine-tuned on 20 million human-labeled Chinese text pairs, using a triplet (text, text_pos, text_neg) contrastive loss. This approach enables piccolo-large-zh to capture rich semantic information and perform well on a variety of downstream tasks.

The piccolo-large-zh model has 1024 embedding dimensions and can handle input sequences up to 512 tokens long. It outperforms other Chinese embedding models like bge-large-zh and piccolo-base-zh on the C-MTEB benchmark, achieving an average score of 64.11 across 35 datasets.

Model Inputs and Outputs

Inputs

  • Text sequences up to 512 tokens long

Outputs

  • 1024-dimensional text embeddings that capture the semantic meaning of the input text

Capabilities

The piccolo-large-zh model is highly capable at encoding Chinese text into semantic representations. These embeddings can be used for a variety of downstream tasks, such as:

  • Information retrieval: The embeddings can be used to find relevant documents or passages given a query.
  • Semantic search: The model can be used to find similar documents or passages based on their semantic content.
  • Text classification: The embeddings can be used as features for training text classification models.
  • Paraphrase detection: The model can be used to identify paraphrases of a given input text.

What Can I Use It For?

The piccolo-large-zh model can be used in a wide range of applications that involve working with Chinese text. Some potential use cases include:

  • Search and recommendation: Use the embeddings to build semantic search engines or recommendation systems for Chinese content.
  • Content clustering and organization: Group related Chinese documents or passages based on their semantic similarity.
  • Text analytics and insights: Extract meaningful insights from Chinese text data by leveraging the model's ability to capture semantic meaning.
  • Multilingual applications: Combine piccolo-large-zh with other language models to build cross-lingual applications.

Things to Try

One interesting aspect of the piccolo-large-zh model is its ability to handle long input sequences, up to 512 tokens. This makes it well-suited for tasks involving long-form Chinese text, such as document retrieval or question answering. You could try experimenting with the model's performance on such tasks and see how it compares to other Chinese language models.

Another interesting avenue would be to fine-tune piccolo-large-zh on domain-specific data, such as scientific literature or legal documents, to see whether it can capture specialized semantic knowledge in those areas. This could lead to improved performance on tasks like technical search or legal document classification.
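
To make the stage-one objective concrete, here is a didactic PyTorch reconstruction of a pair softmax contrastive (InfoNCE) loss with in-batch negatives; the batch size, embedding dimension, and temperature are arbitrary, and this is an illustration rather than SenseTime's actual training code:

```python
# Didactic sketch of a (text, text_pos) softmax contrastive loss.
import torch
import torch.nn.functional as F

def pair_contrastive_loss(text_emb, pos_emb, tau=0.05):
    """text_emb, pos_emb: (batch, dim) L2-normalized embeddings of aligned pairs."""
    logits = text_emb @ pos_emb.T / tau      # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))      # each text matches its own positive
    return F.cross_entropy(logits, targets)  # other rows act as in-batch negatives

emb = F.normalize(torch.randn(8, 1024), dim=-1)  # stand-in model outputs
pos = F.normalize(torch.randn(8, 1024), dim=-1)
print(pair_contrastive_loss(emb, pos))
```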
