[](#m3e-models)M3E Models
=========================

[m3e-small](https://huggingface.co/moka-ai/m3e-small) | [m3e-base](https://huggingface.co/moka-ai/m3e-base) | [m3e-large](https://huggingface.co/moka-ai/m3e-large)

M3E  Moka Massive Mixed Embedding 

*   Moka MokaAI  [uniem](https://github.com/wangyuxinwhy/uniem/blob/main/scripts/train_m3e.py)  BenchMark  [MTEB-zh](https://github.com/wangyuxinwhy/uniem/tree/main/mteb-zh)
*   Massive**** (2200w+) 
*   Mixed
*   Embedding

[](#)
-------------

*   2023.06.14 UER, ErLangShen, DMetaSoul
*   2023.06.08 T2Ranking 1W m3e-base  ndcg@10  0.8004 openai-ada-002  0.7786
*   2023.06.07 6 m3e-base  accuracy  0.6157 openai-ada-002  0.5956

[](#)
-------------









s2s

s2p

s2c





s2s Acc

s2p ndcg@10

m3e-small

24M

512















0.5834

0.7262

m3e-base

110M

768















0.6157

**0.8004**

m3e-large

340M

768















**0.6231**

0.7974

text2vec

110M

768















0.5755

0.6346

openai-ada-002



1536















0.5956

0.7786



*   s2s,  sentence to sentence 
*   s2p,  sentence to passage GPT 
*   s2c,  sentence to code 
*    m3e  text2vec  sentence-transformers  openai 
*   ACC & ndcg@10

Tips:

*    m3e 
*    openai text-embedding-ada-002
*    openai text-embedding-ada-002
*    S2S 

[](#)
-------------

 sentence-transformers

    pip install -U sentence-transformers
    

 M3E Models

    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer('moka-ai/m3e-base')
    
    #Our sentences we like to encode
    sentences = [
        '* Moka  MokaAI  uniem',
        '* Massive ****',
        '* Mixed ALL in one'
    ]
    
    #Sentences are encoded by calling model.encode()
    embeddings = model.encode(sentences)
    
    #Print the embeddings
    for sentence, embedding in zip(sentences, embeddings):
        print("Sentence:", sentence)
        print("Embedding:", embedding)
        print("")
    

M3E  [sentence-transformers](https://www.sbert.net/) **** sentence-transformers **** M3E Models [chroma](https://docs.trychroma.com/getting-started), [guidance](https://github.com/microsoft/guidance), [semantic-kernel](https://github.com/microsoft/semantic-kernel) 

[](#)
-------------

M3E  in-batch  in-batch  A100 80G  batch-size 2200W+  1 epoch [uniem](https://github.com/wangyuxinwhy/uniem/blob/main/scripts/train_m3e.py)

[](#)
---------

*   M3E  2200W  [M3E ](#M3E%E6%95%B0%E6%8D%AE%E9%9B%86)
*   M3E  MEDI 145W  [MEDI ](https://drive.google.com/file/d/1vZ5c2oJNonGOvXzppNg5mHz24O6jcc52/view) [instructor team](https://github.com/HKUNLP/instructor-embedding) 
*   M3E  300W +  M3E  [instructor-embedding](https://github.com/HKUNLP/instructor-embedding)
*   M3E  hfl  [Roberta](https://huggingface.co/hfl/chinese-roberta-wwm-ext)  smallbaselarge
*   ALL IN ONEM3E  ALL IN ONE 

[](#)
---------

*   [text2vec](https://github.com/shibing624/text2vec), m3e-base, m3e-small, openai text-embedding-ada-002, [DMetaSoul](https://huggingface.co/DMetaSoul/sbert-chinese-general-v2), [UER](https://huggingface.co/uer/sbert-base-chinese-nli), [ErLangShen](https://huggingface.co/IDEA-CCNL/Erlangshen-SimCSE-110M-Chinese)
*    \[MTEB-zh\] ([https://github.com/wangyuxinwhy/uniem/blob/main/mteb-zh](https://github.com/wangyuxinwhy/uniem/blob/main/mteb-zh))

### [](#)

*    HuggingFace  6 
*    MTEB  Accuracy

text2vec

m3e-small

m3e-base

m3e-large

openai

DMetaSoul

uer

erlangshen

TNews

0.43

0.4443

0.4827

**0.4866**

0.4594

0.3084

0.3539

0.4361

JDIphone

0.8214

0.8293

0.8533

**0.8692**

0.746

0.7972

0.8283

0.8356

GubaEastmony

0.7472

0.712

0.7621

0.7663

0.7574

0.735

0.7534

**0.7787**

TYQSentiment

0.6099

0.6596

0.7188

**0.7247**

0.68

0.6437

0.6662

0.6444

StockComSentiment

0.4307

0.4291

0.4363

0.4475

**0.4819**

0.4309

0.4555

0.4482

IFlyTek

0.414

0.4263

0.4409

0.4445

**0.4486**

0.3969

0.3762

0.4241

Average

0.5755

0.5834

0.6157

**0.6231**

0.5956

0.552016667

0.57225

0.594516667

### [](#)

#### [](#t2ranking-1w)T2Ranking 1W

*    [T2Ranking](https://github.com/THUIR/T2Ranking/tree/main)  T2Ranking openai  api  T2Ranking  10000 
*    MTEB  map@1, map@10, mrr@1, mrr@10, ndcg@1, ndcg@10
*    M3E  openai 

text2vec

openai-ada-002

m3e-small

m3e-base

m3e-large

DMetaSoul

uer

erlangshen

map@1

0.4684

0.6133

0.5574

**0.626**

0.6256

0.25203

0.08647

0.25394

map@10

0.5877

0.7423

0.6878

**0.7656**

0.7627

0.33312

0.13008

0.34714

mrr@1

0.5345

0.6931

0.6324

0.7047

**0.7063**

0.29258

0.10067

0.29447

mrr@10

0.6217

0.7668

0.712

**0.7841**

0.7827

0.36287

0.14516

0.3751

ndcg@1

0.5207

0.6764

0.6159

0.6881

**0.6884**

0.28358

0.09748

0.28578

ndcg@10

0.6346

0.7786

0.7262

**0.8004**

0.7974

0.37468

0.15783

0.39329

#### [](#t2ranking)T2Ranking

*    T2Ranking openai-ada-002  T2Ranking 10W  T2Ranking 50W T2Ranking ... 128G 
*    MTEB  ndcg@10

text2vec

m3e-small

m3e-base

t2r-1w

0.6346

0.72621

**0.8004**

t2r-10w

0.44644

0.5251

**0.6263**

t2r-50w

0.33482

0.38626

**0.47364**



*    text2vec  text2vec 

[](#m3e)M3E
-----------------

 [uniem process\_zh\_datasets](https://github.com/wangyuxinwhy/uniem/blob/main/scripts/process_zh_datasets.py)  huggingface  huggingface 









Prompt







/





Done

URL



cmrc2018



14,363







Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, Guoping Hu

[https://github.com/ymcui/cmrc2018/blob/master/README\_CN.md](https://github.com/ymcui/cmrc2018/blob/master/README_CN.md) 









[https://huggingface.co/datasets/cmrc2018](https://huggingface.co/datasets/cmrc2018)



belle\_2m



2,000,000







LianjiaTech/BELLE

belle  self instruct  gpt3.5 









[https://huggingface.co/datasets/BelleGroup/train\_2M\_CN](https://huggingface.co/datasets/BelleGroup/train_2M_CN)



firefily



1,649,399







YeungNLP

Firefly Instruction TuningZeRO 









[https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M)



alpaca\_gpt4



48,818







Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, Jianfeng Gao

AlpacaGPT4self-instruct5









[https://huggingface.co/datasets/shibing624/alpaca-zh](https://huggingface.co/datasets/shibing624/alpaca-zh)



zhihu\_kol



1,006,218







wangrui6











[https://huggingface.co/datasets/wangrui6/Zhihu-KOL](https://huggingface.co/datasets/wangrui6/Zhihu-KOL)



hc3\_chinese



39,781







Hello-SimpleAI

 GPT 









[https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese)



amazon\_reviews\_multi



210,000

 

















[https://huggingface.co/datasets/amazon\_reviews\_multi/viewer/zh/train?row=8](https://huggingface.co/datasets/amazon_reviews_multi/viewer/zh/train?row=8)



mlqa



85,853







patrickvonplaten











[https://huggingface.co/datasets/mlqa/viewer/mlqa-translate-train.zh/train?p=2](https://huggingface.co/datasets/mlqa/viewer/mlqa-translate-train.zh/train?p=2)



xlsum



93,404







BUET CSE NLP Group

BBC









[https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/chinese\_simplified/train?row=259](https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/chinese_simplified/train?row=259)



ocnli



17,726







Thomas Wolf











[https://huggingface.co/datasets/clue/viewer/ocnli](https://huggingface.co/datasets/clue/viewer/ocnli)



BQ



60,000







Intelligent Computing Research Center, Harbin Institute of Technology(Shenzhen)

[http://icrc.hitsz.edu.cn/info/1037/1162.htm](http://icrc.hitsz.edu.cn/info/1037/1162.htm) BQ  120000 100000 10000 10000   









[https://huggingface.co/datasets/shibing624/nli\_zh/viewer/BQ](https://huggingface.co/datasets/shibing624/nli_zh/viewer/BQ)



lcqmc



149,226







Ming Xu

LCQMC  COLING2018 









[https://huggingface.co/datasets/shibing624/nli\_zh/viewer/LCQMC/train](https://huggingface.co/datasets/shibing624/nli_zh/viewer/LCQMC/train)



paws-x



23,576







Bhavitvya Malik

PAWS Wiki









[https://huggingface.co/datasets/paws-x/viewer/zh/train](https://huggingface.co/datasets/paws-x/viewer/zh/train)



wiki\_atomic\_edit



1,213,780







abhishek thakur











[https://huggingface.co/datasets/wiki\_atomic\_edits](https://huggingface.co/datasets/wiki_atomic_edits)



chatmed\_consult



549,326







Wei Zhu

 gpt3.5 









[https://huggingface.co/datasets/michaelwzhu/ChatMed\_Consult\_Dataset](https://huggingface.co/datasets/michaelwzhu/ChatMed_Consult_Dataset)



webqa



42,216







suolyer

2016









[https://huggingface.co/datasets/suolyer/webqa/viewer/suolyer--webqa/train?p=3](https://huggingface.co/datasets/suolyer/webqa/viewer/suolyer--webqa/train?p=3)



dureader\_robust



65,937

 







DuReader robust









[https://huggingface.co/datasets/PaddlePaddle/dureader\_robust/viewer/plain\_text/train?row=96](https://huggingface.co/datasets/PaddlePaddle/dureader_robust/viewer/plain_text/train?row=96)



csl



395,927







Yudong Li, Yuqing Zhang, Zhe Zhao, Linlin Shen, Weijie Liu, Weiquan Mao and Hui Zhang

CSL 396,209  CSL NLP 









[https://huggingface.co/datasets/neuclir/csl](https://huggingface.co/datasets/neuclir/csl)



miracl-corpus



4,934,368







MIRACL

The corpus for each language is prepared from a Wikipedia dump, where we keep only the plain text and discard images, tables, etc. Each article is segmented into multiple passages using WikiExtractor based on natural discourse units (e.g., \\n\\n in the wiki markup). Each of these passages comprises a "document" or unit of retrieval. We preserve the Wikipedia article title of each passage.









[https://huggingface.co/datasets/miracl/miracl-corpus](https://huggingface.co/datasets/miracl/miracl-corpus)



lawzhidao



36,368







-Ustinian











[https://www.heywhale.com/mw/dataset/5e953ca8e7ec38002d02fca7/content](https://www.heywhale.com/mw/dataset/5e953ca8e7ec38002d02fca7/content)



CINLID



34,746









Chinese Idioms Natural Language Inference Dataset106832entailmentcontradictionneutralNLI









[https://www.luge.ai/#/luge/dataDetail?id=39](https://www.luge.ai/#/luge/dataDetail?id=39)



DuSQL

SQL

25,003

NL2SQL

SQL





DuSQL200164









[https://www.luge.ai/#/luge/dataDetail?id=13](https://www.luge.ai/#/luge/dataDetail?id=13)



Zhuiyi-NL2SQL

SQL

45,918

NL2SQL

SQL



 

NL2SQL









[https://www.luge.ai/#/luge/dataDetail?id=12](https://www.luge.ai/#/luge/dataDetail?id=12)



Cspider

SQL

7,785

NL2SQL

SQL



 

CSpider









[https://www.luge.ai/#/luge/dataDetail?id=11](https://www.luge.ai/#/luge/dataDetail?id=11)



news2016zh



2,507,549







Bright Xu

2506.3









[https://github.com/brightmart/nlp\_chinese\_corpus](https://github.com/brightmart/nlp_chinese_corpus)



baike2018qa



1,470,142







Bright Xu

15049210434









[https://github.com/brightmart/nlp\_chinese\_corpus](https://github.com/brightmart/nlp_chinese_corpus)



webtext2019zh



4,258,310







Bright Xu

4102.8









[https://github.com/brightmart/nlp\_chinese\_corpus](https://github.com/brightmart/nlp_chinese_corpus)



SimCLUE



775,593







 simCLUE 











[https://github.com/CLUEbenchmark/SimCLUE](https://github.com/CLUEbenchmark/SimCLUE)



Chinese-SQuAD



76,449







junzeng-pluto

Squad









[https://github.com/pluto-junzeng/ChineseSquad](https://github.com/pluto-junzeng/ChineseSquad)



[](#)
-----------

*     MTEB  BenchMark, [MTEB-zh](https://github.com/wangyuxinwhy/uniem/tree/main/mteb-zh)
*     Large 
*    
*     M3E  m3e-hq huggingface 
*     m3e-hq  hard negative  m3e-hq-with-score huggingface 
*     m3e-hq-with-score  [cosent loss](https://github.com/wangyuxinwhy/uniem/blob/main/uniem/criteria.py#LL24C39-L24C39) loss CoSent [](https://kexue.fm/archives/8847)
*     M3E models

[](#)
---------



[](#license)License
-------------------

M3E models  M3E models  M3E 

[](#citation)Citation
---------------------

Please cite this model using the following format:

      @software {Moka Massive Mixed Embedding,  
      author = {Wang Yuxin,Sun Qingxuan,He sicheng},  
      title = {M3E: Moka Massive Mixed Embedding Model},  
      year = {2023}
      }

## Model overview

The `m3e-large` model is part of the M3E (Moka Massive Mixed Embedding) series of text embedding models developed by the Moka AI team. The M3E models are large-scale multilingual text embedding models that can be used for a variety of natural language processing tasks. The `m3e-large` model is the largest in the series, with 340 million parameters and a 768-dimensional embedding size.

The M3E models are designed to provide strong performance on a range of benchmarks, including the MTEB-zh Chinese language benchmark. Compared to similar models like [multilingual-e5-large](https://aimodels.fyi/models/huggingFace/multilingual-e5-large-beautyyuyanli), [bge-large-en-v1.5](https://aimodels.fyi/models/huggingFace/bge-large-en-v15-nateraw), and [moe-llava](https://aimodels.fyi/models/huggingFace/moe-llava-camenduru), the M3E models leverage a massive, mixed-domain training dataset to learn rich and generalizable text representations.

The `m3e-base` model in this series has also shown strong performance, outperforming OpenAI's `text-embedding-ada-002` model on several MTEB-zh tasks.

## Model inputs and outputs

### Inputs
- **Text sequences**: The `m3e-large` model can accept single sentences or longer text passages as input.

### Outputs
- **Text embeddings**: The model outputs fixed-length vector representations (embeddings) of the input text. These embeddings can be used for a variety of downstream tasks, such as semantic search, text classification, and clustering.

## Capabilities

The `m3e-large` model demonstrates strong performance on a variety of text-based tasks, especially those involving semantic understanding and retrieval. For example, it has achieved a 0.6231 accuracy score on the sentence-to-sentence (s2s) task and a 0.7974 NDCG@10 score on the sentence-to-passage (s2p) task in the MTEB-zh benchmark.

## What can I use it for?

The `m3e-large` model can be used for a wide range of natural language processing applications, such as:

- **Semantic search**: The rich text embeddings produced by the model can be used to build powerful semantic search engines, allowing users to find relevant information based on the meaning of their queries rather than just keyword matching.

- **Text classification**: The model's embeddings can be used as features for training high-performance text classification models, such as those for sentiment analysis, topic categorization, or intent detection.

- **Recommendation systems**: The semantic understanding of the `m3e-large` model can be leveraged to build advanced recommendation systems that suggest relevant content or products based on user preferences and behavior.

## Things to try

One interesting aspect of the `m3e-large` model is its potential for domain-specific fine-tuning. By further training the model on task-specific data using tools like the [uniem](https://github.com/wangyuxinwhy/uniem) library, you can likely achieve even stronger performance on specialized applications.

Additionally, the model's large size and diverse training data make it a promising starting point for exploring few-shot and zero-shot learning approaches, where the model can leverage its broad knowledge to quickly adapt to new tasks with limited additional training.