piccolo-large-zh
----------------

piccolo is a general-purpose Chinese text embedding model from the General Model Group at SenseTime Research. Inspired by E5 and GTE, piccolo is trained with a two-stage pipeline. In the first stage, we collect and crawl 400 million weakly supervised Chinese text pairs from the Internet and train the model with a pair (text, text\_pos) softmax contrastive loss. In the second stage, we collect 20 million human-labeled Chinese text pairs and fine-tune the model with a triplet (text, text\_pos, text\_neg) contrastive loss. We currently offer two model sizes: piccolo-base-zh and piccolo-large-zh.

Metric
------

We compared piccolo with other embedding models on the C-MTEB benchmark; please refer to the C-MTEB leaderboard for details. Scripts for reproducing the results are provided in the ["eval" folder](https://huggingface.co/sensenova/piccolo-base-zh/tree/main/eval).

| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **piccolo-large-zh** | 0.65 | 1024 | 512 | **64.11** | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
| bge-large-zh | 1.3 | 1024 | 512 | 63.96 | 68.32 | 48.39 | 78.94 | 65.11 | 71.52 | 54.98 |
| **piccolo-base-zh** | 0.2 | 768 | 512 | **63.66** | 66.98 | 47.12 | 76.61 | 66.68 | 71.2 | 55.9 |
| bge-large-zh-no-instruct | 1.3 | 1024 | 512 | 63.4 | 68.58 | 50.01 | 76.77 | 64.9 | 70.54 | 53 |
| bge-base-zh | 0.41 | 768 | 512 | 62.8 | 67.07 | 47.64 | 77.5 | 64.91 | 69.53 | 54.12 |

Usage
-----

The piccolo model can be called easily through the sentence-transformers package:

    # for s2s datasets, you can use piccolo as below
    from sentence_transformers import SentenceTransformer
    sentences = ["1", "2"]
    model = SentenceTransformer('sensenova/piccolo-base-zh')
    embeddings_1 = model.encode(sentences, normalize_embeddings=True)
    embeddings_2 = model.encode(sentences, normalize_embeddings=True)
    # embeddings are L2-normalized, so the dot product is the cosine similarity
    similarity = embeddings_1 @ embeddings_2.T
    print(similarity)

    # for s2p datasets, we recommend adding an instruction prefix for passage retrieval
    from sentence_transformers import SentenceTransformer
    queries = ['query_1', 'query_2']
    passages = ["doc_1", "doc_2"]
    model = SentenceTransformer('sensenova/piccolo-base-zh')
    # "查询：" and "结果：" are the Chinese "query:" and "result:" instruction prefixes
    q_embeddings = model.encode(["查询：" + q for q in queries], normalize_embeddings=True)
    p_embeddings = model.encode(["结果：" + p for p in passages], normalize_embeddings=True)
    scores = q_embeddings @ p_embeddings.T
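
Once a query-by-passage score matrix is available, ranking passages for each query is a sort over each row. The following is a minimal sketch with NumPy that uses small made-up unit vectors in place of real model outputs; `top_k_passages` is a hypothetical helper, not part of sentence-transformers.

```python
import numpy as np

def top_k_passages(scores: np.ndarray, k: int) -> np.ndarray:
    """Return, for each query (row), the indices of the k highest-scoring passages."""
    # Sorting the negated scores in ascending order gives descending similarity.
    return np.argsort(-scores, axis=1)[:, :k]

# Toy stand-ins for normalized embeddings: 2 queries, 3 passages, dimension 2.
toy_queries = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_passages = np.array([[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]])
toy_scores = toy_queries @ toy_passages.T

ranking = top_k_passages(toy_scores, k=2)
print(ranking)
```

Each row of `ranking` lists passage indices from most to least similar for the corresponding query.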
    

Training Detail
---------------

### pretrain

Pretraining usually does not require a large max length; 128 is recommended. A small max length allows a larger batch size and faster training, which suits large-scale data. For the pretraining loss we use a pair contrastive loss with in-batch negatives only, without added hard negatives. In practice, we trained on 32 A100 (40 GB) GPUs with a batch size of 1024 per GPU.
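
A softmax contrastive loss with in-batch negatives can be sketched as below. This is an illustrative NumPy version, not the project's training code: the function name and the temperature value are assumptions, since the text does not state them.

```python
import numpy as np

def in_batch_contrastive_loss(text_emb, pos_emb, temperature=0.05):
    """Softmax contrastive loss where the other positives in the batch act as negatives.

    text_emb, pos_emb: arrays of shape (batch, dim), assumed L2-normalized.
    temperature: assumed value for illustration only.
    """
    logits = (text_emb @ pos_emb.T) / temperature          # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching (text, text_pos) pair sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x /= np.linalg.norm(x, axis=1, keepdims=True)

aligned = in_batch_contrastive_loss(x, x)                  # positives match their anchors
shuffled = in_batch_contrastive_loss(x, x[::-1].copy())    # positives are mismatched
print(aligned, shuffled)
```

With a per-GPU batch of 1024 as described above, each anchor would see 1023 in-batch negatives, assuming negatives are not gathered across GPUs (the text does not say whether they are).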

### finetune

For fine-tuning, we usually extend the max length to 512. To adapt to longer text input, fine-tuning samples more S2P data to improve retrieval performance. The fine-tuning loss is a triplet contrastive loss with hard negatives; the number of negatives is usually set to 2-7. For the loss computation, refer to the improved contrastive loss in GTE. Note: we set different max lengths for queries and passages, and the query max length is always kept at 64.
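
To make the role of hard negatives concrete, here is a minimal NumPy sketch in which each query is scored against its positive and its own explicit negatives. It is the plain softmax form, not GTE's improved variant, and the function name and temperature are assumptions made for illustration.

```python
import numpy as np

def hard_negative_contrastive_loss(query, pos, negs, temperature=0.05):
    """query, pos: (batch, dim); negs: (batch, num_neg, dim); all assumed L2-normalized."""
    pos_score = np.sum(query * pos, axis=1, keepdims=True)     # (batch, 1)
    neg_score = np.einsum("bd,bkd->bk", query, negs)            # (batch, num_neg)
    logits = np.concatenate([pos_score, neg_score], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[:, 0]))                     # the positive is column 0

q = np.array([[1.0, 0.0]])
p = np.array([[1.0, 0.0]])
easy_negs = np.array([[[0.0, 1.0], [0.0, -1.0]]])   # orthogonal to the query
hard_negs = np.array([[[1.0, 0.0], [1.0, 0.0]]])    # indistinguishable from the positive

easy_loss = hard_negative_contrastive_loss(q, p, easy_negs)
hard_loss = hard_negative_contrastive_loss(q, p, hard_negs)
print(easy_loss, hard_loss)
```

Negatives that are far from the query contribute almost nothing to the loss, while negatives that look like the positive dominate it, which is why mined hard negatives carry more training signal.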

### Others

Some useful tricks:

1.  Reducing memory usage: fp16 + gradient checkpointing + ZeRO stage 1 (stage 2 does not support gradient checkpointing under the two-tower structure). For the related issue, see: [https://github.com/microsoft/DeepSpeed/issues/988](https://github.com/microsoft/DeepSpeed/issues/988)
2.  Dataset sampler: we use M3E's dataset sampler to ensure that all samples in a batch come from the same dataset, which makes the in-batch negatives more valuable.
3.  Instruction: in our experiments, instructions greatly improved performance on retrieval tasks. We added instructions like 'query: ' and 'result: ' before each training sample.
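
The idea behind the dataset sampler in item 2 can be sketched in plain Python. This illustrates the batching rule only; it is not M3E's actual implementation, and `single_dataset_batches` is a hypothetical name.

```python
import random

def single_dataset_batches(datasets, batch_size, seed=0):
    """Yield (dataset_name, batch) pairs where every batch is drawn from one dataset.

    datasets: dict mapping a dataset name to a list of samples.
    Leftover samples that do not fill a whole batch are dropped in this sketch.
    """
    rng = random.Random(seed)
    batches = []
    for name, samples in datasets.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled) - batch_size + 1, batch_size):
            batches.append((name, shuffled[start:start + batch_size]))
    rng.shuffle(batches)  # interleave datasets across training steps
    yield from batches

toy = {
    "sts": [f"sts-{i}" for i in range(5)],
    "retrieval": [f"ret-{i}" for i in range(4)],
}
batches = list(single_dataset_batches(toy, batch_size=2))
for name, batch in batches:
    print(name, batch)
```

Because a batch never mixes datasets, the in-batch negatives come from the same domain as the anchor and are harder to tell apart than samples from an unrelated dataset.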

Reference
---------

Here we list the embedding projects and papers we referenced:

1.  [M3E](https://github.com/wangyuxinwhy/uniem). A great Chinese open-source embedding project that collects and organizes a large number of high-quality Chinese datasets. uniem is also a good framework.
2.  [Text2vec](https://github.com/shibing624/text2vec). Another great Chinese open-source embedding project.
3.  [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding). Zhiyuan AI's open-source embedding model. They collected and organized the C-MTEB benchmark, filling the gap in systematic evaluation of Chinese embeddings.
4.  [E5](https://github.com/microsoft/unilm/tree/master/e5). From Microsoft, with very detailed ablation experiments and data processing and filtering details.
5.  [GTE](https://huggingface.co/thenlper/gte-base). An embedding paper from Alibaba DAMO Academy.

License
-------

Piccolo uses the MIT License. It can be used for commercial purposes free of charge.

Acknowledgement
---------------

piccolo is powered by the General Model Group from SenseTime Research. [Jinkin](https://huggingface.co/Jinkin) completed the code implementation and model training. [Jinkin](https://huggingface.co/Jinkin) and [CCCCxxx](https://huggingface.co/CCCCxxx) completed the data collection, processing, and model evaluation together. The project is led by [Gaomengya](https://huggingface.co/gaomengya) and [chaorenwu111](https://huggingface.co/chaorenwu111). We also thank [lux0933](https://huggingface.co/lux0933) and [yangkai001](https://huggingface.co/yangkai001) for discussions that provided many useful suggestions.

## Model Overview

`piccolo-large-zh` is a general text embedding model for Chinese, developed by the General Model Group from SenseTime Research. Inspired by E5 and GTE, `piccolo` is trained using a two-stage pipeline. First, the model is trained on 400 million weakly supervised Chinese text pairs collected from the internet, using a pair (text, text_pos) softmax contrastive loss. In the second stage, the model is fine-tuned on 20 million human-labeled Chinese text pairs, using a triplet (text, text_pos, text_neg) contrastive loss.

The `piccolo-large-zh` model produces 1024-dimensional embeddings and accepts input sequences of up to 512 tokens. On the C-MTEB benchmark it averages 64.11 across 35 datasets, ahead of [bge-large-zh](https://aimodels.fyi/models/huggingFace/bge-base-zh-dmetsoul) (63.96) and [piccolo-base-zh](https://aimodels.fyi/models/huggingFace/sensenova) (63.66).

## Model Inputs and Outputs

### Inputs
- Text sequences up to 512 tokens long

### Outputs
- 1024-dimensional text embeddings that capture the semantic meaning of the input text

## Capabilities

The `piccolo-large-zh` model is highly capable at encoding Chinese text into semantic representations. These embeddings can be used for a variety of downstream tasks, such as:

- Information retrieval: The embeddings can be used to find relevant documents or passages given a query.
- Semantic search: The model can be used to find similar documents or passages based on their semantic content.
- Text classification: The embeddings can be used as features for training text classification models.
- Paraphrase detection: The model can be used to identify paraphrases of a given input text.

## What Can I Use It For?

The `piccolo-large-zh` model can be used in a wide range of applications that involve working with Chinese text. Some potential use cases include:

- **Search and Recommendation**: Use the embeddings to build semantic search engines or recommendation systems for Chinese content.
- **Content Clustering and Organization**: Group related Chinese documents or passages based on their semantic similarity.
- **Text Analytics and Insights**: Extract meaningful insights from Chinese text data by leveraging the model's ability to capture semantic meaning.
- **Multilingual Applications**: Combine `piccolo-large-zh` with other language models to build cross-lingual applications.

## Things to Try

The `piccolo-large-zh` model accepts input sequences of up to 512 tokens, which covers passage-level Chinese text; longer documents need to be split into chunks before encoding. You could try the model on passage retrieval or question answering over chunked documents and compare it with other Chinese embedding models.

Another interesting avenue to explore would be to fine-tune the `piccolo-large-zh` model on domain-specific data, such as scientific literature or legal documents, to see if it can capture specialized semantic knowledge in those areas. This could lead to improved performance on tasks like technical search or legal document classification.


**News**  
**\[2024-05-16\]**  
Due to certain internal company considerations, we have temporarily removed the model weights. They will be uploaded again after passing our internal review process. In the meantime, please access this model via API: [https://platform.sensenova.cn/doc?path=/chat/Embeddings/Embeddings.md](https://platform.sensenova.cn/doc?path=/chat/Embeddings/Embeddings.md). The API described on that page currently has a temporary problem, so for now please access the model in the following way:

    import requests
    url = "http://103.237.28.72:8006/v1/qd"
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
    data = {
        "inputs": ['hello,world']
    }
    response = requests.post(url, json=data, headers=headers)
    print(response.json())
    

**\[2024-05-14\]**  
We have released our model weights, training code, and tech report. Discussions are welcome.  
For training code, please refer to our [github](https://github.com/hjq133/piccolo-embedding)  
For training details, please refer to our [tech-report](https://arxiv.org/abs/2405.06932)

**\[2024-04-22\]**

piccolo-large-zh-v2 currently ranks first on the C-MTEB list, leading the previous BERT model by about 1.9 points.

Piccolo-large-zh-v2
-------------------

piccolo-large-zh-v2 is a Chinese embedding model developed by the general model group from SenseTime Research. This upgraded version of Piccolo aims to prioritize general downstream fine-tuning methods. Piccolo2 primarily leverages an efficient multi-task hybrid loss training approach, effectively harnessing textual data and labels from diverse downstream tasks. In addition, Piccolo2 scales up the embedding dimension and uses MRL training to support more flexible vector dimensions.

Model Highlights
----------------

The main feature of piccolo2 is that it uses a multi-task hybrid loss during training.  
For retrieval/reranking tasks, we use the standard InfoNCE loss with in-batch negatives:

![](/sensenova/piccolo-large-zh-v2/resolve/main/assets/1.png)
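
For readers who cannot load the image, InfoNCE with in-batch negatives is commonly written as follows. The notation here is assumed and may differ from the figure: $n$ is the batch size, $s(\cdot,\cdot)$ a similarity function such as cosine similarity, $\tau$ a temperature, and $d_j^{+}$ the positive document of the $j$-th query, so that the positives of other queries serve as negatives for $q_i$. Any additional mined negatives would add further terms to the denominator.

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\frac{1}{n} \sum_{i=1}^{n}
    \log \frac{\exp\left( s(q_i, d_i^{+}) / \tau \right)}
              {\sum_{j=1}^{n} \exp\left( s(q_i, d_j^{+}) / \tau \right)}
```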

For STS/pair classification tasks, we use the CoSENT loss, which has been shown to work better for data with fine-grained labels (e.g. score values):

![](/sensenova/piccolo-large-zh-v2/resolve/main/assets/2.png)
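
As a rough illustration of how a CoSENT-style ranking loss behaves, the NumPy sketch below penalizes every pair of text pairs whose predicted cosine similarities are ordered against their labels. The function name and the scale value are assumptions, and the exact form in the figure may differ.

```python
import numpy as np

def cosent_loss(cos_sims, labels, scale=20.0):
    """cos_sims: (n,) predicted cosine similarity of each text pair; labels: (n,) gold scores.

    scale: assumed value for illustration only.
    """
    cos_sims = np.asarray(cos_sims, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # diff[a, b] = scale * (cos_sims[b] - cos_sims[a])
    diff = scale * (cos_sims[None, :] - cos_sims[:, None])
    # Keep only (a, b) where pair a is labeled as more similar than pair b.
    mask = labels[:, None] > labels[None, :]
    terms = diff[mask]
    # log(1 + sum(exp(terms))), computed in a numerically stable way.
    return float(np.logaddexp.reduce(np.concatenate([[0.0], terms])))

gold = [1.0, 0.5, 0.0]
ordered_loss = cosent_loss([0.9, 0.5, 0.1], gold)    # predictions agree with the label order
reversed_loss = cosent_loss([0.1, 0.5, 0.9], gold)   # predictions contradict the label order
print(ordered_loss, reversed_loss)
```

Only the relative order of the labels matters here, which is why this kind of loss suits graded similarity scores rather than binary labels.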

For classification/clustering tasks, we treat a text and its semantic labels as positive and negative pairs, which converts the dataset into triplets, and then optimize with InfoNCE. However, it is important to stress that in-batch negatives are no longer used here, because they can easily lead to conflicting training targets:

![](/sensenova/piccolo-large-zh-v2/resolve/main/assets/3.png)
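
The conversion of a classification dataset into triplets can be sketched in plain Python. The field names follow the (text, text_pos, text_neg) convention used for piccolo v1; the helper name and the toy labels are hypothetical.

```python
def classification_to_triplets(samples, label_texts):
    """samples: list of (text, label_id); label_texts: dict mapping label_id to label text.

    Each text gets its own label text as the positive and all other label texts as negatives.
    """
    triplets = []
    for text, label in samples:
        triplets.append({
            "text": text,
            "text_pos": label_texts[label],
            "text_neg": [t for other, t in label_texts.items() if other != label],
        })
    return triplets

toy_labels = {0: "negative review", 1: "positive review"}
toy_samples = [("great phone", 1), ("battery died in a day", 0)]
triplets = classification_to_triplets(toy_samples, toy_labels)
print(triplets[0])
```

Since negatives come only from the label set, two texts with the same label in one batch are never pushed apart, which is presumably the conflict that in-batch negatives would cause.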

Experiments and Results
-----------------------

Piccolo2 primarily focuses on the general downstream fine-tuning paradigm. Our open-source model uses [stella-v3.5](https://huggingface.co/infgrad/stella-mrl-large-zh-v3.5-1792d) as initialization and was trained for about 2500 steps on 32 GPUs. For more implementation details, please refer to our [technical report](https://arxiv.org/abs/2405.06932).

| Model Name | Model Size (GB) | Dimension | Sequence Length | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) | Average (35) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [**piccolo-large-zh-v2**](https://huggingface.co/sensenova/piccolo-large-zh-v2) | 1.21 | 1792 | 512 | 74.59 | 62.17 | 90.24 | 70 | 74.36 | 63.5 | 70.95 |
| [gte-Qwen1.5-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct) | 26.45 | 32768 | 4096 | 73.35 | 67.08 | 88.52 | 66.38 | 70.62 | 62.32 | 69.56 |
| [acge-text-embedding](https://huggingface.co/aspire/acge_text_embedding) | 1.21 | 1792 | 512 | 72.75 | 58.7 | 87.84 | 67.98 | 72.93 | 62.09 | 69.07 |

Usage
-----

The piccolo model can be easily accessed in the sentence-transformer package:

    # for s2s/s2p dataset, you can use piccolo as below
    from sklearn.preprocessing import normalize
    from sentence_transformers import SentenceTransformer
    sentences = ["1", "2"]
    matryoshka_dim=1792 # support 256, 512, 768, 1024, 1280, 1536, 1792
    model = SentenceTransformer('sensenova/piccolo-large-zh-v2')
    embeddings_1 = model.encode(sentences, normalize_embeddings=False)
    embeddings_2 = model.encode(sentences, normalize_embeddings=False)
    embeddings_1 = normalize(embeddings_1[..., :matryoshka_dim], norm="l2", axis=1)
    embeddings_2 = normalize(embeddings_2[..., :matryoshka_dim], norm="l2", axis=1)
    similarity = embeddings_1 @ embeddings_2.T
    

**Model List**
--------------

| Model | Language | Description | prompt |
| --- | --- | --- | --- |
| [sensenova/piccolo-large-zh-v2](https://huggingface.co/sensenova/piccolo-large-zh-v2) | Chinese | version2: fine-tuned with multi-task hybrid loss training | None |
| [sensenova/piccolo-large-zh](https://huggingface.co/sensenova/piccolo-large-zh) | Chinese | version1: pretrained on 400 million Chinese text pairs | '查询'/'结果' |
| [sensenova/piccolo-base-zh](https://huggingface.co/sensenova/piccolo-base-zh) | Chinese | version1: pretrained on 400 million Chinese text pairs | '查询'/'结果' |

Citation
--------

If you find our tech report, models or code helpful, please cite our report or give a star on github or huggingface!

    @misc{2405.06932,
    Author = {Junqin Huang and Zhongjie Hu and Zihao Jing and Mengya Gao and Yichao Wu},
    Title = {Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training},
    Year = {2024},
    Eprint = {arXiv:2405.06932},
    }

## Model overview

The `piccolo-large-zh-v2` model is a Chinese text embedding model developed by the General Model Group from SenseTime Research. This upgraded version of the original Piccolo model aims to improve upon general downstream fine-tuning methods. Piccolo2 primarily leverages an efficient multi-task hybrid loss training approach, effectively harnessing textual data and labels from diverse downstream tasks. Additionally, Piccolo2 scales up the embedding dimension and uses MRL training to support more flexible vector dimensions.

Compared to similar models like the [piccolo-large-zh](https://aimodels.fyi/models/huggingFace/piccolo-large-zh-sensenova) and [Baichuan2-7B-Base](https://aimodels.fyi/models/huggingFace/baichuan2-7b-base-baichuan-inc), the `piccolo-large-zh-v2` model utilizes a multi-task hybrid loss training approach and larger embedding dimensions to enhance its performance on downstream tasks.

## Model inputs and outputs

### Inputs
- **Text**: The `piccolo-large-zh-v2` model takes text inputs and generates text embeddings.

### Outputs
- **Text embeddings**: The model outputs fixed-size vector representations of the input text, which can be used for a variety of downstream NLP tasks such as text classification, retrieval, and similarity matching.

## Capabilities

The `piccolo-large-zh-v2` model has demonstrated strong performance on the C-MTEB benchmark, outperforming previous BERT models by around 1.9 points. The model's key capabilities include:

- Effective text representation learning through a multi-task hybrid loss training approach
- Support for flexible vector dimensions through MRL training
- Robust performance on a wide range of NLP tasks, including text retrieval, classification, and similarity matching

## What can I use it for?

The `piccolo-large-zh-v2` model can be used for a variety of NLP applications that require high-quality text embeddings, such as:

- Semantic search and information retrieval
- Text classification and clustering
- Recommendation systems
- Question-answering and dialog systems

The model's strong performance and efficient architecture make it a suitable choice for a wide range of applications that require high-quality text representations.

## Things to try

One interesting aspect of the `piccolo-large-zh-v2` model is its use of a multi-task hybrid loss training approach. This allows the model to effectively leverage diverse datasets and task labels, leading to improved performance on downstream tasks. Researchers and developers could experiment with applying this training strategy to other NLP models or datasets to see if similar performance gains can be achieved.

Additionally, the model's support for flexible vector dimensions through MRL training makes it possible to trade embedding size against quality. Users could experiment with truncating the vectors to a smaller supported dimension to find a balance between index storage, search speed, and task-specific performance.
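
The dimension trade-off can be explored offline before touching a real index. The sketch below uses random vectors as a stand-in for `model.encode(...)` output and a hypothetical helper, `truncate_and_normalize`, that mirrors the slice-then-normalize step from the usage example without requiring scikit-learn. The list of dimensions is taken from the comment in that example.

```python
import numpy as np

SUPPORTED_DIMS = [256, 512, 768, 1024, 1280, 1536, 1792]

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each embedding, then L2-normalize each row."""
    truncated = embeddings[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)  # guard against zero vectors

rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1792)).astype(np.float32)  # random stand-in, not real embeddings

for dim in SUPPORTED_DIMS:
    vecs = truncate_and_normalize(full, dim)
    # A float32 vector takes 4 bytes per component.
    print(dim, vecs.shape, f"{dim * 4} bytes per vector")

small = truncate_and_normalize(full, 256)
```

At 256 dimensions a float32 vector takes 1024 bytes instead of 7168 bytes at 1792 dimensions; how much retrieval quality that costs has to be measured on the target task.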