piccolo-large-zh
----------------

piccolo is a general text embedding model for Chinese, powered by the General Model Group from SenseTime Research. Inspired by E5 and GTE, piccolo is trained using a two-stage pipeline. In the first stage, we collect and crawl 400 million weakly supervised Chinese text pairs from the Internet and train the model with a pair (text, text\_pos) softmax contrastive loss. In the second stage, we collect a dataset of 20 million human-labeled Chinese text pairs and finetune the model with a triplet (text, text\_pos, text\_neg) contrastive loss. We currently offer two model sizes: piccolo-base-zh and piccolo-large-zh.

Metric
------

We compared the performance of piccolo with other embedding models on the C-MTEB benchmark; please refer to the C-MTEB leaderboard. We provide scripts in the ["eval" folder](https://huggingface.co/sensenova/piccolo-base-zh/tree/main/eval) for reproducing the results.

| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
|---|---|---|---|---|---|---|---|---|---|---|
| **piccolo-large-zh** | 0.65 | 1024 | 512 | **64.11** | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
| bge-large-zh | 1.3 | 1024 | 512 | 63.96 | 68.32 | 48.39 | 78.94 | 65.11 | 71.52 | 54.98 |
| **piccolo-base-zh** | 0.2 | 768 | 512 | **63.66** | 66.98 | 47.12 | 76.61 | 66.68 | 71.2 | 55.9 |
| bge-large-zh-no-instruct | 1.3 | 1024 | 512 | 63.4 | 68.58 | 50.01 | 76.77 | 64.9 | 70.54 | 53 |
| bge-base-zh | 0.41 | 768 | 512 | 62.8 | 67.07 | 47.64 | 77.5 | 64.91 | 69.53 | 54.12 |
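The "Average (35)" column appears to be the mean over all 35 datasets, which is the same as weighting each category score by its dataset count. The short check below reproduces the piccolo-large-zh figure under that assumption; the dictionary and function names are illustrative and not part of the eval scripts.

```python
# Number of datasets per task category, as given in the table header.
TASK_COUNTS = {
    "Classification": 9,
    "Clustering": 4,
    "Pair Classification": 2,
    "Reranking": 4,
    "Retrieval": 8,
    "STS": 8,
}


def weighted_average(category_scores, task_counts=TASK_COUNTS):
    """Average category scores, weighting each by its number of datasets."""
    total_tasks = sum(task_counts.values())
    weighted_sum = sum(category_scores[name] * n for name, n in task_counts.items())
    return weighted_sum / total_tasks


# Category scores from the piccolo-large-zh row.
piccolo_large = {
    "Classification": 67.03,
    "Clustering": 47.04,
    "Pair Classification": 78.38,
    "Reranking": 65.98,
    "Retrieval": 70.93,
    "STS": 58.02,
}

print(round(weighted_average(piccolo_large), 2))  # 64.11
```

Because the category scores are themselves rounded, the recomputed average can differ from the reported one in the last digit for other rows.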

Usage
-----

piccolo can be used directly through the sentence-transformers package:

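    # For s2s (sentence-to-sentence) data, encode the texts directly.
    from sentence_transformers import SentenceTransformer

    sentences = ["1", "2"]
    # 'sensenova/piccolo-large-zh' can be substituted for the large model.
    model = SentenceTransformer('sensenova/piccolo-base-zh')
    # With normalize_embeddings=True the dot product below is cosine similarity.
    embeddings_1 = model.encode(sentences, normalize_embeddings=True)
    embeddings_2 = model.encode(sentences, normalize_embeddings=True)
    similarity = embeddings_1 @ embeddings_2.T
    print(similarity)

    # For s2p (sentence-to-passage) data, we recommend adding an instruction
    # prefix for passage retrieval.
    from sentence_transformers import SentenceTransformer

    queries = ['query_1', 'query_2']
    passages = ["doc_1", "doc_2"]
    model = SentenceTransformer('sensenova/piccolo-base-zh')
    # ASSUMPTION: the literal prefix strings were lost in this copy of the example.
    # The values below are the Chinese equivalents of the 'query: ' / 'result: '
    # instructions described under Training Detail; verify them against the
    # upstream model card before relying on them.
    QUERY_PREFIX = "查询: "
    PASSAGE_PREFIX = "结果: "
    q_embeddings = model.encode([QUERY_PREFIX + q for q in queries], normalize_embeddings=True)
    p_embeddings = model.encode([PASSAGE_PREFIX + p for p in passages], normalize_embeddings=True)
    # scores[i][j] is the similarity between query i and passage j.
    scores = q_embeddings @ p_embeddings.T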

Training Detail
---------------

### Pretrain

Pretraining usually does not require a large max length; 128 is recommended. A small max length allows a larger batch size and faster training, which suits large-scale data. We use binary contrastive loss as the pretraining loss, with in-batch negatives only and no hard negatives. In actual training, we used 32 A100 GPUs (40 GB each) with a per-GPU batch size of 1024.
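To make the idea of in-batch negatives concrete, here is a minimal pure-Python sketch of a softmax contrastive loss in which each text's positive is its paired text and every other positive in the batch acts as a negative. It is an illustration under stated assumptions, not the training code: the function name and the `temperature` value of 0.05 are hypothetical, and the actual pretraining loss may differ in detail.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _l2_normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def in_batch_contrastive_loss(texts, positives, temperature=0.05):
    """Mean cross-entropy where positives[i] is the target for texts[i]
    and all other rows of `positives` serve as in-batch negatives."""
    t = [_l2_normalize(v) for v in texts]
    p = [_l2_normalize(v) for v in positives]
    losses = []
    for i, ti in enumerate(t):
        # Cosine similarity to every positive in the batch, scaled by temperature.
        logits = [_dot(ti, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings": aligned pairs give a much lower loss.
aligned = in_batch_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = in_batch_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < swapped)  # True
```

With this formulation a larger batch supplies more negatives per example at no extra encoding cost, which is one reason a short max length and a large batch size go together.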

### Finetune

For finetuning, we usually expand the max length to 512. To adapt to longer text inputs, finetuning samples more S2P data to improve the model's performance on retrieval tasks. The finetune loss is a triplet contrastive loss with hard negatives; the number of negatives is usually set to 2-7. For the loss calculation, refer to the improved contrastive loss in GTE. Note: we set different max lengths for query and passage, and the query max length is always kept at 64.
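The sketch below shows, for a single query, how hard negatives enter a softmax contrastive loss: the positive competes against the hard negatives (and, optionally, in-batch negatives) in one softmax. This is a simplified illustration and not the exact GTE formulation; the function name, argument names, and the `temperature` value are assumptions.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _l2_normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def hard_negative_contrastive_loss(query, positive, hard_negatives,
                                   in_batch_negatives=(), temperature=0.05):
    """Cross-entropy over [positive, hard negatives, in-batch negatives],
    with the positive at index 0 as the target."""
    q = _l2_normalize(query)
    candidates = [positive, *hard_negatives, *in_batch_negatives]
    logits = [_dot(q, _l2_normalize(c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


query = [1.0, 0.0]
positive = [1.0, 0.1]
# Two negatives per query, the low end of the 2-7 range mentioned above.
easy = hard_negative_contrastive_loss(query, positive, [[0.0, 1.0], [-1.0, 0.0]])
hard = hard_negative_contrastive_loss(query, positive, [[1.0, 0.2], [1.0, 0.3]])
print(easy < hard)  # True
```

Negatives that lie close to the query produce a larger loss and therefore a stronger training signal, which is the motivation for mining hard negatives in the second stage.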

### Others

Some useful tricks:

1.  Reducing memory usage: fp16 + gradient checkpointing + ZeRO stage 1 (stage 2 does not support gradient checkpointing under the double-tower structure). For the related issue, see: [https://github.com/microsoft/DeepSpeed/issues/988](https://github.com/microsoft/DeepSpeed/issues/988)
2.  Dataset sampler: we use M3E's dataset sampler to ensure that the samples in each batch come from a single dataset, which makes the negative samples more valuable.
3.  Instruction: in our experiments, instructions greatly improved performance on retrieval tasks. We added instructions like 'query: ' and 'result: ' before each training sample.
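As an illustration of the dataset sampler idea, the following sketch builds batches so that every batch draws from exactly one dataset, then shuffles the batch order across datasets. It is a simplified stand-in and not M3E's implementation; the function name, the `drop_last` flag, and the toy dataset names are hypothetical.

```python
import random


def single_dataset_batches(datasets, batch_size, drop_last=False, seed=0):
    """Return a shuffled list of (dataset_name, batch) pairs in which
    every batch contains examples from a single dataset only."""
    rng = random.Random(seed)
    batches = []
    for name, examples in datasets.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)  # shuffle within the dataset
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            if drop_last and len(batch) < batch_size:
                continue
            batches.append((name, batch))
    rng.shuffle(batches)  # interleave datasets at the batch level
    return batches


toy = {
    "sts_pairs": [("sts", i) for i in range(10)],
    "retrieval_pairs": [("ret", i) for i in range(7)],
}
for name, batch in single_dataset_batches(toy, batch_size=4):
    print(name, len(batch))
```

Since in-batch negatives then come from the same dataset as the positive, they tend to be closer in domain and style, and so harder to tell apart than negatives drawn from an unrelated source.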

Reference
---------

Here we list the embedding projects and papers we have referenced:

1.  [M3E](https://github.com/wangyuxinwhy/uniem). A great Chinese open source embedding project that collects and organizes a large number of high-quality Chinese datasets. Uniem is also a good framework.
2.  [Text2vec](https://github.com/shibing624/text2vec). Another great Chinese open source embedding project.
3.  [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding). Zhiyuan AI's (BAAI) open source embedding models. They collected and organized the CMTEB benchmark, filling the gap in systematic evaluation of Chinese embeddings.
4.  [E5](https://github.com/microsoft/unilm/tree/master/e5). Powered by Microsoft, with very detailed ablation experiments and data processing and filtering details.
5.  [GTE](https://huggingface.co/thenlper/gte-base). An embedding paper from Alibaba DAMO.

License
-------

Piccolo uses the MIT License. It can be used for commercial purposes free of charge.

Acknowledgement
---------------

piccolo is powered by the General Model Group from SenseTime Research. [Jinkin](https://huggingface.co/Jinkin) completed the code implementation and model training. [Jinkin](https://huggingface.co/Jinkin) and [CCCCxxx](https://huggingface.co/CCCCxxx) completed the data collection, processing, and model evaluation together. The project is led by [Gaomengya](https://huggingface.co/gaomengya) and [chaorenwu111](https://huggingface.co/chaorenwu111). We also thank [lux0933](https://huggingface.co/lux0933) and [yangkai001](https://huggingface.co/yangkai001) for discussions that provided many useful suggestions.

## Model Overview

`piccolo-large-zh` is a general text embedding model for Chinese, powered by the General Model Group from SenseTime Research. Inspired by E5 and GTE, `piccolo` is trained using a two-stage pipeline. First, the model is trained on 400 million weakly supervised Chinese text pairs collected from the internet, using a pair (text, text_pos) softmax contrastive loss. In the second stage, the model is fine-tuned on 20 million human-labeled Chinese text pairs, using a triplet (text, text_pos, text_neg) contrastive loss. This approach is intended to help `piccolo-large-zh` capture semantic information that transfers to a variety of downstream tasks.

The `piccolo-large-zh` model produces 1024-dimensional embeddings and accepts input sequences of up to 512 tokens. On the C-MTEB benchmark it reaches an average score of 64.11 across 35 datasets, ahead of bge-large-zh (63.96) and piccolo-base-zh (63.66).

## Model Inputs and Outputs

### Inputs
- Text sequences up to 512 tokens long

### Outputs
- 1024-dimensional text embeddings that capture the semantic meaning of the input text

## Capabilities

The `piccolo-large-zh` model encodes Chinese text into semantic vector representations. These embeddings can be used for a variety of downstream tasks, such as:

- Information retrieval: The embeddings can be used to find relevant documents or passages given a query.
- Semantic search: The model can be used to find similar documents or passages based on their semantic content.
- Text classification: The embeddings can be used as features for training text classification models.
- Paraphrase detection: The model can be used to identify paraphrases of a given input text.
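For the retrieval and semantic search cases, ranking reduces to a cosine-similarity top-k over embeddings. The sketch below uses small hand-written vectors in place of real model output so that it runs without downloading the model; in practice the vectors would come from `model.encode(...)`. The function name `top_k` is illustrative.

```python
import math


def _l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query_vec, passage_vecs, k=3):
    """Return the k (index, cosine similarity) pairs with the highest similarity."""
    q = _l2_normalize(query_vec)
    scored = []
    for idx, passage in enumerate(passage_vecs):
        p = _l2_normalize(passage)
        scored.append((idx, sum(a * b for a, b in zip(q, p))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for passage and query embeddings.
passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
print([idx for idx, _ in top_k(query, passages, k=2)])  # [0, 2]
```

For large collections, an approximate nearest-neighbour index is the usual replacement for this brute-force scan.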

## What Can I Use It For?

The `piccolo-large-zh` model can be used in a wide range of applications that involve working with Chinese text. Some potential use cases include:

- **Search and Recommendation**: Use the embeddings to build semantic search engines or recommendation systems for Chinese content.
- **Content Clustering and Organization**: Group related Chinese documents or passages based on their semantic similarity.
- **Text Analytics and Insights**: Extract meaningful insights from Chinese text data by leveraging the model's ability to capture semantic meaning.
- **Multilingual Applications**: Combine `piccolo-large-zh` with other language models to build cross-lingual applications.

## Things to Try

One aspect of the `piccolo-large-zh` model worth exploring is its support for input sequences of up to 512 tokens. This makes it a candidate for tasks involving longer Chinese text, such as document retrieval or question answering. You could measure the model's performance on such tasks and compare it with other Chinese embedding models.

Another interesting avenue to explore would be to fine-tune the `piccolo-large-zh` model on domain-specific data, such as scientific literature or legal documents, to see if it can capture specialized semantic knowledge in those areas. This could lead to improved performance on tasks like technical search or legal document classification.