shibing624/text2vec-base-chinese-sentence
=========================================

This is a CoSENT (Cosine Sentence) model: shibing624/text2vec-base-chinese-sentence.

It maps sentences to a 768-dimensional dense vector space and can be used for tasks like sentence embeddings, text matching, or semantic search.

*   training dataset: [https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-chinese-sentence-dataset](https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-chinese-sentence-dataset)
*   base model: nghuyong/ernie-3.0-base-zh
*   max\_seq\_length: 256
*   best epoch: 3
*   sentence embedding dim: 768

Evaluation
----------

For an automated evaluation of this model, see the _Evaluation Benchmark_: [text2vec](https://github.com/shibing624/text2vec)

### Release Models

Evaluation results of the released models:

| Arch | BaseModel | Model | ATEC | BQ | LCQMC | PAWSX | STS-B | SOHU-dd | SOHU-dc | Avg | QPS |
|:--|:--|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Word2Vec | word2vec | [w2v-light-tencent-chinese](https://ai.tencent.com/ailab/nlp/en/download.html) | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 55.04 | 20.70 | 35.03 | 23769 |
| SBERT | xlm-roberta-base | [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 63.01 | 52.28 | 46.46 | 3138 |
| Instructor | hfl/chinese-roberta-wwm-ext | [moka-ai/m3e-base](https://huggingface.co/moka-ai/m3e-base) | 41.27 | 63.81 | 74.87 | 12.20 | 76.96 | 75.83 | 60.55 | 57.93 | 2980 |
| CoSENT | hfl/chinese-macbert-base | [shibing624/text2vec-base-chinese](https://huggingface.co/shibing624/text2vec-base-chinese) | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 70.27 | 50.42 | 51.61 | 3008 |
| CoSENT | hfl/chinese-lert-large | [GanymedeNil/text2vec-large-chinese](https://huggingface.co/GanymedeNil/text2vec-large-chinese) | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 73.01 | 59.04 | 53.12 | 2092 |
| CoSENT | nghuyong/ernie-3.0-base-zh | [shibing624/text2vec-base-chinese-sentence](https://huggingface.co/shibing624/text2vec-base-chinese-sentence) | 43.37 | 61.43 | 73.48 | 38.90 | 78.25 | 70.60 | 53.08 | 59.87 | 3089 |
| CoSENT | nghuyong/ernie-3.0-base-zh | [shibing624/text2vec-base-chinese-paraphrase](https://huggingface.co/shibing624/text2vec-base-chinese-paraphrase) | 44.89 | 63.58 | 74.24 | 40.90 | 78.93 | 76.70 | 63.30 | **63.08** | 3066 |
| CoSENT | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | [shibing624/text2vec-base-multilingual](https://huggingface.co/shibing624/text2vec-base-multilingual) | 32.39 | 50.33 | 65.64 | 32.56 | 74.45 | 68.88 | 51.17 | 53.67 | 4004 |



*   Evaluation metric: Spearman correlation coefficient.
*   `shibing624/text2vec-base-chinese`: trained with CoSENT on `hfl/chinese-macbert-base` using STS-B data and evaluated on STS-B. Training script: [examples/training\_sup\_text\_matching\_model.py](https://github.com/shibing624/text2vec/blob/master/examples/training_sup_text_matching_model.py). Available on the HF model hub.
*   `shibing624/text2vec-base-chinese-sentence`: trained with CoSENT on `nghuyong/ernie-3.0-base-zh` using the STS dataset [shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset](https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-chinese-sentence-dataset) and evaluated on the NLI test sets. Training script: [examples/training\_sup\_text\_matching\_model\_jsonl\_data.py](https://github.com/shibing624/text2vec/blob/master/examples/training_sup_text_matching_model_jsonl_data.py). Available on the HF model hub; suited to s2s (sentence vs. sentence) matching.
*   `shibing624/text2vec-base-chinese-paraphrase`: trained with CoSENT on `nghuyong/ernie-3.0-base-zh` using the STS dataset [shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset](https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-chinese-paraphrase-dataset), which extends [shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset](https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-chinese-sentence-dataset) with s2p (sentence to paraphrase) data, and reaches SOTA on the NLI test sets. Training script: [examples/training\_sup\_text\_matching\_model\_jsonl\_data.py](https://github.com/shibing624/text2vec/blob/master/examples/training_sup_text_matching_model_jsonl_data.py). Available on the HF model hub; suited to s2p matching.
*   `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`: trained with SBERT; related to the `paraphrase-MiniLM-L12-v2` model.
*   `w2v-light-tencent-chinese`: a Word2Vec model that can be loaded and run on CPU.
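
The notes above name Spearman correlation as the evaluation metric. As a rough, dependency-free sketch of how such a score could be computed from predicted similarities and gold labels (the helper names `average_ranks`, `pearson`, and `spearman` are illustrative and not part of text2vec; the project's own evaluation code may differ):

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, gold):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(gold))


# Toy data: predicted cosine similarities vs. binary gold labels.
predicted = [0.9, 0.2, 0.6, 0.4]
gold = [1, 0, 1, 0]
score = spearman(predicted, gold)
```

Only the ordering of the predicted scores matters here, which is why the metric suits models whose raw cosine values are not calibrated to the label scale.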

The `shibing624/text2vec-base-chinese-nli` model is available under [tag 1.0](https://huggingface.co/shibing624/text2vec-base-chinese-sentence/tree/1.0).

Usage (text2vec)
----------------

Using this model becomes easy when you have [text2vec](https://github.com/shibing624/text2vec) installed:

    pip install -U text2vec
    

Then you can use the model like this:

    from text2vec import SentenceModel
    # Illustrative Chinese sentences; replace with your own input
    sentences = ['今天天气很好', '今天天气不错']
    
    model = SentenceModel('shibing624/text2vec-base-chinese-sentence')
    embeddings = model.encode(sentences)
    print(embeddings)
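
For text matching, two embeddings are typically compared with cosine similarity. The following is a minimal sketch under stated assumptions: `cosine_similarity` is an illustrative helper rather than a text2vec function, and short toy vectors stand in for the 768-dimensional output of `model.encode`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [-3.0, 0.0, 1.0]  # orthogonal to emb_a

score_ab = cosine_similarity(emb_a, emb_b)
score_ac = cosine_similarity(emb_a, emb_c)
```

With real output, the same helper could be called as `cosine_similarity(embeddings[0], embeddings[1])`, since each row of the encoded batch is a sequence of floats.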
    

Usage (HuggingFace Transformers)
--------------------------------

Without [text2vec](https://github.com/shibing624/text2vec), you can use the model like this:

First, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.

Install transformers:

    pip install transformers
    

Then load model and predict:

    from transformers import BertTokenizer, BertModel
    import torch
    
    # Mean Pooling - Take attention mask into account for correct averaging
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    # Load model from HuggingFace Hub
    tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese-sentence')
    model = BertModel.from_pretrained('shibing624/text2vec-base-chinese-sentence')
    # Illustrative Chinese sentences; replace with your own input
    sentences = ['今天天气很好', '今天天气不错']
    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
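
To make the role of the attention mask concrete, here is a small worked example of the same masked averaging on plain Python lists. It is a sketch for a single sentence, with made-up 2-dimensional token vectors and an illustrative function name (`masked_mean_pool`), not a replacement for the tensor version above.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                summed[d] += vector[d]
    # Guard against division by zero for an all-padding input,
    # mirroring the clamp(min=1e-9) in the tensor version.
    count = max(count, 1e-9)
    return [s / count for s in summed]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
```

The padding vector does not affect `pooled`, which is the point of weighting by the mask: sentences of different lengths in one padded batch still get an average over their own tokens only.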
    

Usage (sentence-transformers)
-----------------------------

[sentence-transformers](https://github.com/UKPLab/sentence-transformers) is a popular library to compute dense vector representations for sentences.

Install sentence-transformers:

    pip install -U sentence-transformers
    

Then load model and predict:

    from sentence_transformers import SentenceTransformer
    
    m = SentenceTransformer("shibing624/text2vec-base-chinese-sentence")
    # Illustrative Chinese sentences; replace with your own input
    sentences = ['今天天气很好', '今天天气不错']
    
    sentence_embeddings = m.encode(sentences)
    print("Sentence embeddings:")
    print(sentence_embeddings)
    

Full Model Architecture
-----------------------

    CoSENT(
      (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: ErnieModel 
      (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
    )
    

Intended uses
-------------

Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.

By default, input text longer than 256 word pieces is truncated.
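
Because text beyond the limit is dropped, longer documents may need to be split before encoding. The sketch below shows one possible approach, under two loudly stated assumptions: that each Chinese character maps to roughly one word piece (Latin words and numbers can behave differently), and that two positions are reserved for special tokens. `split_into_chunks` and `MAX_PIECES` are hypothetical names, not part of text2vec.

```python
import re

MAX_PIECES = 256  # max_seq_length stated in this model card


def split_into_chunks(text, max_chars=MAX_PIECES - 2):
    """Greedily pack sentences into chunks of at most max_chars characters.

    Character count is used as a rough proxy for word pieces (assumption).
    """
    # Split after sentence-ending punctuation, keeping the punctuation mark.
    sentences = [s for s in re.split(r'(?<=[。！？!?])', text) if s]
    chunks, current = [], ''
    for sentence in sentences:
        # A single sentence longer than the budget is cut at fixed width.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Start a new chunk when adding this sentence would exceed the budget.
        if len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ''
        current += sentence
    if current:
        chunks.append(current)
    return chunks


long_text = '你好。' * 100  # 300 characters
chunks = split_into_chunks(long_text)
```

Each chunk can then be encoded separately; how to combine the chunk vectors (for example by averaging) is a design choice this model card does not prescribe.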

Training procedure
------------------

### Pre-training

We use the pretrained [`nghuyong/ernie-3.0-base-zh`](https://huggingface.co/nghuyong/ernie-3.0-base-zh) model. Please refer to the model card for more detailed information about the pre-training procedure.

### Fine-tuning

We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each sentence pair in the batch, then apply a rank loss that compares true pairs against false pairs.
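
As an illustration of this kind of rank loss, the sketch below implements one common formulation of the CoSENT objective: `log(1 + sum(exp(scale * (cos_neg - cos_pos))))`, summed over every ordering where one pair has a higher gold label than another. The function name `cosent_loss` and the `scale=20.0` default are assumptions for this example; the actual training script may differ in details.

```python
import math


def cosent_loss(cos_scores, labels, scale=20.0):
    """Rank loss over sentence pairs.

    cos_scores[i] is the predicted cosine similarity of pair i,
    labels[i] is its gold similarity. The loss penalises every case where
    a pair with a lower label receives a higher cosine score.
    """
    # The leading 0.0 contributes exp(0) = 1, i.e. the "1 +" inside the log.
    terms = [0.0]
    for score_hi, label_hi in zip(cos_scores, labels):
        for score_lo, label_lo in zip(cos_scores, labels):
            if label_hi > label_lo:
                terms.append(scale * (score_lo - score_hi))
    # Numerically stable log-sum-exp.
    largest = max(terms)
    return largest + math.log(sum(math.exp(t - largest) for t in terms))


# One true pair (label 1) and one false pair (label 0).
well_ordered = cosent_loss([0.9, 0.1], [1, 0])  # true pair scored higher
mis_ordered = cosent_loss([0.1, 0.9], [1, 0])   # false pair scored higher
```

The loss is close to zero when true pairs already score higher than false pairs and grows roughly linearly with the size of any violation, so only the relative ordering of cosine scores is optimised.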

Citing & Authors
----------------

This model was trained with [text2vec](https://github.com/shibing624/text2vec).

If you find this model helpful, feel free to cite:

    @software{text2vec,
      author = {Ming Xu},
      title = {text2vec: A Tool for Text to Vector},
      year = {2023},
      url = {https://github.com/shibing624/text2vec},
    }

## Model overview

The `text2vec-base-chinese-sentence` model is a CoSENT (Cosine Sentence) model developed by [shibing624](https://aimodels.fyi/creators/huggingFace/shibing624). It maps Chinese sentences to a 768-dimensional dense vector space, which can be used for tasks like sentence embeddings, text matching, or semantic search. This model is based on the [nghuyong/ernie-3.0-base-zh](https://huggingface.co/nghuyong/ernie-3.0-base-zh) model and was trained on a large dataset of natural language inference (NLI) data.

Similar models developed by shibing624 include [text2vec-base-chinese-paraphrase](https://aimodels.fyi/models/huggingFace/text2vec-base-chinese-paraphrase-shibing624), which was trained on paraphrase data, and [text2vec-base-multilingual](https://aimodels.fyi/models/huggingFace/text2vec-base-multilingual-shibing624), which supports multiple languages. All three can be used for sentence embedding tasks; which one to choose depends on the language and task requirements.

## Model inputs and outputs

### Inputs
- Chinese text, with a maximum sequence length of 256 word pieces.

### Outputs
- A 768-dimensional dense vector representation of the input sentence, capturing its semantic meaning.

## Capabilities

The `text2vec-base-chinese-sentence` model can be used to generate high-quality sentence embeddings for Chinese text. These embeddings can be used in a variety of natural language processing tasks, such as:

- **Semantic search**: The sentence embeddings can be used to find similar sentences or documents based on their meaning, rather than just keyword matching.
- **Text clustering**: The sentence embeddings can be used to group related sentences or documents together based on their semantic similarity.
- **Text matching**: The sentence embeddings can be used to determine the degree of similarity between two sentences, which can be useful for tasks like paraphrase identification or duplicate detection.
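
The semantic search use case above can be sketched as a nearest-neighbour lookup over precomputed embeddings. This is a minimal illustration with toy 2-dimensional vectors; `normalize` and `top_k` are hypothetical helper names, and a real system would use the model's 768-dimensional vectors and likely a vector index.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, score) for the k corpus vectors most similar to the query."""
    query = normalize(query_vec)
    scored = []
    for index, vector in enumerate(corpus_vecs):
        doc = normalize(vector)
        # Dot product of unit vectors equals cosine similarity.
        scored.append((sum(a * b for a, b in zip(query, doc)), index))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:k]]


# Toy corpus embeddings and a query embedding.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.1]
hits = top_k(query, corpus, k=2)
```

Normalizing once up front means ranking reduces to dot products, which is the form most approximate nearest-neighbour libraries expect.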

## What can I use it for?

The `text2vec-base-chinese-sentence` model can be used in a wide range of applications that involve processing Chinese text, such as:

- **Customer service chatbots**: The sentence embeddings can be used to understand the intent behind user queries and provide relevant responses.
- **Content recommendation systems**: The sentence embeddings can be used to find similar articles or products based on their semantic content, rather than just keywords.
- **Plagiarism detection**: The sentence embeddings can be used to identify similar passages of text, which can be useful for detecting plagiarism.

## Things to try

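One interesting aspect of the `text2vec-base-chinese-sentence` model is its performance on the [STS-B](https://www.aclweb.org/anthology/S17-2017/) (Semantic Textual Similarity Benchmark) task, where it achieved a Spearman correlation of 78.25, in line with the other CoSENT models in the evaluation above (74.45 to 79.44). Its average across the seven benchmarks is 59.87, and its PAWSX score (38.90) is more than double that of `shibing624/text2vec-base-chinese` (17.21). This suggests the model is a reasonable choice for tasks that require judging the semantic similarity between sentences.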

You could try using the model's sentence embeddings in a variety of downstream tasks, such as text classification, question answering, or information retrieval. You could also experiment with fine-tuning the model on your own domain-specific data to improve its performance on your particular use case.