![icon](/aspire/acge_text_embedding/resolve/main/img/logo.png)

acge model
-------------------------

The acge model comes from [INTSIG](https://www.intsig.com/); see also the [TextIn](https://www.textin.com/) platform.

Contact: [yanhui\_he@intsig.net](mailto:yanhui_he@intsig.net), [simon\_liu@intsig.net](mailto:simon_liu@intsig.net), [min\_du@intsig.net](mailto:min_du@intsig.net) ([HR](https://huggingface.co/aspire/acge_text_embedding/blob/main/img/hr.jpg)), or via [WeChat](https://huggingface.co/aspire/acge_text_embedding/blob/main/img/wx.jpg).

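acge is a text embedding model that uses [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) to produce embeddings of variable dimension, as illustrated below: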

[![matryoshka-small](/aspire/acge_text_embedding/resolve/main/img/matryoshka-small.gif)](/aspire/acge_text_embedding/blob/main/img/matryoshka-small.gif)

Embeddings can be used at 1024 or 1792 dimensions.

| Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
|---|---|---|---|---|---|
| acge-text-embedding | 0.65 | \[1024, 1792\] | 1024 | Chinese | NO |

Metric
-----------------

#### C-MTEB leaderboard (Chinese)

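Scores vary slightly with the GPU and the tensor type, so the model was evaluated 4 times on different GPUs (A10, A100) and tensor types; the outputs are stored under `result`. Following a suggestion from [infgrad](https://huggingface.co/infgrad), different input lengths were also tested, and a Sequence Length of 512 gave the best average score.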

| Model Name | GPU | tensor-type | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| acge\_text\_embedding | NVIDIA TESLA A10 | bfloat16 | 0.65 | 1792 | 1024 | 68.91 | 72.76 | 58.22 | 87.82 | 67.67 | 72.48 | 62.24 |
| acge\_text\_embedding | NVIDIA TESLA A100 | bfloat16 | 0.65 | 1792 | 1024 | 68.91 | 72.77 | 58.35 | 87.82 | 67.53 | 72.48 | 62.24 |
| acge\_text\_embedding | NVIDIA TESLA A100 | float16 | 0.65 | 1792 | 1024 | 68.99 | 72.76 | 58.68 | 87.84 | 67.89 | 72.49 | 62.24 |
| acge\_text\_embedding | NVIDIA TESLA A100 | float32 | 0.65 | 1792 | 1024 | 68.98 | 72.76 | 58.58 | 87.83 | 67.91 | 72.49 | 62.24 |
| acge\_text\_embedding | NVIDIA TESLA A100 | float16 | 0.65 | 1792 | 768 | 68.95 | 72.76 | 58.68 | 87.84 | 67.86 | 72.48 | 62.07 |
| acge\_text\_embedding | NVIDIA TESLA A100 | float16 | 0.65 | 1792 | 512 | 69.07 | 72.75 | 58.7 | 87.84 | 67.99 | 72.93 | 62.09 |
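
The Average (35) column is consistent with a task-count-weighted mean of the six category scores, where the weights are the task counts shown in the column headers. The sketch below checks this for two rows of the table. The function and variable names are illustrative, and because the category scores are already rounded, agreement is only expected to about two decimals.

```python
# Number of C-MTEB tasks per category, taken from the table headers.
TASK_COUNTS = {
    "Classification": 9,
    "Clustering": 4,
    "Pair Classification": 2,
    "Reranking": 4,
    "Retrieval": 8,
    "STS": 8,
}


def weighted_average(category_scores):
    """Average category scores, weighting each by its number of tasks."""
    total_tasks = sum(TASK_COUNTS.values())  # 35 tasks in total
    weighted_sum = sum(TASK_COUNTS[name] * score
                       for name, score in category_scores.items())
    return weighted_sum / total_tasks


# Row 1 of the table: A10, bfloat16, Sequence Length 1024.
a10_bf16 = {
    "Classification": 72.76, "Clustering": 58.22, "Pair Classification": 87.82,
    "Reranking": 67.67, "Retrieval": 72.48, "STS": 62.24,
}
# Last row of the table: A100, float16, Sequence Length 512.
a100_fp16_len512 = {
    "Classification": 72.75, "Clustering": 58.7, "Pair Classification": 87.84,
    "Reranking": 67.99, "Retrieval": 72.93, "STS": 62.09,
}

print(round(weighted_average(a10_bf16), 2))          # 68.91
print(round(weighted_average(a100_fp16_len512), 2))  # 69.07
```

Both values match the reported Average (35) entries for those rows.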

#### Reproduce our results

**C-MTEB:**

    import torch
    import argparse
    import functools
    import numpy as np  # needed for the np.ndarray type annotations below
    from C_MTEB.tasks import *
    from typing import List, Dict
    from sentence_transformers import SentenceTransformer
    from mteb import MTEB, DRESModel
    
    
    class RetrievalModel(DRESModel):
        def __init__(self, encoder, **kwargs):
            self.encoder = encoder
    
        def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
            input_texts = ['{}'.format(q) for q in queries]
            return self._do_encode(input_texts)
    
        def encode_corpus(self, corpus: List[Dict[str, str]], **kwargs) -> np.ndarray:
            input_texts = ['{} {}'.format(doc.get('title', ''), doc['text']).strip() for doc in corpus]
            input_texts = ['{}'.format(t) for t in input_texts]
            return self._do_encode(input_texts)
    
        @torch.no_grad()
        def _do_encode(self, input_texts: List[str]) -> np.ndarray:
            return self.encoder.encode(
                sentences=input_texts,
                batch_size=512,
                normalize_embeddings=True,
                convert_to_numpy=True
            )
    
    
    def get_args():
        parser = argparse.ArgumentParser()
        parser.add_argument('--model_name_or_path', default="acge_text_embedding", type=str)
        parser.add_argument('--task_type', default=None, type=str)
        parser.add_argument('--pooling_method', default='cls', type=str)
        parser.add_argument('--output_dir', default='zh_results',
                            type=str, help='output directory')
        parser.add_argument('--max_len', default=1024, type=int, help='max length')
        return parser.parse_args()
    
    
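    if __name__ == '__main__':
        args = get_args()
        # Load the model in half precision (float16).
        encoder = SentenceTransformer(args.model_name_or_path).half()
        # Always L2-normalize embeddings so dot products equal cosine similarity.
        encoder.encode = functools.partial(encoder.encode, normalize_embeddings=True)
        # Cap the input sequence length used during evaluation.
        encoder.max_seq_length = int(args.max_len)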
    
        task_names = [t.description["name"] for t in MTEB(task_types=args.task_type,
                                                          task_langs=['zh', 'zh-CN']).tasks]
        TASKS_WITH_PROMPTS = ["T2Retrieval", "MMarcoRetrieval", "DuRetrieval", "CovidRetrieval", "CmedqaRetrieval",
                              "EcomRetrieval", "MedicalRetrieval", "VideoRetrieval"]
        for task in task_names:
            evaluation = MTEB(tasks=[task], task_langs=['zh', 'zh-CN'])
            if task in TASKS_WITH_PROMPTS:
                evaluation.run(RetrievalModel(encoder), output_folder=args.output_dir, overwrite_results=False)
            else:
                evaluation.run(encoder, output_folder=args.output_dir, overwrite_results=False)
    
    

Usage
---------------

#### Using the acge model

Usage with the sentence-transformers library:

    from sentence_transformers import SentenceTransformer
    
    sentences = ["1", "2"]
    model = SentenceTransformer('acge_text_embedding')
    print(model.max_seq_length)
    embeddings_1 = model.encode(sentences, normalize_embeddings=True)
    embeddings_2 = model.encode(sentences, normalize_embeddings=True)
    similarity = embeddings_1 @ embeddings_2.T
    print(similarity)
    

Usage with the sentence-transformers library, selecting a different embedding dimension:

    from sklearn.preprocessing import normalize
    from sentence_transformers import SentenceTransformer
    
    sentences = ["1", "2"]
    model = SentenceTransformer('acge_text_embedding')
    embeddings = model.encode(sentences, normalize_embeddings=False)
    matryoshka_dim = 1024
    embeddings = embeddings[..., :matryoshka_dim]  # Shrink the embedding dimensions
    embeddings = normalize(embeddings, norm="l2", axis=1)
    print(embeddings.shape)
    # => (2, 1024)
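
The re-normalization step above matters because the first 1024 components of a unit-length 1792-dimensional vector are generally no longer unit length, so dot products on the truncated vectors would not be cosine similarities. The following is a minimal NumPy-only sketch of the same truncate-then-normalize step. It uses random vectors as a stand-in for `model.encode(...)` output, and the helper name `truncate_and_normalize` is hypothetical, not part of any library.

```python
import numpy as np


def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and rescale rows to unit L2 norm."""
    if dim > embeddings.shape[-1]:
        raise ValueError("dim must not exceed the embedding dimension")
    truncated = embeddings[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return truncated / np.clip(norms, 1e-12, None)


# Assumption: random data standing in for unnormalized 1792-dimensional model output.
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1792))

small = truncate_and_normalize(full, 1024)
print(small.shape)                                       # (2, 1024)
print(np.allclose(np.linalg.norm(small, axis=1), 1.0))   # True
```

With unit-length rows, `small @ small.T` can be read directly as a cosine similarity matrix.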

## Model overview

The `acge_text_embedding` model is a text embedding model developed by the team at [aspire](https://aimodels.fyi/creators/huggingFace/aspire). This model uses the [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) approach to map text to a vector representation. 

The `acge_text_embedding` model is similar to other [BGE](https://aimodels.fyi/models/huggingFace/bge-large-en-baai) and [text2vec](https://aimodels.fyi/models/huggingFace/text2vec-base-chinese-paraphrase-shibing624) text embedding models in that it can be used for tasks like retrieval, classification, and semantic search. However, the `acge_text_embedding` model was trained specifically on Chinese text data and may perform better on Chinese language tasks than English-focused models such as `bge-large-en`.

## Model inputs and outputs

### Inputs
- Chinese text as strings, with a maximum sequence length of 1024

### Outputs
- 1792-dimensional vector representations of the input text, which can be truncated to 1024 dimensions and re-normalized

## Capabilities

The `acge_text_embedding` model maps Chinese text to a dense vector of 1792 dimensions, or 1024 after truncation. These vector representations can then be used for a variety of downstream tasks such as:

- Retrieval: Finding relevant passages or documents given a query
- Classification: Classifying text into different categories
- Clustering: Grouping similar text together
- Semantic search: Finding semantically similar text
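
As an illustration of the retrieval and semantic search use cases, the sketch below ranks a small corpus against a query by cosine similarity. It assumes the embeddings are already L2-normalized (as with `normalize_embeddings=True`), uses tiny hand-written 3-dimensional vectors in place of real model output, and the helper name `top_k` is hypothetical.

```python
import numpy as np


def top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k corpus rows most similar to the query.

    Assumes all vectors are L2-normalized, so the dot product is the cosine similarity.
    """
    scores = corpus_vecs @ query_vec
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]


# Toy unit-length vectors standing in for document embeddings.
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
])
query = np.array([0.8, 0.6, 0.0])

results = top_k(query, corpus, k=2)
print([index for index, _ in results])  # [2, 0]
```

For large corpora, an approximate nearest-neighbor index is usually preferable to this exhaustive dot product, but the scoring rule is the same.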

## What can I use it for?

The `acge_text_embedding` model can be useful for a range of applications that require understanding the semantic meaning of Chinese text, such as:

- Building search engines or recommendation systems for Chinese content
- Powering chatbots or virtual assistants that interact with users in Chinese
- Analyzing Chinese text data for insights, such as in market research or social media monitoring

## Things to try

One interesting thing to try with the `acge_text_embedding` model is using it to find similar Chinese text passages or documents. By comparing the vector representations of different pieces of text, you can identify content that is semantically related, even if the wording is different. This can be useful for tasks like:

- Building a content recommendation system to suggest related articles or products to users
- Identifying duplicate or near-duplicate content in a large corpus of Chinese text
- Clustering Chinese text data into meaningful groups based on the underlying semantics
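
The near-duplicate idea above can be sketched as a pairwise cosine-similarity check against a threshold. This is a minimal illustration, not a tuned method: the threshold of 0.95 is an arbitrary assumption, the toy 2-dimensional vectors stand in for real embeddings, and `near_duplicate_pairs` is a hypothetical helper name.

```python
import numpy as np


def near_duplicate_pairs(embeddings: np.ndarray, threshold: float = 0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity is at least `threshold`.

    Assumes each row of `embeddings` is L2-normalized.
    """
    sims = embeddings @ embeddings.T
    # Only look at the upper triangle so each pair is reported once.
    rows, cols = np.triu_indices(len(embeddings), k=1)
    return [(int(i), int(j)) for i, j in zip(rows, cols) if sims[i, j] >= threshold]


# Toy vectors: rows 0 and 2 point in almost the same direction.
vecs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.999, 0.04],
])
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

pairs = near_duplicate_pairs(vecs, threshold=0.95)
print(pairs)  # [(0, 2)]
```

The full similarity matrix grows quadratically with corpus size, so this form is only practical for small collections.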

To get started, load the `acge_text_embedding` model with the `sentence-transformers` library, as shown in the Usage section above.