Sensenova

Models by this creator

piccolo-large-zh

sensenova

Total Score

58

The piccolo-large-zh is a general text embedding model for Chinese, powered by the General Model Group from SenseTime Research. Inspired by E5 and GTE, piccolo is trained using a two-stage pipeline. First, the model is trained on 400 million weakly supervised Chinese text pairs collected from the internet, using a pair (text, text_pos) softmax contrastive loss. In the second stage, the model is fine-tuned on 20 million human-labeled Chinese text pairs, using a triplet (text, text_pos, text_neg) contrastive loss. This approach enables piccolo-large-zh to capture rich semantic information and perform well on a variety of downstream tasks.

The piccolo-large-zh model has 1024 embedding dimensions and can handle input sequences up to 512 tokens long. It outperforms other Chinese embedding models like bge-large-zh and piccolo-base-zh on the C-MTEB benchmark, achieving an average score of 64.11 across 35 datasets.

Model Inputs and Outputs

Inputs

Text sequences up to 512 tokens long

Outputs

1024-dimensional text embeddings that capture the semantic meaning of the input text

Capabilities

The piccolo-large-zh model is highly capable at encoding Chinese text into semantic representations. These embeddings can be used for a variety of downstream tasks, such as:

Information retrieval: The embeddings can be used to find relevant documents or passages given a query.
Semantic search: The model can be used to find similar documents or passages based on their semantic content.
Text classification: The embeddings can be used as features for training text classification models.
Paraphrase detection: The model can be used to identify paraphrases of a given input text.

What Can I Use It For?

The piccolo-large-zh model can be used in a wide range of applications that involve working with Chinese text. Some potential use cases include:

Search and Recommendation: Use the embeddings to build semantic search engines or recommendation systems for Chinese content.
Content Clustering and Organization: Group related Chinese documents or passages based on their semantic similarity.
Text Analytics and Insights: Extract meaningful insights from Chinese text data by leveraging the model's ability to capture semantic meaning.
Multilingual Applications: Combine piccolo-large-zh with other language models to build cross-lingual applications.

Things to Try

One interesting aspect of the piccolo-large-zh model is its ability to handle long input sequences, up to 512 tokens. This makes it well-suited for tasks involving long-form Chinese text, such as document retrieval or question answering. You could try experimenting with the model's performance on such tasks and see how it compares to other Chinese language models.

Another interesting avenue to explore would be to fine-tune the piccolo-large-zh model on domain-specific data, such as scientific literature or legal documents, to see if it can capture specialized semantic knowledge in those areas. This could lead to improved performance on tasks like technical search or legal document classification.
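The retrieval and semantic search use cases described above usually come down to ranking candidate embeddings by cosine similarity to a query embedding. The following is a minimal sketch of that ranking step only, not the model's own API: the helper names `cosine_similarity` and `rank_documents` are hypothetical, and the 4-dimensional vectors are made-up stand-ins for the 1024-dimensional embeddings the model would produce.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors for illustration only; these are NOT outputs of piccolo-large-zh.
doc_vecs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
]
query_vec = [1.0, 0.0, 0.0, 0.0]

for idx, score in rank_documents(query_vec, doc_vecs, top_k=2):
    print(idx, round(score, 3))
```

In a real pipeline the toy vectors would be replaced by embeddings computed from Chinese queries and passages, and a vector index would typically replace the linear scan once the collection grows.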
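The two training objectives named in the overview can be illustrated with a small sketch. This is one common formulation and is an assumption, not the confirmed piccolo recipe: the pair loss is written as an in-batch softmax (InfoNCE-style) loss, the triplet loss as a softmax over the positive and the explicit negative, and the temperature of 0.05 is a placeholder value. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def softmax_cross_entropy(logits: Sequence[float], target: int) -> float:
    """-log softmax(logits)[target], using the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def pair_contrastive_loss(
    texts: Sequence[Vector], positives: Sequence[Vector], temperature: float = 0.05
) -> float:
    """Assumed stage-one form: for text i, positives[i] is the target and
    every other positives[j] in the batch acts as a negative."""
    if not texts or len(texts) != len(positives):
        raise ValueError("need a non-empty batch of aligned pairs")
    q = [normalize(v) for v in texts]
    p = [normalize(v) for v in positives]
    losses = []
    for i, qi in enumerate(q):
        logits = [dot(qi, pj) / temperature for pj in p]
        losses.append(softmax_cross_entropy(logits, i))
    return sum(losses) / len(losses)


def triplet_contrastive_loss(
    texts: Sequence[Vector],
    positives: Sequence[Vector],
    negatives: Sequence[Vector],
    temperature: float = 0.05,
) -> float:
    """Assumed stage-two form: softmax over the positive (target) and the
    explicit negative supplied with each triplet."""
    if not texts or not (len(texts) == len(positives) == len(negatives)):
        raise ValueError("need a non-empty batch of aligned triplets")
    losses = []
    for t, pos, neg in zip(texts, positives, negatives):
        t, pos, neg = normalize(t), normalize(pos), normalize(neg)
        logits = [dot(t, pos) / temperature, dot(t, neg) / temperature]
        losses.append(softmax_cross_entropy(logits, 0))
    return sum(losses) / len(losses)


# Toy 2-dimensional embeddings, made up for illustration.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

print(pair_contrastive_loss(texts, aligned) < pair_contrastive_loss(texts, shuffled))
print(triplet_contrastive_loss(texts, aligned, shuffled))
```

Both losses fall as each text moves closer to its positive and away from its negatives, which is the behavior a contrastive objective is meant to reward. A training implementation would compute the same quantities on batched tensors in a deep learning framework rather than on Python lists.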

Updated 5/15/2024