ember-v1

Maintainer: llmrails

Last updated 5/28/2024

  • Model Link: View on HuggingFace
  • API Spec: View on HuggingFace
  • Github Link: No Github link provided
  • Paper Link: No paper link provided

Model overview

The ember-v1 model is a powerful text embedding model developed by the team at LLMRails. The model has been trained on an extensive corpus of text pairs spanning a broad range of domains, including finance, science, medicine, law, and more. During training, the team incorporated techniques from the RetroMAE and SetFit research papers.

Compared to similar models like multilingual-e5-large, ember-v1 offers a more expansive training dataset and enhanced capabilities for handling diverse text. The upcoming v2 release will further extend the model's abilities by increasing the maximum sequence length to 4,000 tokens.

Model inputs and outputs

Inputs

  • Text sequences of up to 512 tokens

Outputs

  • Dense vector embeddings representing the semantic content of the input text
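
As a concrete illustration of this input/output contract, the sketch below encodes two short texts and prints the shape of the resulting embedding matrix. The HuggingFace model ID llmrails/ember-v1 and the sentence-transformers loading path are assumptions for illustration, not details stated on this page.

```python
# Minimal sketch of the input/output contract: short texts in, dense vectors out.
# Assumes the model is published on HuggingFace as "llmrails/ember-v1" and that
# sentence-transformers is installed (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("llmrails/ember-v1")

texts = [
    "The quarterly report shows a 12% increase in revenue.",
    "Patients in the treatment group showed improved outcomes.",
]

# Inputs longer than the 512-token limit are truncated by the tokenizer.
embeddings = model.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (2, embedding_dimension)
```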

Capabilities

The ember-v1 model excels at capturing the underlying meaning and context of text, making it a valuable tool for a variety of natural language processing tasks. Its robust performance across multiple domains allows it to be leveraged for applications such as information retrieval, text classification, and semantic search.

What can I use it for?

The ember-v1 model can be used in a wide range of projects that require understanding and processing text data. For example, you could use it to build intelligent search engines that return highly relevant results, or develop advanced chatbots and virtual assistants that can engage in more natural and contextual conversations.
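
For instance, a minimal semantic-search loop could rank a handful of documents against a query by cosine similarity, as in the sketch below; again, the llmrails/ember-v1 model ID and the sentence-transformers util.cos_sim helper are assumptions for illustration rather than anything prescribed by LLMRails.

```python
# A minimal semantic-search sketch with ember-v1 (model ID assumed, not confirmed here).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("llmrails/ember-v1")

documents = [
    "The central bank raised interest rates by 25 basis points.",
    "The patient was prescribed a course of antibiotics.",
    "The defendant filed a motion to dismiss the lawsuit.",
]
query = "What did the court filing ask for?"

doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode(query, normalize_embeddings=True)

# Rank documents by cosine similarity to the query and print the best match.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```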

The model's capabilities also lend themselves well to financial and legal applications, where the ability to accurately analyze and extract insights from large volumes of text is crucial. Researchers and healthcare professionals could leverage ember-v1 to streamline literature reviews, identify relevant medical studies, or assist in clinical decision-making.

Things to try

One interesting aspect of the ember-v1 model is its ability to handle text from diverse domains. Try experimenting with inputs from different fields, such as scientific papers, financial reports, or legal documents, to see how the model performs. You can also explore the model's capabilities in tasks like cross-domain retrieval, where you search for relevant information across multiple subject areas.

Another area to explore is the model's performance on longer text sequences. As the upcoming v2 release will extend the maximum sequence length, you could test the model's ability to capture the semantic context of lengthier passages, which could be particularly useful for applications like summarization or question-answering.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

multilingual-e5-large

Maintainer: intfloat

The multilingual-e5-large model is a large-scale multilingual text embedding model developed by the researcher intfloat. It is based on the XLM-RoBERTa-large model and has been continually trained on a mixture of multilingual datasets. The model supports 100 languages but may see performance degradation on low-resource languages.

Model inputs and outputs

Inputs

  • Text: The input can be a query or a passage, denoted by the prefixes "query:" and "passage:" respectively. The prefixes should be used even for non-English text.

Outputs

  • Embeddings: The model outputs 1024-dimensional text embeddings that capture the semantic information of the input text. The embeddings can be used for tasks like information retrieval, clustering, and similarity search.

Capabilities

The multilingual-e5-large model can encode text in 100 different languages. It generates high-quality text embeddings that preserve the semantic information of the input, making it useful for a variety of natural language processing tasks.

What can I use it for?

The multilingual-e5-large model can be used for tasks that require understanding and comparing text in multiple languages, such as:

  • Information retrieval: The text embeddings can be used to find relevant documents or passages for a given query, even across languages.
  • Semantic search: The embeddings can be used to identify similar text, enabling applications like recommendation systems or clustering.
  • Multilingual text analysis: The model can be used to analyze and compare text in different languages, for use cases like market research or cross-cultural studies.

Things to try

One interesting aspect of the multilingual-e5-large model is its ability to handle low-resource languages. While the model supports 100 languages, it may see some performance degradation on less commonly used languages. Developers could experiment with using the model for tasks in these languages and compare its effectiveness against other multilingual models.
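
The prefix convention above matters in practice; the sketch below scores one query against two prefixed passages. The sentence-transformers loading path and the intfloat/multilingual-e5-large model ID are assumptions for illustration rather than details taken from this summary.

```python
# Minimal sketch of the "query:"/"passage:" prefix convention with multilingual-e5-large.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

query = "query: how much protein should a female eat"
passages = [
    "passage: The recommended dietary allowance for protein is 46 grams per day for adult women.",
    "passage: La tour Eiffel se situe à Paris.",  # prefixes are required even for non-English text
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Cosine similarity should rank the relevant passage above the unrelated one.
print(util.cos_sim(q_emb, p_emb))
```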

all-MiniLM-L12-v2

Maintainer: sentence-transformers

The all-MiniLM-L12-v2 is a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space. This model can be used for tasks like clustering or semantic search. Similar models include all-mpnet-base-v2, a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, and paraphrase-multilingual-mpnet-base-v2, a multilingual sentence-transformers model.

Model inputs and outputs

Inputs

  • Sentences or paragraphs of text

Outputs

  • 384-dimensional dense vector representations of the input text

Capabilities

The all-MiniLM-L12-v2 model can be used for a variety of natural language processing tasks that benefit from semantic understanding of text, such as clustering, semantic search, and information retrieval. It captures the high-level meaning and context of sentences and paragraphs, allowing for more accurate matching and grouping of similar content.

What can I use it for?

The all-MiniLM-L12-v2 model is well-suited for applications that require semantic understanding of text, such as:

  • Semantic search: Use the model to encode queries and documents, then perform efficient nearest neighbor search to find the most relevant documents for a given query.
  • Text clustering: Cluster documents or paragraphs based on their semantic representations to group similar content together.
  • Recommendation systems: Encode items (e.g., articles, products) and user queries, then use the embeddings to find the most relevant recommendations.

Things to try

One interesting thing to try with the all-MiniLM-L12-v2 model is to experiment with different pooling methods (e.g., mean pooling, max pooling) to see how they affect performance on your specific task. The choice of pooling method can significantly affect the quality of the sentence and paragraph representations, so it is worth trying different approaches. Another idea is to fine-tune the model on your own dataset to further specialize the embeddings for your domain or application; the sentence-transformers library provides convenient tools for fine-tuning.
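
To experiment with pooling as suggested above, one option is to work from the raw token embeddings; the following sketch contrasts mean and max pooling. The transformers-based loading code and the embed helper are illustrative assumptions, not the library's prescribed recipe.

```python
# A rough sketch comparing pooling strategies over raw token embeddings, assuming
# the HuggingFace ID "sentence-transformers/all-MiniLM-L12-v2" plus transformers and torch.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L12-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

def embed(sentences, pooling="mean"):
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, 384)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    if pooling == "mean":
        # Average token embeddings, ignoring padding positions.
        return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    # Max pooling: element-wise maximum over non-padding tokens.
    return token_embeddings.masked_fill(mask == 0, -1e9).max(dim=1).values

print(embed(["A quick test sentence."], pooling="mean").shape)  # torch.Size([1, 384])
print(embed(["A quick test sentence."], pooling="max").shape)   # torch.Size([1, 384])
```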

all-MiniLM-L6-v2

Maintainer: sentence-transformers

The all-MiniLM-L6-v2 is a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space. This model can be used for tasks like clustering or semantic search. It was fine-tuned on a large dataset of over 1 billion sentence pairs using a contrastive learning objective. Similar models include all-MiniLM-L12-v2, which has a deeper 12-layer architecture, and all-mpnet-base-v2, which has a 768-dimensional output.

Model inputs and outputs

Inputs

  • Text input, such as a single sentence or short paragraph

Outputs

  • A 384-dimensional vector representation of the input text

Capabilities

The all-MiniLM-L6-v2 model encodes text into a dense vector space that captures semantic information. This allows it to be used for tasks like semantic search, where you can find relevant documents for a given query, or clustering, where you can group similar text together.

What can I use it for?

The all-MiniLM-L6-v2 model can be useful for a variety of natural language processing tasks that involve understanding the meaning of text. Some potential use cases include:

  • Semantic search: Use the model to encode queries and documents, then find the most relevant documents for a given query by computing cosine similarity between the query and document embeddings.
  • Text clustering: Cluster documents or sentences based on their vector representations to group similar content together.
  • Recommendation systems: Encode user queries or items (e.g., products, articles) into the vector space and use the distances between them to make personalized recommendations.
  • Data augmentation: Generate new text samples by finding similar sentences in the vector space and making minor modifications.

Things to try

Some interesting things to try with the all-MiniLM-L6-v2 model include:

  • Exploring the vector space: Visualize the vector representations of different text inputs to get a sense of how the model captures semantic relationships.
  • Zero-shot classification: Use the model to encode text and labels, then classify new inputs by computing cosine similarity between the input and label embeddings.
  • Multilingual applications: The model can be used for cross-lingual tasks by encoding texts in different languages into the same vector space.
  • Probing the model's capabilities: Design targeted evaluation tasks to better understand the model's strengths and weaknesses in representing different types of semantic information.
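
The zero-shot classification idea above can be tried in a few lines; this sketch scores a text against label embeddings by cosine similarity. The label set and the sentence-transformers calls are assumptions for illustration.

```python
# A minimal zero-shot classification sketch: pick the label whose embedding is
# most similar to the input text, assuming "sentence-transformers/all-MiniLM-L6-v2".
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

labels = ["sports", "politics", "technology"]
text = "The new GPU architecture doubles inference throughput."

label_embeddings = model.encode(labels, normalize_embeddings=True)
text_embedding = model.encode(text, normalize_embeddings=True)

scores = util.cos_sim(text_embedding, label_embeddings)[0]
print(labels[int(scores.argmax())])  # expected: "technology" for this example
```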

multilingual-e5-base

Maintainer: intfloat

The multilingual-e5-base is a text embedding model developed by researcher intfloat. It is a 12-layer model with an embedding size of 768, initialized from the xlm-roberta-base model and further trained on a mixture of multilingual datasets. The model supports 100 languages, although performance may degrade for low-resource languages.

The model was trained in two stages. In the first stage, it underwent contrastive pre-training with weak supervision, using a 1 billion text pair dataset filtered from the mC4 corpus. In the second stage, it was fine-tuned on various labeled datasets, including MS MARCO, NQ, Trivia QA, NLI from SimCSE, ELI5, DuReader Retrieval, KILT Fever, KILT HotpotQA, SQuAD, Quora, and multilingual datasets like Mr. TyDi and MIRACL.

Similar models include the multilingual-e5-large model, which has 24 layers and a 1024 embedding size, as well as the xlm-roberta-base model, a multilingual BERT model pre-trained on 2.5TB of filtered CommonCrawl data.

Model inputs and outputs

Inputs

  • Text: The model accepts text inputs, which should start with either a "query: " or "passage: " prefix, even for non-English texts. For tasks other than retrieval, you can simply use the "query: " prefix.

Outputs

  • Text embeddings: The model outputs 768-dimensional text embeddings that capture the semantic information of the input text. These embeddings can be used for a variety of downstream tasks, such as text retrieval, semantic similarity, and classification.

Capabilities

The multilingual-e5-base model can be used for a wide range of text understanding and retrieval tasks, thanks to its multilingual and robust text encoding capabilities. It has shown strong performance on benchmark tasks like passage ranking, as evidenced by its high MRR@10 scores on the Mr. TyDi dataset, outperforming baselines like BM25 and mDPR.

What can I use it for?

The multilingual-e5-base model can be used for a variety of applications, such as:

  • Information retrieval: The model can be used to encode queries and passages for passage ranking tasks, enabling cross-lingual and multilingual information retrieval.
  • Semantic similarity: The text embeddings produced by the model can be used to compute semantic similarity between text inputs, which can be useful for tasks like duplicate detection, paraphrase identification, and clustering.
  • Text classification: The model's text embeddings can be used as features for training text classification models, such as topic classification or sentiment analysis.

Things to try

One interesting aspect of the multilingual-e5-base model is its ability to handle non-English texts. Try experimenting with inputs in various languages and observe how the model performs. You can also explore the model's performance on different downstream tasks, such as cross-lingual question answering or multilingual document retrieval, to better understand its capabilities.

Another interesting experiment is to compare the multilingual-e5-base model to the larger multilingual-e5-large model, or to the xlm-roberta-base model, to see how model size and training data affect results on your specific use case.
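
One way to run the comparison suggested above is to score the same query/passage pair with both checkpoints; the sketch below does exactly that. The model IDs and the sentence-transformers usage are assumptions for illustration, not instructions from the model card.

```python
# A rough sketch comparing multilingual-e5-base against multilingual-e5-large on the
# same cross-lingual pair, assuming both HuggingFace IDs and sentence-transformers installed.
from sentence_transformers import SentenceTransformer, util

query = "query: what is the capital of France"
passage = "passage: Paris est la capitale de la France."

for model_id in ["intfloat/multilingual-e5-base", "intfloat/multilingual-e5-large"]:
    model = SentenceTransformer(model_id)
    q, p = model.encode([query, passage], normalize_embeddings=True)
    # Higher cosine similarity means the model judges the passage more relevant to the query.
    print(model_id, float(util.cos_sim(q, p)))
```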
