[](#entity-recognition-english-foundation-model-by-numind-)Entity Recognition English Foundation Model by NuMind 
=====================================================================================================================

This model provides great token embedding for the Entity Recognition task in English.

This is the prototype of the model from our [**Paper**](https://arxiv.org/abs/2402.15343): **NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data**

We suggest using **newer version of this model: [NuNER v1.0](https://huggingface.co/numind/NuNER-v1.0)** - this is the model reported in the paper.

**Checkout other models by NuMind:**

*   SOTA Multilingual Entity Recognition Foundation Model: [link](https://huggingface.co/numind/entity-recognition-multilingual-general-sota-v1)
*   SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)

[](#about)About
---------------

[Roberta-base](https://huggingface.co/roberta-base) fine-tuned on [NuNER data](https://huggingface.co/datasets/numind/NuNER).

**Metrics:**

Read more about evaluation protocol & datasets in our [paper](https://arxiv.org/abs/2402.15343) and [blog post](https://www.numind.ai/blog/a-foundation-model-for-entity-recognition).

Model

F1 macro

RoBERTa-base

0.7129

ours

0.7500

ours + two emb

0.7686

[](#usage)Usage
---------------

Embeddings can be used out of the box or fine-tuned on specific datasets.

Get embeddings:

    import torch
    import transformers
    
    
    model = transformers.AutoModel.from_pretrained(
        'numind/NuNER-v0.1',
        output_hidden_states=True
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        'numind/NuNER-v0.1'
    )
    
    text = [
        "NuMind is an AI company based in Paris and USA.",
        "See other models from us on https://huggingface.co/numind"
    ]
    encoded_input = tokenizer(
        text,
        return_tensors='pt',
        padding=True,
        truncation=True
    )
    output = model(**encoded_input)
    
    # for better quality
    emb = torch.cat(
        (output.hidden_states[-1], output.hidden_states[-7]),
        dim=2
    )
    
    # for better speed
    # emb = output.hidden_states[-1]
    

[](#citation)Citation
---------------------

    @misc{bogdanov2024nuner,
          title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data}, 
          author={Sergei Bogdanov and Alexandre Constantin and Timothe Bernard and Benoit Crabb and Etienne Bernard},
          year={2024},
          eprint={2402.15343},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }

## Model overview

The `NuNER-v0.1` model is an English language entity recognition model fine-tuned from the RoBERTa-base model by the team at [NuMind](https://aimodels.fyi/creators/huggingFace/numind). This model provides strong token embeddings for entity recognition tasks in English. It was the prototype for the [NuNER v1.0 model](https://huggingface.co/numind/NuNER-v1.0), which is the version reported in the [paper](https://arxiv.org/abs/2402.15343) introducing the model.

The `NuNER-v0.1` model outperforms the base RoBERTa-base model on entity recognition, achieving an F1 macro score of 0.7500 compared to 0.7129 for RoBERTa-base. Combining the last and second-to-last hidden states further improves performance to 0.7686 F1 macro.

Other notable entity recognition models include [bert-base-NER](https://aimodels.fyi/models/huggingFace/bert-base-ner-dslim), a BERT-base model fine-tuned on the CoNLL-2003 dataset, and [roberta-large-ner-english](https://aimodels.fyi/models/huggingFace/roberta-large-ner-english-jean-baptiste), a RoBERTa-large model fine-tuned for English NER.

## Model inputs and outputs

### Inputs
- **Text**: The model takes in raw text as input, which it then tokenizes and encodes for processing.

### Outputs
- **Entity predictions**: The model outputs a sequence of entity predictions for the input text, classifying each token as belonging to one of the four entity types: location (LOC), organization (ORG), person (PER), or miscellaneous (MISC).
- **Token embeddings**: The model can also be used to extract token-level embeddings, which can be useful for downstream tasks. The author suggests using the concatenation of the last and second-to-last hidden states for better quality embeddings.

## Capabilities

The `NuNER-v0.1` model is highly capable at recognizing entities in English text, surpassing the base RoBERTa model on the CoNLL-2003 NER dataset. It can accurately identify locations, organizations, people, and miscellaneous entities within input text. This makes it a powerful tool for applications that require understanding the entities mentioned in documents, such as information extraction, knowledge graph construction, or content analysis.

## What can I use it for?

The `NuNER-v0.1` model can be used for a variety of applications that involve identifying and extracting entities from English text. Some potential use cases include:

- **Information Extraction**: The model can be used to automatically extract key entities (people, organizations, locations, etc.) from documents, articles, or other text-based data sources.
- **Knowledge Graph Construction**: The entity predictions from the model can be used to populate a knowledge graph with structured information about the entities mentioned in a corpus.
- **Content Analysis**: By understanding the entities present in text, the model can enable more sophisticated content analysis tasks, such as topic modeling, sentiment analysis, or text summarization.
- **Chatbots and Virtual Assistants**: The entity recognition capabilities of the model can be leveraged to improve the natural language understanding of chatbots and virtual assistants, allowing them to better comprehend user queries and respond appropriately.

## Things to try

One interesting aspect of the `NuNER-v0.1` model is its ability to produce high-quality token embeddings by concatenating the last and second-to-last hidden states. These embeddings could be used as input features for a wide range of downstream NLP tasks, such as text classification, named entity recognition, or relation extraction. Experimenting with different ways of utilizing these embeddings, such as fine-tuning on domain-specific datasets or combining them with other model architectures, could lead to exciting new applications and performance improvements.

Another avenue to explore would be comparing the `NuNER-v0.1` model's performance on different types of text data, beyond the news-based CoNLL-2003 dataset used for evaluation. Trying the model on more informal, conversational text (e.g., social media, emails, chat logs) could uncover interesting insights about its generalization capabilities and potential areas for improvement.