[](#structure-extraction-model-by-numind-)Structure Extraction Model by NuMind 
===================================================================================

NuExtract is a version of [phi-3-mini](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), fine-tuned on a private high-quality synthetic dataset for information extraction. To use the model, provide an input text (less than 2000 tokens) and a JSON template describing the information you need to extract.

Note: This model is purely extractive, so all text output by the model is present as is in the original text. You can also provide an example of output formatting to help the model understand your task more precisely.

Try it here: [https://huggingface.co/spaces/numind/NuExtract](https://huggingface.co/spaces/numind/NuExtract)

We also provide a tiny (0.5B) and a large (7B) version of this model: [NuExtract-tiny](https://huggingface.co/numind/NuExtract-tiny) and [NuExtract-large](https://huggingface.co/numind/NuExtract-large)

**Check out other models by NuMind:**

*   SOTA Zero-shot NER Model [NuNER Zero](https://huggingface.co/numind/NuNER_Zero)
*   SOTA Multilingual Entity Recognition Foundation Model: [link](https://huggingface.co/numind/entity-recognition-multilingual-general-sota-v1)
*   SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)

[](#benchmark)Benchmark
-----------------------

Zero-shot benchmark (to be released soon):

![](/numind/NuExtract/resolve/main/result.png)

Fine-tuning benchmark (see blog post):

![](/numind/NuExtract/resolve/main/result_ft.png)

[](#usage)Usage
---------------

To use the model:

    import json
    import torch  # needed for torch.bfloat16 below
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    
    def predict_NuExtract(model, tokenizer, text, schema, example=["", "", ""]):
        schema = json.dumps(json.loads(schema), indent=4)
        input_llm =  "<|input|>\n### Template:\n" +  schema + "\n"
        for i in example:
          if i != "":
              input_llm += "### Example:\n"+ json.dumps(json.loads(i), indent=4)+"\n"
        
        input_llm +=  "### Text:\n"+text +"\n<|output|>\n"
        input_ids = tokenizer(input_llm, return_tensors="pt",truncation = True, max_length=4000).to("cuda")
    
        output = tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True)
        return output.split("<|output|>")[1].split("<|end-output|>")[0]
    
    
    # We recommend using bf16 as it results in negligible performance loss
    model = AutoModelForCausalLM.from_pretrained("numind/NuExtract", torch_dtype=torch.bfloat16, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract", trust_remote_code=True)
    
    model.to("cuda")
    
    model.eval()
    
    text = """We introduce Mistral 7B, a 7-billion-parameter language model engineered for
    superior performance and efficiency. Mistral 7B outperforms the best open 13B
    model (Llama 2) across all evaluated benchmarks, and the best released 34B
    model (Llama 1) in reasoning, mathematics, and code generation. Our model
    leverages grouped-query attention (GQA) for faster inference, coupled with sliding
    window attention (SWA) to effectively handle sequences of arbitrary length with a
    reduced inference cost. We also provide a model fine-tuned to follow instructions,
    Mistral 7B - Instruct, that surpasses Llama 2 13B - chat model both on human and
    automated benchmarks. Our models are released under the Apache 2.0 license.
    Code: https://github.com/mistralai/mistral-src
    Webpage: https://mistral.ai/news/announcing-mistral-7b/"""
    
    schema = """{
        "Model": {
            "Name": "",
            "Number of parameters": "",
            "Number of max token": "",
            "Architecture": []
        },
        "Usage": {
            "Use case": [],
            "Licence": ""
        }
    }"""
    
    prediction = predict_NuExtract(model, tokenizer, text, schema, example=["","",""])
    print(prediction)
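
The string returned by `predict_NuExtract` is expected to be JSON text that follows the template. A minimal sketch for turning it into a Python dictionary, assuming the output is well-formed JSON apart from surrounding whitespace. `parse_prediction` is a hypothetical helper, not part of the model's API; it returns `None` instead of raising when the text cannot be parsed:

```python
import json
from typing import Any, Dict, Optional


def parse_prediction(prediction: str) -> Optional[Dict[str, Any]]:
    """Parse the model's text output into a dict, or return None on failure."""
    try:
        parsed = json.loads(prediction.strip())
    except json.JSONDecodeError:
        return None
    # The template is a JSON object, so anything else is treated as a failure.
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` keeps the calling code simple when a generation is cut off before the closing brace; a stricter pipeline might prefer to log and re-raise instead.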

## Model overview

`NuExtract` is a version of the [phi-3-mini](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) model, fine-tuned by [NuMind](https://aimodels.fyi/creators/huggingFace/numind) on a private high-quality synthetic dataset for information extraction tasks. Compared to the base model, `NuExtract` is tailored for extracting specific information from input text. Similar models from NuMind include the larger [NuExtract-large](https://aimodels.fyi/models/huggingFace/nuextract-large-numind) and the smaller [NuExtract-tiny](https://aimodels.fyi/models/huggingFace/nuextract-tiny-numind).

## Model inputs and outputs

The `NuExtract` model takes two main inputs: a text passage (up to 2000 tokens) and a JSON template describing the information to extract. The model is purely extractive, meaning its output will consist of text directly present in the original input. Users can also provide an example output format to help the model understand the task more precisely.

### Inputs
- **Text passage**: A text document up to 2000 tokens in length
- **JSON template**: A JSON object describing the information to extract from the text

### Outputs
- **Extracted information**: The relevant text from the input passage, formatted according to the provided JSON template or example
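
Because the model is described as purely extractive, one simple sanity check is to confirm that every extracted string occurs verbatim in the input. The sketch below assumes the output has already been parsed into nested dicts and lists; `iter_leaf_strings` and `non_verbatim_values` are hypothetical helper names introduced here for illustration:

```python
from typing import Any, Iterator, List


def iter_leaf_strings(node: Any) -> Iterator[str]:
    """Yield every non-empty string found in a nested dict/list structure."""
    if isinstance(node, str):
        if node:
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_leaf_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_leaf_strings(item)


def non_verbatim_values(extracted: Any, source_text: str) -> List[str]:
    """Return extracted strings that do not occur verbatim in source_text."""
    return [s for s in iter_leaf_strings(extracted) if s not in source_text]
```

An empty result means every value was copied from the input. Note that this is a strict substring test, so a value that spans a line break in the source may be flagged unless whitespace is normalized first.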

## Capabilities

The `NuExtract` model is designed to extract specific pieces of information from input text. It can handle a variety of extraction tasks, such as pulling key facts, entities, or other structured data from documents. Fine-tuning the base phi-3-mini model on a dedicated synthetic dataset specializes `NuExtract` for this type of information extraction.

## What can I use it for?

The `NuExtract` model could be useful for any application that requires extracting structured data from text, such as:

- Automating information retrieval from business documents or reports
- Populating databases or knowledge graphs from unstructured data sources
- Powering intelligent search or question-answering systems
- Extracting key details from technical or scientific papers, one passage at a time to stay within the input limit

Since `NuExtract` is a fine-tuned version of a larger language model, it can also serve as a starting point for further customization and fine-tuning to meet the needs of specific domains or use cases.

## Things to try

One interesting aspect of `NuExtract` is that the JSON template, any optional examples, and the input text are all passed to the model in a single prompt. This leaves room to experiment with how the extraction task is specified: users can try different template layouts or provide examples to guide the model's output. Developers could also explore combining `NuExtract` with other NuMind models, such as the [SOTA Multilingual Entity Recognition Foundation Model](https://aimodels.fyi/models/huggingFace/nuner-multilingual-v01-numind), to tackle more complex information extraction challenges.
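
To experiment with templates and examples without loading the model, the prompt can be assembled and inspected on its own. This sketch mirrors the string layout used in the usage snippet; `build_prompt` is a hypothetical name, and the example JSON below is made up for illustration:

```python
import json
from typing import Sequence


def build_prompt(text: str, template: str, examples: Sequence[str] = ()) -> str:
    """Assemble the prompt in the layout used by the usage snippet."""
    prompt = "<|input|>\n### Template:\n" + json.dumps(json.loads(template), indent=4) + "\n"
    for example in examples:
        if example:  # skip empty placeholders
            prompt += "### Example:\n" + json.dumps(json.loads(example), indent=4) + "\n"
    return prompt + "### Text:\n" + text + "\n<|output|>\n"


template = '{"Model": {"Name": "", "Architecture": []}}'
# Made-up example output, only to show where examples land in the prompt.
example = '{"Model": {"Name": "Llama 2", "Architecture": ["transformer"]}}'
prompt = build_prompt("We introduce Mistral 7B, a language model.", template, [example])
print(prompt)
```

Printing the prompt makes it easy to compare how different templates or examples change what the model actually sees.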

[](#structure-extraction-model-by-numind-)Structure Extraction Model by NuMind 
===================================================================================

NuExtract-large is a version of [phi-3-small](https://huggingface.co/microsoft/Phi-3-small-8k-instruct), fine-tuned on a private high-quality synthetic dataset for information extraction. To use the model, provide an input text (less than 2000 tokens) and a JSON template describing the information you need to extract.

Note: This model is purely extractive, so all text output by the model is present as is in the original text. You can also provide an example of output formatting to help the model understand your task more precisely.

Try the base model here: [https://huggingface.co/spaces/numind/NuExtract](https://huggingface.co/spaces/numind/NuExtract)

We also provide a tiny (0.5B) and a base (3.8B) version of this model: [NuExtract-tiny](https://huggingface.co/numind/NuExtract-tiny) and [NuExtract](https://huggingface.co/numind/NuExtract)

**Check out other models by NuMind:**

*   SOTA Zero-shot NER Model [NuNER Zero](https://huggingface.co/numind/NuNER_Zero)
*   SOTA Multilingual Entity Recognition Foundation Model: [link](https://huggingface.co/numind/entity-recognition-multilingual-general-sota-v1)
*   SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)

[](#benchmark)Benchmark
-----------------------

Zero-shot benchmark (to be released soon):

![](/numind/NuExtract-large/resolve/main/result.png)

Fine-tuning benchmark (see blog post):

![](/numind/NuExtract-large/resolve/main/result_ft.png)

[](#usage)Usage
---------------

To use the model:

    import json
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
    
    
    def predict_NuExtract(model, tokenizer, text, schema, example=["","",""]):
        schema = json.dumps(json.loads(schema), indent=4)
        input_llm =  "<|input|>\n### Template:\n" +  schema + "\n"
        for i in example:
          if i != "":
              input_llm += "### Example:\n"+ json.dumps(json.loads(i), indent=4)+"\n"
        
        input_llm +=  "### Text:\n"+text +"\n<|output|>\n"
        input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to("cuda")
    
        output = tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True)
        return output.split("<|output|>")[1].split("<|end-output|>")[0]
    
    
    model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-large", trust_remote_code=True, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract-large", trust_remote_code=True)
    
    model.to("cuda")
    
    model.eval()
    
    text = """We introduce Mistral 7B, a 7-billion-parameter language model engineered for
    superior performance and efficiency. Mistral 7B outperforms the best open 13B
    model (Llama 2) across all evaluated benchmarks, and the best released 34B
    model (Llama 1) in reasoning, mathematics, and code generation. Our model
    leverages grouped-query attention (GQA) for faster inference, coupled with sliding
    window attention (SWA) to effectively handle sequences of arbitrary length with a
    reduced inference cost. We also provide a model fine-tuned to follow instructions,
    Mistral 7B - Instruct, that surpasses Llama 2 13B - chat model both on human and
    automated benchmarks. Our models are released under the Apache 2.0 license.
    Code: https://github.com/mistralai/mistral-src
    Webpage: https://mistral.ai/news/announcing-mistral-7b/"""
    
    schema = """{
        "Model": {
            "Name": "",
            "Number of parameters": "",
            "Number of token": "",
            "Architecture": []
        },
        "Usage": {
            "Use case": [],
            "Licence": ""
        }
    }"""
    
    prediction = predict_NuExtract(model, tokenizer, text, schema, example=["","",""])
    print(prediction)

## Model overview

`NuExtract-large` is a version of the [Phi-3-small](https://huggingface.co/microsoft/Phi-3-small-8k-instruct) model, fine-tuned by [NuMind](https://aimodels.fyi/creators/huggingFace/numind) on a private high-quality synthetic dataset for information extraction. It is a text-to-text model designed for extracting structured information from input text.

Compared to similar models like [NuNER-v0.1](https://aimodels.fyi/models/huggingFace/nuner-v01-numind) and [NuNER-multilingual-v0.1](https://aimodels.fyi/models/huggingFace/nuner-multilingual-v01-numind), which focus on entity recognition, `NuExtract-large` is specialized for more general information extraction tasks. It can extract relevant information from input text based on a provided JSON template.

## Model inputs and outputs

`NuExtract-large` is a text-to-text model: it takes input text and a JSON template, and generates the extracted information as output.

### Inputs
- **Input text**: The input text can be up to 2000 tokens long. It contains the information that the model will extract from.
- **JSON template**: A JSON template that describes the information the user wants to extract from the input text.
- **Example output**: An optional example of the desired output formatting to help the model understand the task.
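
Since the input text is limited to roughly 2000 tokens, it can help to check the length before calling the model. A minimal sketch, assuming any callable that maps a string to a sequence of tokens; `fits_token_budget` is a hypothetical helper, and the whitespace tokenizer is only a stand-in for the real one:

```python
from typing import Callable, Sequence


def fits_token_budget(text: str, tokenize: Callable[[str], Sequence], limit: int = 2000) -> bool:
    """Return True if `text` tokenizes to fewer than `limit` tokens."""
    return len(tokenize(text)) < limit


# Stand-in tokenizer for illustration only: splits on whitespace.
# With the real model, pass something like `lambda t: tokenizer(t)["input_ids"]`.
whitespace_tokenize = str.split
print(fits_token_budget("NuMind is an AI company.", whitespace_tokenize))
```

Whitespace counts usually underestimate subword token counts, so the model's own tokenizer should be used for any real check.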

### Outputs
- **Extracted information**: The model's attempt at extracting the requested information from the input text, formatted according to the provided JSON template.

## Capabilities

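`NuExtract-large` is capable of extracting structured information from input text based on a provided template. It can handle a variety of information extraction tasks, from extracting key entities and facts to pulling the relevant details out of longer passages of text. Because the model is purely extractive, the output reuses text from the input rather than paraphrasing it.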

The model's fine-tuning on a high-quality synthetic dataset gives it strong performance on information extraction, as evidenced by its benchmarked results. It outperforms the base `Phi-3-small` model on these tasks.

## What can I use it for?

`NuExtract-large` could be useful for a variety of applications that require extracting structured information from text, such as:

- Automating data entry from documents or web pages
- Reducing passages of text to their key facts and entities
- Powering intelligent search and question-answering systems
- Streamlining business processes by extracting relevant information

Companies could potentially monetize `NuExtract-large` by building applications and services that leverage its information extraction capabilities, such as [NuExtract](https://huggingface.co/spaces/numind/NuExtract) from the model's maintainer [NuMind](https://aimodels.fyi/creators/huggingFace/numind).

## Things to try

One interesting thing to try with `NuExtract-large` is using it to extract information from longer, more complex input texts, within the 2000-token input limit. The model's fine-tuning on a high-quality dataset suggests it may be able to handle these types of inputs well, going beyond simple entity extraction to pull out key facts and relationships.

Another idea is to experiment with providing different levels of detail in the JSON template and example output to see how it affects the model's performance. This could help refine the template and instructions to get the most accurate extractions for your specific use case.
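
When trying several levels of template detail, it can be convenient to start from a filled-in example output and derive the empty template from it. The sketch below assumes the template convention visible in the usage snippet (empty strings for single values, empty lists for multi-valued fields); `blank_template` is a hypothetical helper:

```python
import json
from typing import Any


def blank_template(example: Any) -> Any:
    """Replace every leaf value with an empty placeholder, keeping the structure."""
    if isinstance(example, dict):
        return {key: blank_template(value) for key, value in example.items()}
    if isinstance(example, list):
        return []
    return ""


example_output = {
    "Model": {"Name": "Mistral 7B", "Architecture": ["grouped-query attention"]},
    "Usage": {"Licence": "Apache 2.0"},
}
print(json.dumps(blank_template(example_output), indent=4))
```

Deriving both from one source keeps the template and the example structurally identical, which removes one variable when comparing extraction quality.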

[](#sota-entity-recognition-multilingual-foundation-model-by-numind-)SOTA Entity Recognition Multilingual Foundation Model by NuMind 
=========================================================================================================================================

This model provides the best embedding for the Entity Recognition task and supports 9+ languages.

**Check out other models by NuMind:**

*   SOTA Entity Recognition Foundation Model in English: [link](https://huggingface.co/numind/entity-recognition-general-sota-v1)
*   SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)

[](#about)About
---------------

[Multilingual BERT](https://huggingface.co/bert-base-multilingual-cased) fine-tuned on an artificially annotated multilingual subset of the [OSCAR dataset](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201). This model provides domain- and language-independent embeddings for the Entity Recognition task. We fine-tuned it on only 9 languages, but the model can generalize to other languages supported by Multilingual BERT.

**Metrics:**

Read more about evaluation protocol & datasets in our [blog post](https://www.numind.ai/blog/a-foundation-model-for-entity-recognition)

| Model | F1 macro |
|---|---|
| bert-base-multilingual-cased | 0.5206 |
| ours | 0.5892 |
| ours + two emb | 0.6231 |

[](#usage)Usage
---------------

Embeddings can be used out of the box or fine-tuned on specific datasets.

Get embeddings:

    import torch
    import transformers
    
    
    model = transformers.AutoModel.from_pretrained(
        'numind/NuNER-multilingual-v0.1',
        output_hidden_states=True,
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        'numind/NuNER-multilingual-v0.1',
    )
    
    text = [
        "NuMind is an AI company based in Paris and USA.",
        "NuMind est une entreprise d'IA basée à Paris et aux États-Unis.",
        "See other models from us on https://huggingface.co/numind"
    ]
    encoded_input = tokenizer(
        text,
        return_tensors='pt',
        padding=True,
        truncation=True
    )
    output = model(**encoded_input)
    
    # two emb trick: for better quality
    emb = torch.cat(
        (output.hidden_states[-1], output.hidden_states[-7]),
        dim=2
    )
    
    # single emb: for better speed
    # emb = output.hidden_states[-1]
    

[](#citation)Citation
---------------------

    @misc{bogdanov2024nuner,
          title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data}, 
          author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
          year={2024},
          eprint={2402.15343},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }

## Model overview

The `NuNER-multilingual-v0.1` model is a multilingual entity recognition foundation model developed by NuMind. It is built on top of the Multilingual BERT (mBERT) model and has been fine-tuned on an artificially annotated multilingual subset of the OSCAR dataset. This model provides domain- and language-independent embeddings for the entity recognition task. It was fine-tuned on 9 languages, but can generalize to other languages supported by mBERT.

Compared to the base mBERT model, the `NuNER-multilingual-v0.1` model demonstrates superior performance, with an F1 macro score of 0.5892 versus 0.5206 for mBERT. Additionally, by using a "two emb trick" technique, the model's performance can be further improved to an F1 macro score of 0.6231.
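
To put these scores in perspective, the relative gain over the mBERT baseline can be computed directly from the reported F1 macro values. This is plain arithmetic on the numbers above, not an additional benchmark; `relative_gain` is a hypothetical helper:

```python
def relative_gain(baseline: float, score: float) -> float:
    """Relative improvement of `score` over `baseline`, as a fraction."""
    return (score - baseline) / baseline


# F1 macro values as reported in the metrics above.
mbert, nuner, nuner_two_emb = 0.5206, 0.5892, 0.6231
print(f"NuNER vs mBERT:           {relative_gain(mbert, nuner):.1%}")
print(f"NuNER + two emb vs mBERT: {relative_gain(mbert, nuner_two_emb):.1%}")
```

This works out to a relative improvement of about 13% for the fine-tuned model and about 20% with the two-embedding variant.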

## Model inputs and outputs

### Inputs
- Textual data in one of the supported languages

### Outputs
- Embeddings that can be used for downstream entity recognition tasks

## Capabilities

The `NuNER-multilingual-v0.1` model excels at providing high-quality embeddings for the entity recognition task, with the ability to generalize across different languages and domains. This makes it a valuable tool for a wide range of natural language processing applications, including named entity recognition, knowledge extraction, and information retrieval.

## What can I use it for?

The `NuNER-multilingual-v0.1` model can be leveraged in various use cases, such as:

- Developing multilingual information extraction systems
- Building knowledge graphs and knowledge bases from unstructured text
- Enhancing search and recommendation engines with entity-based features
- Improving chatbots and virtual assistants with better understanding of named entities

## Things to try

One interesting aspect of the `NuNER-multilingual-v0.1` model is the "two emb trick", which can improve the quality of the embeddings. It concatenates the last hidden state with an intermediate one (`output.hidden_states[-1]` and `output.hidden_states[-7]` in the usage snippet). In NuMind's evaluation this raises the F1 macro score from 0.5892 to 0.6231, at the cost of embeddings that are twice as wide.
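
The usage snippet indexes `output.hidden_states[-1]` and `output.hidden_states[-7]`. A small sketch of which layers those are, assuming a 12-layer encoder (as in BERT-base) and that `hidden_states` holds the embedding output followed by one entry per layer, which is how `transformers` documents `output_hidden_states=True`. `resolve_hidden_state_index` is a hypothetical helper:

```python
def resolve_hidden_state_index(index: int, num_layers: int = 12) -> int:
    """Map a (possibly negative) hidden_states index to a layer number.

    0 is the embedding output; k >= 1 is the output of the k-th encoder layer.
    """
    total = num_layers + 1  # embedding output + one entry per layer
    resolved = index if index >= 0 else total + index
    if not 0 <= resolved < total:
        raise IndexError(f"index {index} out of range for {total} hidden states")
    return resolved


print(resolve_hidden_state_index(-1))  # last encoder layer
print(resolve_hidden_state_index(-7))  # a middle layer
```

Under these assumptions, `-1` is the output of layer 12 and `-7` is the output of layer 6, so the trick pairs the final representation with one from the middle of the network.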

[](#entity-recognition-english-foundation-model-by-numind-)Entity Recognition English Foundation Model by NuMind 
=====================================================================================================================

This model provides great token embeddings for the Entity Recognition task in English.

This is the prototype of the model from our [**Paper**](https://arxiv.org/abs/2402.15343): **NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data**

We suggest using the **newer version of this model: [NuNER v1.0](https://huggingface.co/numind/NuNER-v1.0)**, which is the model reported in the paper.

**Check out other models by NuMind:**

*   SOTA Multilingual Entity Recognition Foundation Model: [link](https://huggingface.co/numind/entity-recognition-multilingual-general-sota-v1)
*   SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)

[](#about)About
---------------

[Roberta-base](https://huggingface.co/roberta-base) fine-tuned on [NuNER data](https://huggingface.co/datasets/numind/NuNER).

**Metrics:**

Read more about evaluation protocol & datasets in our [paper](https://arxiv.org/abs/2402.15343) and [blog post](https://www.numind.ai/blog/a-foundation-model-for-entity-recognition).

| Model | F1 macro |
|---|---|
| RoBERTa-base | 0.7129 |
| ours | 0.7500 |
| ours + two emb | 0.7686 |

[](#usage)Usage
---------------

Embeddings can be used out of the box or fine-tuned on specific datasets.

Get embeddings:

    import torch
    import transformers
    
    
    model = transformers.AutoModel.from_pretrained(
        'numind/NuNER-v0.1',
        output_hidden_states=True
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        'numind/NuNER-v0.1'
    )
    
    text = [
        "NuMind is an AI company based in Paris and USA.",
        "See other models from us on https://huggingface.co/numind"
    ]
    encoded_input = tokenizer(
        text,
        return_tensors='pt',
        padding=True,
        truncation=True
    )
    output = model(**encoded_input)
    
    # for better quality
    emb = torch.cat(
        (output.hidden_states[-1], output.hidden_states[-7]),
        dim=2
    )
    
    # for better speed
    # emb = output.hidden_states[-1]
    

[](#citation)Citation
---------------------

    @misc{bogdanov2024nuner,
          title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data}, 
          author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
          year={2024},
          eprint={2402.15343},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }

## Model overview

The `NuNER-v0.1` model is an English language entity recognition model fine-tuned from the RoBERTa-base model by the team at [NuMind](https://aimodels.fyi/creators/huggingFace/numind). This model provides strong token embeddings for entity recognition tasks in English. It was the prototype for the [NuNER v1.0 model](https://huggingface.co/numind/NuNER-v1.0), which is the version reported in the [paper](https://arxiv.org/abs/2402.15343) introducing the model.

The `NuNER-v0.1` model outperforms the base RoBERTa-base model on entity recognition, achieving an F1 macro score of 0.7500 compared to 0.7129 for RoBERTa-base. Concatenating the last hidden state with an intermediate one (`output.hidden_states[-7]` in the usage snippet) further improves performance to 0.7686 F1 macro.

Other notable entity recognition models include [bert-base-NER](https://aimodels.fyi/models/huggingFace/bert-base-ner-dslim), a BERT-base model fine-tuned on the CoNLL-2003 dataset, and [roberta-large-ner-english](https://aimodels.fyi/models/huggingFace/roberta-large-ner-english-jean-baptiste), a RoBERTa-large model fine-tuned for English NER.

## Model inputs and outputs

### Inputs
- **Text**: The model takes in raw text as input, which it then tokenizes and encodes for processing.

### Outputs
- **Token embeddings**: The usage shown above returns token-level embeddings, which can be used out of the box or fine-tuned on specific datasets. The authors suggest concatenating the last hidden state with an intermediate one (`output.hidden_states[-1]` and `output.hidden_states[-7]`) for better quality embeddings.
- **Entity predictions (after fine-tuning)**: The model does not return entity labels in the usage shown above. Classifying each token into entity types such as location (LOC), organization (ORG), person (PER), or miscellaneous (MISC) requires fine-tuning the model or training a classifier on top of the embeddings.

## Capabilities

The `NuNER-v0.1` model provides token embeddings for entity recognition in English text. In NuMind's evaluation it scores higher than the base RoBERTa model (F1 macro 0.7500 versus 0.7129; see the paper and blog post for the protocol and datasets). Once fine-tuned or paired with a classifier, these embeddings can support applications that require understanding the entities mentioned in documents, such as information extraction, knowledge graph construction, or content analysis.

## What can I use it for?

The `NuNER-v0.1` model can be used for a variety of applications that involve identifying and extracting entities from English text. Some potential use cases include:

- **Information Extraction**: The model can be used to automatically extract key entities (people, organizations, locations, etc.) from documents, articles, or other text-based data sources.
- **Knowledge Graph Construction**: The entity predictions from the model can be used to populate a knowledge graph with structured information about the entities mentioned in a corpus.
- **Content Analysis**: By understanding the entities present in text, the model can enable more sophisticated content analysis tasks, such as topic modeling, sentiment analysis, or text summarization.
- **Chatbots and Virtual Assistants**: The entity recognition capabilities of the model can be leveraged to improve the natural language understanding of chatbots and virtual assistants, allowing them to better comprehend user queries and respond appropriately.

## Things to try

One interesting aspect of the `NuNER-v0.1` model is its ability to produce high-quality token embeddings by concatenating the last hidden state with an intermediate one (`output.hidden_states[-1]` and `output.hidden_states[-7]` in the usage snippet). These embeddings could be used as input features for a wide range of downstream NLP tasks, such as text classification, named entity recognition, or relation extraction. Experimenting with different ways of using these embeddings, such as fine-tuning on domain-specific datasets or combining them with other model architectures, could lead to new applications and performance improvements.
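
Token embeddings are produced per subword, while many downstream tasks label whole words. One common approach, sketched here as an assumption rather than as NuMind's method, is to average the subword vectors of each word. Fast tokenizers in `transformers` expose the token-to-word mapping through `encoded_input.word_ids(batch_index)`; `pool_by_word` is a hypothetical helper, and the toy vectors and tokenization below are made up:

```python
from typing import Dict, List, Optional, Sequence


def pool_by_word(
    token_vectors: Sequence[Sequence[float]],
    word_ids: Sequence[Optional[int]],
) -> List[List[float]]:
    """Average subword vectors that belong to the same word.

    `word_ids[i]` is the word index of token i, or None for special tokens.
    """
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for vector, word_id in zip(token_vectors, word_ids):
        if word_id is None:
            continue  # skip special and padding tokens
        if word_id not in sums:
            sums[word_id] = [0.0] * len(vector)
            counts[word_id] = 0
        sums[word_id] = [a + b for a, b in zip(sums[word_id], vector)]
        counts[word_id] += 1
    return [[x / counts[w] for x in sums[w]] for w in sorted(sums)]


# Toy 2-dimensional vectors for: <s>, "Nu", "Mind", "rocks", </s>
vectors = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [4.0, 4.0], [9.0, 9.0]]
print(pool_by_word(vectors, [None, 0, 0, 1, None]))
```

With real model output, each row of `emb[batch_index]` converted to a list would play the role of a token vector. Taking the first subword instead of the mean is another common choice worth comparing.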

Another avenue to explore would be comparing the `NuNER-v0.1` model's performance on different types of text data, beyond the datasets used in NuMind's evaluation. Trying the model on more informal, conversational text (e.g., social media, emails, chat logs) could uncover interesting insights about its generalization capabilities and potential areas for improvement.