Nvidia
Models by this creator
🐍
NVLM-D-72B
621
NVLM-D-72B is a frontier-class multimodal large language model (LLM) developed by NVIDIA. It achieves state-of-the-art results on vision-language tasks, rivaling leading proprietary models like GPT-4o and open-access models like Llama 3-V 405B and InternVL2. Remarkably, NVLM-D-72B shows improved text-only performance over its LLM backbone after multimodal training.

**Model Inputs and Outputs**

NVLM-D-72B is a decoder-only multimodal LLM that can take both text and images as inputs. The model outputs are primarily text, allowing it to excel at vision-language tasks like visual question answering, image captioning, and image-text retrieval.

Inputs
* **Text**: The model can take text inputs of up to 8,000 characters.
* **Images**: The model can accept image inputs in addition to text.

Outputs
* **Text**: The model generates text outputs, which can be used for a variety of vision-language tasks.

**Capabilities**

NVLM-D-72B demonstrates strong performance on a range of multimodal benchmarks, including MMMU, MathVista, OCRBench, AI2D, ChartQA, DocVQA, TextVQA, RealWorldQA, and VQAv2. It outperforms many leading models in these areas, making it a powerful tool for vision-language applications.

**What can I use it for?**

NVLM-D-72B is well-suited for a variety of vision-language applications, such as:

* **Visual Question Answering**: The model can answer questions about the content and context of an image.
* **Image Captioning**: The model can generate detailed captions describing the contents of an image.
* **Image-Text Retrieval**: The model can match images with relevant textual descriptions and vice versa.
* **Multimodal Reasoning**: The model can combine information from text and images to perform advanced reasoning tasks.

**Things to try**

One key insight about NVLM-D-72B is its ability to maintain and even improve on its text-only performance after multimodal training. This suggests that the model has learned to effectively integrate visual and textual information, making it a powerful tool for a wide range of vision-language applications.
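To get started experimenting, the sketch below loads the model with Hugging Face Transformers and asks a text-only question. It is a minimal sketch, assuming the repository id `nvidia/NVLM-D-72B`, that the checkpoint ships custom modeling code (hence `trust_remote_code=True`), and that it exposes an InternVL-style `model.chat(...)` helper; the image preprocessing details on the model card may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed repository id; the custom architecture requires trust_remote_code=True.
path = "nvidia/NVLM-D-72B"
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Text-only query: pixel_values=None. For images, preprocess to pixel_values
# following the model card. The chat(...) signature below is assumed to follow
# the InternVL-style interface; verify against the card before use.
generation_config = dict(max_new_tokens=128, do_sample=False)
response = model.chat(tokenizer, None, "Hello, who are you?", generation_config)
print(response)
```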
Updated 10/14/2024
🖼️
Nemotron-4-340B-Instruct
588
The Nemotron-4-340B-Instruct is a large language model (LLM) developed by NVIDIA. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. The model has 340 billion parameters and supports a context length of 4,096 tokens.

The Nemotron-4-340B-Instruct model was trained on a diverse corpus of 9 trillion tokens, including English-based texts, 50+ natural languages, and 40+ coding languages. It then went through additional alignment steps, including supervised fine-tuning (SFT), direct preference optimization (DPO), and reward-aware preference optimization (RPO), using approximately 20,000 human-annotated examples. The result is a model that is aligned with human chat preferences, shows improvements in mathematical reasoning, coding, and instruction-following, and is capable of generating high-quality synthetic data for a variety of use cases.

**Model Inputs and Outputs**

Inputs
* **Text**: The Nemotron-4-340B-Instruct model takes natural language text as input, typically in the form of prompts or conversational exchanges.

Outputs
* **Text**: The model generates natural language text as output, which can include responses to prompts, continuations of conversations, or synthetic data.

**Capabilities**

The Nemotron-4-340B-Instruct model can be used for a variety of natural language processing tasks, including:

* **Chat and Conversation**: The model is optimized for English-based single and multi-turn chat use-cases, and can engage in coherent and helpful conversations.
* **Instruction-Following**: The model can understand and follow instructions, making it useful for task-oriented applications.
* **Mathematical Reasoning**: The model has improved capabilities in mathematical reasoning, which can be useful for educational or analytical applications.
* **Code Generation**: The model's training on coding languages allows it to generate high-quality code, making it suitable for developer assistance or programming-related tasks.
* **Synthetic Data Generation**: The model's alignment and optimization process makes it well-suited for generating high-quality synthetic data, which can be used to train other language models.

**What Can I Use It For?**

The Nemotron-4-340B-Instruct model can be used for a wide range of applications, particularly those that require natural language understanding, generation, and task-oriented capabilities. Some potential use cases include:

* **Chatbots and Virtual Assistants**: The model can be used to build conversational AI agents that can engage in helpful and coherent dialogues.
* **Educational and Tutoring Applications**: The model's capabilities in mathematical reasoning and instruction-following can be leveraged to create educational tools and virtual tutors.
* **Developer Assistance**: The model's ability to generate high-quality code can be used to build tools that assist software developers with programming-related tasks.
* **Synthetic Data Generation**: Companies and researchers can use the model to generate high-quality synthetic data for training their own language models, as described in the technical report.

**Things to Try**

One interesting aspect of the Nemotron-4-340B-Instruct model is its ability to follow instructions and engage in task-oriented dialogue. You could try prompting the model with open-ended questions or requests and observe how it responds and adapts to the task at hand. For example, you could ask the model to write a short story, solve a math problem, or provide step-by-step instructions for a particular task, and see how it performs.
Another interesting area to explore would be the model's capabilities in generating synthetic data. You could experiment with different prompts or techniques to guide the model's data generation, and then assess the quality and usefulness of the generated samples for training your own language models.
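Because the 340B checkpoint is deployed through the NeMo Framework rather than loaded in-process, the most portable thing to show is the prompt format. The sketch below builds a single-turn prompt in the style described on the model card; treat the exact `<extra_id_*>` tags as an assumption and verify them against the card.

```python
# Minimal sketch of the single-turn chat prompt template for
# Nemotron-4-340B-Instruct; the <extra_id_*> tags are reproduced from the
# model card's description and should be verified before use.
def build_prompt(user_message: str, system: str = "") -> str:
    return (
        "<extra_id_0>System\n"
        f"{system}\n"
        "<extra_id_1>User\n"
        f"{user_message}\n"
        "<extra_id_1>Assistant\n"
    )

prompt = build_prompt("Explain reward-aware preference optimization in two sentences.")
print(prompt)  # send this string to your NeMo/TensorRT-LLM deployment
```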
Updated 7/16/2024
🏅
Llama3-ChatQA-1.5-8B
475
The Llama3-ChatQA-1.5-8B model is a large language model developed by NVIDIA that excels at conversational question answering (QA) and retrieval-augmented generation (RAG). It was built on top of the Llama-3 base model and incorporates more conversational QA data to enhance its tabular and arithmetic calculation capabilities. There is also a larger 70B parameter version available.

**Model inputs and outputs**

Inputs
* **Text**: The model accepts text input to engage in conversational question answering and generation tasks.

Outputs
* **Text**: The model outputs generated text responses, providing answers to questions and generating relevant information.

**Capabilities**

The Llama3-ChatQA-1.5-8B model demonstrates strong performance on a variety of conversational QA and RAG benchmarks, outperforming models like ChatQA-1.0-7B, Llama-3-instruct-70b, and GPT-4-0613. It excels at tasks like document-grounded dialogue, multi-turn question answering, and open-ended conversational QA.

**What can I use it for?**

The Llama3-ChatQA-1.5-8B model is well-suited for building conversational AI assistants, chatbots, and other applications that require natural language understanding and generation capabilities. It could be used to power customer service chatbots, virtual assistants, educational tools, and more. The model's strong performance on QA and RAG tasks makes it a valuable resource for researchers and developers working on conversational AI systems.

**Things to try**

One interesting aspect of the Llama3-ChatQA-1.5-8B model is its ability to handle tabular and arithmetic calculation tasks, which can be useful for applications that require quantitative reasoning. Developers could explore using the model to power conversational interfaces for data analysis, financial planning, or other domains that involve numerical information.

Another interesting area to explore would be the model's performance on multi-turn dialogues and its ability to maintain context and coherence over the course of a conversation. Developers could experiment with using the model for open-ended chatting, task-oriented dialogues, or other interactive scenarios to further understand its conversational capabilities.
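As a hands-on starting point, here is a minimal sketch using Hugging Face Transformers, assuming the repository id `nvidia/Llama3-ChatQA-1.5-8B` and a ChatQA-style prompt (system line, grounding context, then the dialogue); the exact template on the model card may differ slightly.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama3-ChatQA-1.5-8B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Grounded QA with a small arithmetic twist, playing to the model's
# tabular/arithmetic strengths. Template is an approximation of the card's.
prompt = (
    "System: This is a chat between a user and an assistant. The assistant "
    "answers questions based on the context.\n\n"
    "Revenue grew 12% in Q3 to $4.2M, up from $3.75M in Q2.\n\n"
    "User: By how many dollars did revenue grow from Q2 to Q3?\n\n"
    "Assistant:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```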
Updated 6/1/2024
🔗
Llama3-ChatQA-1.5-70B
274
The Llama3-ChatQA-1.5-70B model is a large language model developed by NVIDIA that excels at conversational question answering (QA) and retrieval-augmented generation (RAG). It is built on top of the Llama-3 base model and incorporates more conversational QA data to enhance its tabular and arithmetic calculation capability. The model comes in two variants: Llama3-ChatQA-1.5-8B and Llama3-ChatQA-1.5-70B. Both models were originally trained using Megatron-LM and then converted to the Hugging Face format.

**Model Inputs and Outputs**

Inputs
* **Text**: The model takes text as input, which can be in the form of a conversation or a question.

Outputs
* **Text**: The model generates text as output, providing answers to questions or continuing a conversation.

**Capabilities**

The Llama3-ChatQA-1.5-70B model excels at conversational question answering and retrieval-augmented generation tasks. It has demonstrated strong performance on benchmarks such as ConvRAG, QuAC, QReCC, and ConvFinQA, outperforming other models like ChatQA-1.0-7B, Command-R-Plus, and Llama-3-instruct-70b.

**What can I use it for?**

The Llama3-ChatQA-1.5-70B model can be used for a variety of applications that involve question answering and conversational abilities, such as:

* Building intelligent chatbots or virtual assistants
* Enhancing search engines with more advanced query understanding and response generation
* Developing educational tools and tutoring systems
* Automating customer service and support interactions
* Assisting in research and analysis tasks by providing relevant information and insights

**Things to try**

One interesting aspect of the Llama3-ChatQA-1.5-70B model is its ability to handle tabular and arithmetic calculations as part of its conversational QA capabilities. You could try prompting the model with questions that involve numerical data or complex reasoning, and observe how it responds. Additionally, the model's retrieval-augmented generation capabilities allow it to provide responses that are grounded in relevant information, which can be useful for tasks that require fact-based answers.
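At 70B parameters, the checkpoint will not fit on a single consumer GPU, so a multi-GPU setup is the main practical difference from the 8B variant. A minimal sketch, assuming the repository id `nvidia/Llama3-ChatQA-1.5-70B` and relying on `device_map="auto"` (via the accelerate library) to shard the weights:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama3-ChatQA-1.5-70B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads the layers across all visible GPUs (and CPU RAM
# if needed); expect to need several 80 GB GPUs for bf16 weights.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# RAG-style usage: paste retrieved passages between the system line and the
# question. The template is an approximation of the model card's format.
context = "The checkpoints were trained with Megatron-LM, then converted to Hugging Face format."
prompt = (
    f"System: Answer based on the context.\n\n{context}\n\n"
    "User: What framework were the models originally trained with?\n\nAssistant:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=48)[0], skip_special_tokens=True))
```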
Updated 6/1/2024
🛠️
GPT-2B-001
191
GPT-2B-001 is a transformer-based language model developed by NVIDIA. It is part of the GPT family of models, similar to GPT-2 and GPT-3, with a total of 2 billion trainable parameters. The model was trained on 1.1 trillion tokens using NVIDIA's NeMo toolkit. Compared to similar models like gemma-2b-it, prometheus-13b-v1.0, and bge-reranker-base, GPT-2B-001 features several architectural improvements, including the use of the SwiGLU activation function, rotary positional embeddings, and a longer maximum sequence length of 4,096.

**Model inputs and outputs**

Inputs
* Text prompts of variable length, up to a maximum of 4,096 tokens.

Outputs
* Continuation of the input text, generated in an autoregressive manner.

The model can be used for a variety of text-to-text tasks, such as language modeling, text generation, and question answering.

**Capabilities**

GPT-2B-001 is a powerful language model capable of generating human-like text on a wide range of topics. It can be used for tasks such as creative writing, summarization, and even code generation. The model's large size and robust training process allow it to capture complex linguistic patterns and produce coherent, contextually relevant output.

**What can I use it for?**

GPT-2B-001 can be used for a variety of natural language processing tasks, including:

* **Content generation**: The model can be used to generate articles, stories, dialogue, and other forms of text. This can be useful for writers, content creators, and marketers.
* **Question answering**: The model can be fine-tuned to answer questions on a wide range of topics, making it useful for building conversational agents and knowledge-based applications.
* **Summarization**: The model can be used to generate concise summaries of longer text, which can be helpful for researchers, students, and business professionals.
* **Code generation**: The model can be used to generate code snippets and even complete programs, which can assist developers in their work.

**Things to try**

One interesting aspect of GPT-2B-001 is its ability to generate text that is both coherent and creative. Try prompting the model with a simple sentence or phrase and see how it expands upon the idea, generating new and unexpected content. You can also experiment with fine-tuning the model on specific datasets to see how it performs on more specialized tasks.

Another fascinating area to explore is the model's capability for reasoning and logical inference. Try presenting the model with prompts that require deductive or inductive reasoning, and observe how it approaches the problem and formulates its responses.
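GPT-2B-001 is distributed as a NeMo checkpoint rather than a native Transformers model, so the first step is fetching the `.nemo` file. A minimal sketch; the filename below is an assumption, so check the repository listing, and inference then runs through the NeMo toolkit (for example its `megatron_gpt_eval.py` server script) pointed at the downloaded checkpoint.

```python
from huggingface_hub import hf_hub_download

# Download the NeMo checkpoint. The filename is an assumed example; list the
# repo files to confirm the exact name before running.
ckpt_path = hf_hub_download(
    repo_id="nvidia/GPT-2B-001",
    filename="GPT-2B-001_bf16_tp1.nemo",
)
print(ckpt_path)
# Next step (outside this sketch): launch NeMo's megatron_gpt_eval.py with
# gpt_model_file=<ckpt_path> to serve autoregressive completions.
```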
Updated 5/28/2024
🏷️
canary-1b
191
The canary-1b model is a part of the NVIDIA NeMo Canary family of multi-lingual, multi-tasking models. With 1 billion parameters, the Canary-1B model supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). The model uses a FastConformer-Transformer encoder-decoder architecture.

**Model inputs and outputs**

Inputs
* Audio files or a jsonl manifest file containing audio data

Outputs
* Transcribed text in the specified language (English, German, French, Spanish)
* Translated text to/from the specified language pair

**Capabilities**

The Canary-1B model demonstrates state-of-the-art performance on multiple benchmarks for ASR and translation tasks in the supported languages. It can handle various accents, background noise, and technical language well.

**What can I use it for?**

The canary-1b model is well-suited for research on robust, multi-lingual speech recognition and translation. It can also be fine-tuned on specific datasets to improve performance for particular domains or applications. Developers may find it useful as a pre-trained model for building ASR or translation tools, especially for the supported languages.

**Things to try**

You can experiment with the canary-1b model by loading it using the NVIDIA NeMo toolkit. Try transcribing or translating audio samples in different languages, and compare the results to your expectations or other models. You can also fine-tune the model on your own data to see how it performs on specific tasks or domains.
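A minimal sketch of loading and running the model with the NeMo toolkit, following the pattern on the model card; note that the `transcribe` keyword has changed across NeMo releases (older versions use `paths2audio_files`), so adjust to your installed version.

```python
# Requires: pip install "nemo_toolkit[asr]"
from nemo.collections.asr.models import EncDecMultiTaskModel

# Load the pretrained Canary model from the Hugging Face/NGC registry.
canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

# Transcribe a local English recording. Task, source, and target language for
# translation are configured via a manifest or decoding config (see the card).
predictions = canary.transcribe(["sample.wav"], batch_size=1)
print(predictions[0])
```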
Updated 5/28/2024
📈
Llama-3.1-Minitron-4B-Width-Base
178
Llama-3.1-Minitron-4B-Width-Base is a base text-to-text model developed by NVIDIA that can be adapted for a variety of natural language generation tasks. It is obtained by pruning the larger Llama-3.1-8B model, specifically reducing the model embedding size, number of attention heads, and MLP intermediate dimension. The pruned model is then further trained with distillation using 94 billion tokens from the continuous pre-training data corpus used for Nemotron-4 15B.

Similar NVIDIA models include the Minitron-8B-Base and Nemotron-4-Minitron-4B-Base, which are also derived from larger language models through pruning and knowledge distillation. These compact models exhibit performance comparable to other community models while requiring significantly fewer training tokens and compute resources than training from scratch.

**Model Inputs and Outputs**

Inputs
* **Text**: The model takes text input in string format.
* **Parameters**: The model does not require any additional input parameters.
* **Other Properties**: The model performs best with input text less than 8,000 characters.

Outputs
* **Text**: The model generates text output in string format.
* **Output Parameters**: The output is a 1D sequence of text.

**Capabilities**

Llama-3.1-Minitron-4B-Width-Base is a powerful text generation model that can be used for a variety of natural language tasks. Its smaller size and reduced training requirements compared to the full Llama-3.1-8B model make it an attractive option for developers looking to deploy large language models in resource-constrained environments.

**What Can I Use It For?**

The Llama-3.1-Minitron-4B-Width-Base model can be used for a wide range of natural language generation tasks, such as chatbots, content generation, and language modeling. Its capabilities make it well-suited for commercial and research applications that require a balance of performance and efficiency.

**Things to Try**

One interesting aspect of the Llama-3.1-Minitron-4B-Width-Base model is its use of Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE), which can improve its inference scalability compared to standard transformer architectures. Developers may want to experiment with these architectural choices and their impact on the model's performance and capabilities.
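A quick way to poke at the base model is the Transformers `pipeline` API. A minimal sketch, assuming the repository id `nvidia/Llama-3.1-Minitron-4B-Width-Base`; as a base (non-instruct) model, it is best prompted with plain text to complete rather than chat-style instructions.

```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="nvidia/Llama-3.1-Minitron-4B-Width-Base",  # assumed repository id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Base models continue text, so phrase the prompt as a sentence to finish.
out = generator(
    "The main benefit of pruning and distilling a large language model is",
    max_new_tokens=48,
)
print(out[0]["generated_text"])
```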
Updated 9/18/2024
🛸
Llama-3_1-Nemotron-51B-Instruct
172
Llama-3_1-Nemotron-51B-Instruct is a large language model (LLM) that offers a strong tradeoff between model accuracy and efficiency. NVIDIA developed this model using a novel Neural Architecture Search (NAS) approach that greatly reduces the model's memory footprint, allowing it to serve larger workloads while fitting on a single GPU even under heavy load. This lets a desired point on the accuracy-efficiency tradeoff curve be selected. The model was fine-tuned on 40 billion tokens of data focused on English single and multi-turn chat use-cases.

This model is a derivative of the larger Llama-3.1-70B-instruct model, utilizing a block-wise distillation approach to create multiple variants with different quality vs. computational complexity tradeoffs. The final model was then aligned for human chat preferences through knowledge distillation.

**Model inputs and outputs**

Inputs
* **Text**: The model takes text-only input.

Outputs
* **Text**: The model generates text-only output.

**Capabilities**

Llama-3_1-Nemotron-51B-Instruct is capable of generating high-quality responses for a variety of natural language tasks, with a particular focus on dialogue and chat applications. The model has been optimized for English-based single and multi-turn chat use-cases, and can handle tasks like roleplaying, retrieval augmented generation, and function calling.

**What can I use it for?**

This model can be used in a variety of commercial applications that require natural language generation, such as chatbots, virtual assistants, and content creation tools. The emphasis on efficiency and accuracy-tuning makes it well-suited for scenarios where cost-effectiveness and performance are important, such as embedded or on-device deployments. NVIDIA provides additional resources and information on using this model, including a blog post and demo on build.nvidia.com.

**Things to try**

One interesting aspect of this model is its use of a novel NAS approach to balance accuracy and efficiency. Developers could experiment with using this model in resource-constrained environments, such as on-device applications, and compare its performance to other efficiency-focused models. Additionally, the model's focus on English chat applications presents opportunities to integrate it into interactive dialogue systems or virtual agents.
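A minimal sketch of chat-style inference with Transformers, assuming the repository id `nvidia/Llama-3_1-Nemotron-51B-Instruct`, that the NAS-derived architecture needs `trust_remote_code=True`, and that the tokenizer ships a chat template; verify all three against the model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3_1-Nemotron-51B-Instruct"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # assumed: non-standard NAS architecture
)

# Assumes the tokenizer provides a chat template, as most instruct models do.
messages = [{"role": "user", "content": "In two sentences, what does NAS pruning buy you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```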
Updated 10/14/2024
🤿
Mistral-NeMo-Minitron-8B-Base
146
The Mistral-NeMo-Minitron-8B-Base is a large language model (LLM) developed by NVIDIA. It is a pruned and distilled version of the larger Mistral-NeMo 12B model, with a reduced embedding dimension and MLP intermediate dimension. The model was obtained by continued training on 380 billion tokens using the same data corpus as the Nemotron-4 15B model.

Similar models in the Minitron and Nemotron families include the Minitron-8B-Base and Nemotron-4-Minitron-4B-Base, which were also derived from larger base models through pruning and distillation. These compact models are designed to provide similar performance to their larger counterparts while reducing the computational cost of training and inference.

**Model Inputs and Outputs**

Inputs
* **Text**: The Mistral-NeMo-Minitron-8B-Base model takes text input in the form of a string. It works well with input sequences up to 8,000 characters in length.

Outputs
* **Text**: The model generates text output in the form of a string. The output can be used for a variety of natural language generation tasks.

**Capabilities**

The Mistral-NeMo-Minitron-8B-Base model can be used for a wide range of text-to-text tasks, such as language generation, summarization, and translation. Its compact size and efficient architecture make it suitable for deployment on resource-constrained devices or in applications with low latency requirements.

**What Can I Use It For?**

The Mistral-NeMo-Minitron-8B-Base model can be used as a drop-in replacement for larger language models in various applications, such as:

* **Content Generation**: The model can be used to generate engaging and coherent text for applications like chatbots, creative writing assistants, or product descriptions.
* **Summarization**: The model can be used to summarize long-form text, making it easier for users to quickly grasp the key points.
* **Translation**: The model's multilingual capabilities allow it to be used for cross-lingual translation tasks.
* **Code Generation**: The model's familiarity with code syntax and structure makes it a useful tool for generating or completing code snippets.

**Things to Try**

One interesting aspect of the Mistral-NeMo-Minitron-8B-Base model is its ability to generate diverse and coherent text while using relatively few parameters. This makes it well-suited for applications with strict resource constraints, such as edge devices or mobile apps. Developers could experiment with using the model for tasks like personalized content generation, where the compact size allows for deployment closer to the user.

Another interesting area to explore is the model's performance on specialized tasks or datasets, such as legal or scientific text generation. The model's strong foundation in multidomain data may allow it to adapt well to these specialized use cases with minimal fine-tuning.
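A minimal completion sketch with Transformers, assuming the repository id `nvidia/Mistral-NeMo-Minitron-8B-Base`; since this is a base model, the prompt is a stub to continue, here a code snippet to play to the code-completion use case above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Mistral-NeMo-Minitron-8B-Base"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Base models complete text; give the model a function signature to finish.
prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```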
Updated 9/21/2024
📉
Nemotron-4-340B-Base
132
Nemotron-4-340B-Base is a large language model (LLM) developed by NVIDIA that can be used as part of a synthetic data generation pipeline. With 340 billion parameters and support for a context length of 4,096 tokens, this multilingual model was pre-trained on a diverse dataset of over 50 natural languages and 40 coding languages. After an initial pre-training phase of 8 trillion tokens, the model underwent continuous pre-training on an additional 1 trillion tokens to improve quality.

Similar models include the Nemotron-3-8B-Base-4k, a smaller enterprise-ready 8 billion parameter model, and the GPT-2B-001, a 2 billion parameter multilingual model with architectural improvements.

**Model Inputs and Outputs**

Nemotron-4-340B-Base is a powerful text generation model that can be used for a variety of natural language tasks. The model accepts textual inputs and generates corresponding text outputs.

Inputs
* Textual prompts in over 50 natural languages and 40 coding languages

Outputs
* Coherent, contextually relevant text continuations based on the input prompts

**Capabilities**

Nemotron-4-340B-Base excels at a range of natural language tasks, including text generation, translation, code generation, and more. The model's large scale and broad multilingual capabilities make it a versatile tool for researchers and developers looking to build advanced language AI applications.

**What Can I Use It For?**

Nemotron-4-340B-Base is well-suited for use cases that require high-quality, diverse language generation, such as:

* Synthetic data generation for training custom language models
* Multilingual chatbots and virtual assistants
* Automated content creation for websites, blogs, and social media
* Code generation and programming assistants

By leveraging the NVIDIA NeMo Framework and tools like Parameter-Efficient Fine-Tuning and Model Alignment, users can further customize Nemotron-4-340B-Base to their specific needs.

**Things to Try**

One interesting aspect of Nemotron-4-340B-Base is its ability to generate text in a wide range of languages. Try prompting the model with inputs in different languages and observe the quality and coherence of the generated outputs. You can also experiment with combining the model's multilingual capabilities with tasks like translation or cross-lingual information retrieval.

Another area worth exploring is the model's potential for synthetic data generation. By fine-tuning Nemotron-4-340B-Base on specific datasets or domains, you can create custom language models tailored to your needs, while leveraging the broad knowledge and capabilities of the base model.
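Since the 340B checkpoint is served through the NeMo Framework across multiple GPUs rather than loaded in-process, the sketch below is deliberately schematic: it shows the shape of a synthetic-data loop, with a placeholder `generate` function standing in for whatever endpoint hosts your deployment.

```python
import json

def generate(prompt: str) -> str:
    # Placeholder: replace with a call to your Nemotron-4-340B-Base deployment
    # (e.g. a NeMo Framework inference server endpoint).
    raise NotImplementedError("wire this to your model endpoint")

# Seed topics drive diverse synthetic Q&A pairs for downstream training.
seed_topics = ["photosynthesis", "binary search", "supply and demand"]

with open("synthetic_qa.jsonl", "w") as f:
    for topic in seed_topics:
        prompt = f"Write one clear, self-contained question and answer about {topic}.\n"
        sample = generate(prompt)
        f.write(json.dumps({"topic": topic, "sample": sample}) + "\n")
```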
Updated 7/16/2024