AI Models

Browse and discover AI models across various categories.


whisperx

erium

Total Score

9.8K

WhisperX is an automatic speech recognition (ASR) model that builds upon OpenAI's Whisper model, providing improved timestamp accuracy and speaker diarization capabilities. Developed by the Replicate maintainer erium, WhisperX incorporates forced phoneme alignment and voice activity detection (VAD) to produce transcripts with accurate word-level timestamps, and it can identify which speaker said each word. Compared to similar models like whisper-diarization, WhisperX offers faster inference (up to 70x real-time) and improved accuracy for long-form audio transcription. It is particularly useful for applications that require precise word timing and speaker identification, such as video subtitling, meeting transcription, and audio indexing.

Model inputs and outputs

WhisperX takes an audio file as input and produces a transcript with word-level timestamps and optional speaker labels. The model supports a variety of input audio formats and can handle multiple languages, with default models provided for languages like English, German, French, and more.

Inputs

- **Audio file**: The audio file to be transcribed, in a supported format (e.g., WAV, MP3, FLAC).
- **Language**: The language of the audio file, which is automatically detected if not provided. Supported languages include English, German, French, Spanish, Italian, Japanese, and Chinese, among others.
- **Diarization**: An optional flag to enable speaker diarization, which identifies and labels different speakers in the audio.

Outputs

- **Transcript**: The transcribed text of the audio, with word-level timestamps and optional speaker labels.
- **Alignment information**: Details about the alignment of the transcript to the audio, including the start and end times of each word.
- **Diarization information**: If enabled, the speaker label assigned to each word in the transcript.

Capabilities

WhisperX excels at transcribing long-form audio with high accuracy and precise word timing. Its forced alignment and VAD-based preprocessing result in significantly improved timestamp accuracy compared to the original Whisper model, which can be crucial for applications like video subtitling and meeting transcription. The speaker diarization capability lets WhisperX identify different speakers within the audio, making it useful for multi-speaker scenarios such as interviews or panel discussions and simplifying the post-processing and analysis of transcripts in complex audio environments.

What can I use it for?

WhisperX is well suited to applications that require accurate speech-to-text transcription, precise word timing, and speaker identification. Some potential use cases include:

- **Video subtitling and captioning**: Accurate word-level timestamps and speaker labels streamline the creation of subtitles and captions for video content.
- **Meeting and lecture transcription**: Capture discussions in meetings, lectures, and webinars, with speaker identification to help organize the transcript.
- **Audio indexing and search**: Detailed transcript and timing information enables more advanced indexing and search for audio archives and podcasts.
- **Assistive technology**: Speaker diarization and word-level timestamps can support applications like real-time captioning for deaf and hard-of-hearing users.

Things to try

One interesting aspect of WhisperX is its ability to handle long-form audio efficiently, thanks to batched inference and VAD-based preprocessing. This makes it well suited to transcribing lengthy recordings, such as interviews, podcasts, or webinars, without sacrificing accuracy or speed. Another key feature to explore is speaker diarization: knowing who said what is crucial for understanding the context and flow of a meeting or conversation. Finally, the model's multilingual support lets you transcribe audio in a variety of languages; experimenting with different languages and benchmarking performance can help you determine the best fit for your use case.
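The open-source whisperx Python package exposes the transcribe, align, and diarize steps described above. Below is a minimal sketch based on the library's documented workflow; the checkpoint name, device, audio path, and Hugging Face token are placeholders, and exact function locations can differ between whisperx versions.

```python
# Minimal sketch of a transcribe -> align -> diarize pipeline with whisperx.
# Model name, device, audio path, and the HF token are placeholders.
import whisperx

device = "cuda"              # or "cpu"
audio_file = "meeting.wav"   # placeholder path

# 1. Transcribe with batched Whisper inference
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Force-align the output for accurate word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Assign speaker labels (diarization needs a Hugging Face token)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for segment in result["segments"]:
    print(segment.get("speaker"), segment["start"], segment["end"], segment["text"])
```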


Updated 6/21/2024

🌀

seine-transition

leclem

Total Score

839

seine-transition is a video diffusion model developed by the researcher leclem. It is part of the larger Vchitect video generation system, which also includes the text-to-video framework LaVie. The model generates a video that transitions smoothly from one image to another, creating a seamless visual effect. Compared to similar models like i2vgen-xl and video-morpher, seine-transition focuses specifically on generating transition videos rather than general image-to-video or video morphing. This specialized approach allows it to produce high-quality transition effects that preserve the content and style of the input images.

Model inputs and outputs

The seine-transition model takes two input images and generates a video that transitions between them. The input images can depict any subject matter, and the model attempts to create a smooth, realistic transition that blends the elements of the two images.

Inputs

- **Image**: The first input image, which the video starts from.
- **Image2**: The second input image, which the video transitions to.
- **Width**: The desired width of the output video.
- **Height**: The desired height of the output video.
- **Num Frames**: The number of frames in the output video.
- **Run Time**: The total duration of the output video in seconds.
- **Cfg Scale**: The scale for classifier-free guidance, which affects the balance between content and style.
- **Num Sampling Steps**: The number of steps used in the diffusion sampling process.

Outputs

- **Output**: A video file that transitions smoothly from the first input image to the second, with the specified dimensions, frame count, and duration.

Capabilities

The seine-transition model generates high-quality videos that transition between two input images in a visually compelling way. The transitions preserve the content and style of the original images, creating a seamless, natural-looking effect. For example, the model can transition from a close-up shot of a cherry blossom tree to a wide-angle view of an alien planet with a cherry blossom forest, or turn a superhero character into a sand sculpture. It handles a variety of subjects and styles, making it a versatile tool for visual artists and content creators.

What can I use it for?

seine-transition can create visually striking and engaging video content for applications such as:

- **Film and video production**: Filmmakers and video editors can create smooth, dynamic transitions between scenes, adding visual flair to their projects.
- **VFX and motion design**: Artists and designers can generate unique, eye-catching transition effects for motion graphics, title sequences, and other visual effects.
- **Social media and content creation**: Creators can produce attention-grabbing videos for platforms like TikTok, Instagram, and YouTube, where visually compelling content is highly valued.
- **Advertising and marketing**: Businesses and marketing teams can build captivating video advertisements and promotional materials that stand out from the competition.

Things to try

One interesting aspect of the seine-transition model is its ability to handle a wide range of subject matter and styles. Try different types of input images, such as realistic scenes, abstract art, or even 3D renders, and see how the model handles the transitions. Another area to explore is the impact of the input parameters, such as the number of frames, run time, and cfg scale: adjusting these can shift the transition style from slow and cinematic to fast and dynamic, and understanding how they affect the output lets you fine-tune toward your desired visual effect. Additionally, you can combine seine-transition with other AI-powered tools, such as text-to-image or video-to-video generation models, to create even more complex and compelling visual experiences.
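Since the model is hosted on Replicate, one way to drive it is the Replicate Python client. The sketch below maps the inputs listed above onto a `replicate.run` call; the model slug is an assumption (confirm the exact owner/name:version on the model page), and the parameter values are illustrative.

```python
# Minimal sketch of calling seine-transition via the Replicate Python client.
# The model slug/version and exact parameter names are assumptions based on
# the inputs listed above; check the Replicate model page for the real schema.
import replicate

output = replicate.run(
    "leclem/seine-transition",  # hypothetical slug; confirm on Replicate
    input={
        "image": open("start_frame.png", "rb"),
        "image2": open("end_frame.png", "rb"),
        "width": 512,
        "height": 320,
        "num_frames": 16,
        "run_time": 4,
        "cfg_scale": 7.5,
        "num_sampling_steps": 50,
    },
)
print(output)  # typically a URL to the generated transition video
```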


Updated 6/21/2024

↗️

Nemotron-4-340B-Instruct

nvidia

Total Score

476

The Nemotron-4-340B-Instruct is a large language model (LLM) developed by NVIDIA. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English single- and multi-turn chat use cases. The model has 340 billion parameters and supports a context length of 4,096 tokens. It was trained on a diverse corpus of 9 trillion tokens, including English-based texts, 50+ natural languages, and 40+ coding languages, and then went through additional alignment steps, including supervised fine-tuning (SFT), direct preference optimization (DPO), and reward-aware preference optimization (RPO), using approximately 20K human-annotated examples. The result is a model aligned to human chat preferences, with improved mathematical reasoning, coding, and instruction-following, that can also generate high-quality synthetic data for a variety of use cases.

Model Inputs and Outputs

Inputs

- **Text**: Natural language text, typically in the form of prompts or conversational exchanges.

Outputs

- **Text**: Generated natural language, such as responses to prompts, continuations of conversations, or synthetic data.

Capabilities

The Nemotron-4-340B-Instruct model can be used for a variety of natural language processing tasks, including:

- **Chat and conversation**: Optimized for English single- and multi-turn chat, it can engage in coherent and helpful conversations.
- **Instruction-following**: The model can understand and follow instructions, making it useful for task-oriented applications.
- **Mathematical reasoning**: Improved reasoning capabilities support educational and analytical applications.
- **Code generation**: Training on coding languages allows it to generate high-quality code for developer assistance and programming-related tasks.
- **Synthetic data generation**: The model's alignment and optimization process makes it well suited to generating high-quality synthetic data for training other language models.

What Can I Use It For?

The Nemotron-4-340B-Instruct model fits a wide range of applications that require natural language understanding, generation, and task-oriented capabilities. Some potential use cases include:

- **Chatbots and virtual assistants**: Conversational AI agents that engage in helpful, coherent dialogue.
- **Educational and tutoring applications**: Tools and virtual tutors that leverage the model's mathematical reasoning and instruction-following.
- **Developer assistance**: Tools that help software developers with programming-related tasks through high-quality code generation.
- **Synthetic data generation**: Companies and researchers can generate high-quality synthetic data for training their own language models, as described in the technical report.

Things to Try

One interesting aspect of the Nemotron-4-340B-Instruct model is its ability to follow instructions and engage in task-oriented dialogue. Try prompting it with open-ended questions or requests and observe how it responds and adapts: ask it to write a short story, solve a math problem, or provide step-by-step instructions for a particular task, and see how it performs. Another interesting area to explore is synthetic data generation: experiment with different prompts or techniques to guide the generation, then assess the quality and usefulness of the samples for training your own language models.
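In practice the 340B model is served through NVIDIA's NeMo / TensorRT-LLM stack or a hosted API rather than loaded locally. The sketch below assumes an OpenAI-compatible endpoint; the base URL, API key, and model identifier are placeholders and should be adjusted to wherever the model is actually deployed.

```python
# Minimal sketch: querying Nemotron-4-340B-Instruct through an
# OpenAI-compatible endpoint. base_url and model id are assumptions;
# adjust them to your deployment or hosted API.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted endpoint
    api_key="nvapi-...",                             # placeholder key
)

response = client.chat.completions.create(
    model="nvidia/nemotron-4-340b-instruct",         # assumed model id
    messages=[
        {"role": "system", "content": "You are a concise, helpful assistant."},
        {"role": "user", "content": "Explain step by step how to compute 17 * 24."},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(response.choices[0].message.content)
```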


Updated 6/20/2024

🖼️

New! Florence-2-large

microsoft

Total Score

227

The Florence-2 model is an advanced vision foundation model from Microsoft that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. It leverages the FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, making it a competitive vision foundation model. It comes in base and large versions, with the large version having 0.77 billion parameters; fine-tuned versions of both sizes are also available. The Florence-2-large-ft model in particular has been fine-tuned on a collection of downstream tasks.

Model inputs and outputs

Florence-2 interprets simple text prompts to perform a variety of vision tasks, including captioning, object detection, and segmentation. The model takes an image and a text prompt as input and, depending on the task, generates text, bounding boxes, or segmentation maps as output.

Inputs

- **Image**: The image to process.
- **Text prompt**: A prompt describing the desired task, such as detecting the objects in an image or captioning it.

Outputs

- **Text**: For tasks like captioning, descriptive text about the image contents.
- **Bounding boxes and labels**: For object detection tasks, bounding boxes around detected objects along with class labels.
- **Segmentation masks**: Pixel-wise segmentation masks for semantic segmentation tasks.

Capabilities

Florence-2 can perform a wide range of vision and vision-language tasks through its prompt-based approach. For example, it can caption an image with descriptive text, identify and localize objects, or assign a class label to every pixel for semantic segmentation. A key capability is that the task is selected simply by changing the text prompt, without any additional fine-tuning.

What can I use it for?

Florence-2 can be useful in applications that involve vision and language understanding, such as:

- **Content creation**: Automatically generated captions and annotations for images, helpful for image search, visual storytelling, and content organization.
- **Accessibility**: Detailed descriptions of visual content that improve accessibility for visually impaired users.
- **Robotics and autonomous systems**: Perception and language understanding that help robotic systems interact with and make sense of their visual environments.
- **Education and research**: A platform for exploring the intersection of computer vision and natural language processing and for developing new applications on top of it.

Things to try

One interesting aspect of Florence-2 is its ability to handle a diverse range of vision tasks through prompts. Experiment with different task prompts and see how the outputs change: for example, task tokens such as "<CAPTION>", "<OD>", and "<DENSE_REGION_CAPTION>" ask the model for captions, object detection results, and dense region captions, respectively. Another thing to try is fine-tuning the model on your own dataset; the Florence-2-large-ft model demonstrates how performance on specific tasks can be further improved through fine-tuning.
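The sketch below follows the usage pattern shown on the Hugging Face model card: load the processor and model with `trust_remote_code=True`, pass a task token plus an image, and post-process the generated text into structured results. Minor details (dtypes, image URL) are placeholders and may differ across transformers versions.

```python
# Minimal sketch of zero-shot object detection with Florence-2-large,
# based on the Hugging Face model card. The image URL is a placeholder.
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float32, trust_remote_code=True
)

image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)
prompt = "<OD>"  # object detection task token

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Turn the raw token string into bounding boxes and labels
parsed = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)
print(parsed)
```

Swapping the task token (for example to "<CAPTION>" or "<DENSE_REGION_CAPTION>") changes what the same model returns, which is the prompt-based behavior described above.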


Updated 6/20/2024

🌐

New! DeepSeek-Coder-V2-Instruct

deepseek-ai

Total Score

149

DeepSeek-Coder-V2 is an open-source Mixture-of-Experts (MoE) code language model that builds upon the capabilities of the earlier DeepSeek-V2 model. Compared to its predecessor, it demonstrates significant advancements in code-related tasks as well as reasoning and general capabilities. The model was further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens, enhancing its coding and mathematical reasoning abilities while maintaining comparable performance on general language tasks. One key distinction is that DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338 and extends the context length from 16K to 128K tokens, making it a more flexible and powerful code intelligence tool. Its strong performance on benchmarks like HumanEval, MultiPL-E, MBPP, DS-1000, and APPS, as highlighted in the paper, further underscores its capabilities compared to other open-source code models.

Model inputs and outputs

DeepSeek-Coder-V2 is a text-to-text model that can handle a wide range of code-related tasks, from code generation and completion to code understanding and reasoning. The model takes natural language prompts or partial code snippets as input and generates relevant code or text outputs.

Inputs

- Natural language prompts describing a coding task or problem.
- Incomplete or partial code snippets for the model to complete or expand upon.

Outputs

- Generated code in a variety of programming languages.
- Explanations or insights about the provided code.
- Solutions to coding problems or challenges.

Capabilities

DeepSeek-Coder-V2 demonstrates impressive capabilities in a variety of code-related tasks, including but not limited to:

- **Code generation**: Producing complete, functioning code in response to natural language prompts, such as "Write a quicksort algorithm in Python."
- **Code completion**: Intelligently completing partially provided code, filling in the missing parts based on the context.
- **Code understanding**: Analyzing and explaining existing code, providing insights into its logic, structure, and potential improvements.
- **Mathematical reasoning**: Strong mathematical reasoning capabilities that make it a valuable tool for solving algorithmic problems.

What can I use it for?

With its robust coding and reasoning abilities, DeepSeek-Coder-V2 can be a valuable asset for a wide range of applications and use cases, including:

- **Automated code generation**: Generating boilerplate code, implementing common algorithms, or even creating complete applications from high-level requirements.
- **Code assistance and productivity tools**: Integration into IDEs or code editors for intelligent completion, refactoring suggestions, and explanations.
- **Educational and training applications**: Interactive coding exercises, tutorials, and learning resources for students and aspiring developers.
- **AI-powered programming assistants**: A foundation for advanced assistants that engage in natural language dialogue, understand user intent, and provide comprehensive code-related support.

Things to try

One interesting aspect of DeepSeek-Coder-V2 is its ability to handle large-scale, project-level code contexts, thanks to its extended 128K context length. This makes it well suited to tasks like repository-level code completion, where it can predict and generate code based on the overall structure and context of a codebase. Another intriguing direction is exploring its mathematical reasoning beyond coding tasks: experiment with prompts that combine natural language and symbolic mathematical expressions, and observe how it handles problem solving, derivations, and explanations. Overall, the versatility and advanced capabilities of DeepSeek-Coder-V2 make it a compelling open-source resource for a wide range of code-related applications and research endeavors.
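A minimal sketch of prompting the instruct checkpoint with Hugging Face transformers follows. The repo id mirrors the hub naming; in practice the full model is a very large MoE that needs multi-GPU hardware or a hosted endpoint, so treat this as an outline of the call pattern rather than a turnkey script.

```python
# Minimal sketch of chat-style prompting for DeepSeek-Coder-V2-Instruct.
# Repo id per the Hugging Face hub; hardware requirements are substantial.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "deepseek-ai/DeepSeek-Coder-V2-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a quicksort algorithm in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated portion after the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```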


Updated 6/20/2024

📶

Nemotron-4-340B-Base

nvidia

Total Score

111

Nemotron-4-340B-Base is a large language model (LLM) developed by NVIDIA that can be used as part of a synthetic data generation pipeline. With 340 billion parameters and support for a context length of 4,096 tokens, this multilingual model was pre-trained on a diverse dataset of over 50 natural languages and 40 coding languages. After an initial pre-training phase of 8 trillion tokens, the model underwent continuous pre-training on an additional 1 trillion tokens to improve quality. Similar models include the Nemotron-3-8B-Base-4k, a smaller enterprise-ready 8 billion parameter model, and the GPT-2B-001, a 2 billion parameter multilingual model with architectural improvements.

Model Inputs and Outputs

Nemotron-4-340B-Base is a powerful text generation model that can be used for a variety of natural language tasks. The model accepts textual inputs and generates corresponding text outputs.

Inputs

- Textual prompts in over 50 natural languages and 40 coding languages.

Outputs

- Coherent, contextually relevant text continuations based on the input prompts.

Capabilities

Nemotron-4-340B-Base excels at a range of natural language tasks, including text generation, translation, code generation, and more. The model's large scale and broad multilingual capabilities make it a versatile tool for researchers and developers building advanced language AI applications.

What Can I Use It For?

Nemotron-4-340B-Base is well suited to use cases that require high-quality, diverse language generation, such as:

- Synthetic data generation for training custom language models.
- Multilingual chatbots and virtual assistants.
- Automated content creation for websites, blogs, and social media.
- Code generation and programming assistants.

By leveraging the NVIDIA NeMo Framework and tools like Parameter-Efficient Fine-Tuning and Model Alignment, users can further customize Nemotron-4-340B-Base to their specific needs.

Things to Try

One interesting aspect of Nemotron-4-340B-Base is its ability to generate text in a wide range of languages. Try prompting the model with inputs in different languages and observe the quality and coherence of the generated outputs, or combine its multilingual capabilities with tasks like translation or cross-lingual information retrieval. Another area worth exploring is synthetic data generation: by fine-tuning Nemotron-4-340B-Base on specific datasets or domains, you can create custom language models tailored to your needs while leveraging the broad knowledge and capabilities of the base model.
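Because this is a base (non-chat) model, completion-style prompting is the natural way to probe its multilingual generation. The sketch below assumes the model is served behind an OpenAI-compatible completions endpoint; the base URL and model id are placeholders, not confirmed identifiers, and the actual deployment path is normally the NeMo / TensorRT-LLM stack.

```python
# Minimal sketch of completion-style multilingual prompting, assuming an
# OpenAI-compatible completions endpoint. base_url and model id are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://your-nemotron-endpoint/v1", api_key="...")

prompts = [
    "The three laws of thermodynamics state that",              # English
    "Die drei Hauptsätze der Thermodynamik besagen, dass",      # German
    "Les trois principes de la thermodynamique énoncent que",   # French
]

for prompt in prompts:
    completion = client.completions.create(
        model="nemotron-4-340b-base",   # placeholder model id
        prompt=prompt,
        max_tokens=128,
        temperature=0.7,
    )
    print(prompt, "->", completion.choices[0].text.strip())
```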


Updated 6/20/2024

New! Florence-2-large-ft

microsoft

Total Score

109

The Florence-2-large-ft model is a large-scale, 0.77B parameter vision transformer developed by Microsoft. It is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2-large-ft builds on the Florence-2-base and Florence-2-large models, which were pretrained on the FLD-5B dataset containing 5.4 billion annotations across 126 million images. The fine-tuned Florence-2-large-ft version excels at zero-shot and fine-tuned performance on tasks like captioning, object detection, and segmentation. Similar large vision-language models include Kosmos-2 from Microsoft, Phi-2 from Microsoft, and BLIP-2 from Salesforce.

Model Inputs and Outputs

Inputs

- **Text prompt**: A text prompt that specifies the task the model should perform, such as captioning, object detection, or segmentation.
- **Image**: The image the model should process based on the provided text prompt.

Outputs

- **Processed result**: The model's interpretation of the input image, such as detected objects, segmented regions, or a caption describing the image.

Capabilities

The Florence-2-large-ft model can handle a wide range of vision and vision-language tasks in a zero-shot or fine-tuned manner. For example, a short task prompt such as "<OD>" asks the model to perform object detection on an image, while "<CAPTION>" asks it to generate a caption. This versatile prompt-based approach allows the model to be applied to a variety of use cases with minimal fine-tuning.

What Can I Use It For?

The Florence-2-large-ft model can be used in a variety of computer vision and multimodal applications, such as:

- **Image captioning**: Generating detailed descriptions of the contents of an image.
- **Object detection**: Identifying and localizing objects in an image based on a text prompt.
- **Image segmentation**: Semantically segmenting an image into different regions or objects.
- **Visual question answering**: Answering questions about the contents of an image.
- **Image-to-text generation**: Generating relevant text descriptions for an input image.

Companies and researchers can use Florence-2-large-ft as a powerful building block for their own computer vision and multimodal applications, either by fine-tuning it on specific datasets or using it zero-shot.

Things to Try

One interesting aspect of the Florence-2-large-ft model is that a single model handles many vision-language tasks through simple text prompts. Try pairing a task token with free text, for example "<CAPTION_TO_PHRASE_GROUNDING> Find all the dogs in this image" or "<REFERRING_EXPRESSION_SEGMENTATION> Segment the person in this image", or ask for a detailed description with "<MORE_DETAILED_CAPTION>". The model's versatility lets it cover many different use cases, so feel free to get creative and see what kinds of tasks you can get it to perform.
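The grounding pattern mentioned above (a task token followed by free text) can be run with the fine-tuned checkpoint much like the zero-shot example earlier in this list. The repo id and task token follow the Hugging Face model card; the image URL and query phrase are placeholders.

```python
# Minimal sketch of phrase grounding with the fine-tuned checkpoint:
# a task token plus free text locates the phrase in the image.
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "microsoft/Florence-2-large-ft"
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float32, trust_remote_code=True
)

image = Image.open(requests.get("https://example.com/dogs.jpg", stream=True).raw)
task = "<CAPTION_TO_PHRASE_GROUNDING>"
prompt = task + "a dog"  # task token followed by the phrase to ground

inputs = processor(text=prompt, images=image, return_tensors="pt")
ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
)
text = processor.batch_decode(ids, skip_special_tokens=False)[0]

# Convert the raw token string into bounding boxes for the grounded phrase
print(processor.post_process_generation(text, task=task, image_size=image.size))
```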


Updated 6/20/2024

🛠️

New! DeepSeek-Coder-V2-Lite-Instruct

deepseek-ai

Total Score

99

DeepSeek-Coder-V2-Lite-Instruct is an open-source Mixture-of-Experts (MoE) code language model developed by deepseek-ai, part of a model family that achieves performance comparable to GPT4-Turbo on code-specific tasks. It is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens, substantially enhancing coding and mathematical reasoning while maintaining comparable performance on general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, it expands support for programming languages from 86 to 338 and extends the context length from 16K to 128K tokens. The model is part of a series of code language models from DeepSeek, including deepseek-coder-1.3b-instruct, deepseek-coder-6.7b-instruct, and deepseek-coder-33b-instruct, which were trained from scratch on 2 trillion tokens with 87% code and 13% natural language data in English and Chinese.

Model inputs and outputs

Inputs

- Raw text input for code completion, code insertion, and chat completion tasks.

Outputs

- Completed or generated code based on the input prompt.
- Responses to chat prompts, including code-related tasks.

Capabilities

The DeepSeek-Coder-V2 family demonstrates state-of-the-art results on code-related benchmarks such as HumanEval, MultiPL-E, MBPP, DS-1000, and APPS, with the full-size model reported to match or outperform closed-source models like GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro. The Lite variant can handle a wide range of programming languages, from Python and C++ to more exotic ones, and can assist with tasks like code completion, code generation, code refactoring, and even mathematical reasoning.

What can I use it for?

You can use DeepSeek-Coder-V2-Lite-Instruct for a variety of code-related tasks, such as:

- **Code completion**: Suggesting relevant code completions to speed up the coding process.
- **Code generation**: Producing working code snippets from a description or high-level requirements.
- **Code refactoring**: Restructuring and optimizing existing code for improved performance and maintainability.
- **Programming tutorials and education**: Generating explanations, examples, and step-by-step guides for learning programming concepts and techniques.
- **Chatbot integration**: Embedding code-related support and assistance into chatbots or virtual assistants.

By leveraging the open-source nature and strong performance of DeepSeek-Coder-V2-Lite-Instruct, developers and companies can build innovative applications and services on top of its code intelligence capabilities.

Things to try

One interesting aspect of DeepSeek-Coder-V2-Lite-Instruct is its ability to handle long-range dependencies and project-level code understanding. Try providing the model with a partially complete codebase and see how it fills in the missing pieces or suggests relevant additions to complete the project. Additionally, experiment with its versatility by challenging it with code problems in a wide range of programming languages, not just the typical suspects like Python and Java.
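Unlike the full-size model, the Lite checkpoint is small enough that local experimentation is plausible. Here is a minimal sketch of asking it to finish a partial function; the repo id follows the Hugging Face hub naming, and the bf16/single-GPU setup is an assumption about typical hardware.

```python
# Minimal sketch: asking the Lite instruct model to finish a partial function.
# Repo id per the Hugging Face hub; hardware assumptions are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

partial_code = '''def moving_average(values, window):
    """Return the simple moving average of `values` with the given window."""
'''
messages = [{"role": "user", "content": "Complete this function:\n\n" + partial_code}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```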


Updated 6/20/2024

👀

New! marigold-depth-v1-0

prs-eth

Total Score

97

marigold-depth-v1-0 is a diffusion model developed by prs-eth that has been fine-tuned for monocular depth estimation. It is derived from the Stable Diffusion model and leverages the rich visual knowledge stored in modern generative image models. The model was fine-tuned on synthetic data and can transfer zero-shot to unseen data, offering state-of-the-art monocular depth estimation results. Similar models include marigold-v1-0 and marigold, which also focus on monocular depth estimation, as well as stable-diffusion-depth2img, which creates variations of an image while preserving shape and depth.

Model inputs and outputs

Inputs

- RGB image

Outputs

- Monocular depth map

Capabilities

marigold-depth-v1-0 is a powerful tool for generating accurate depth maps from single RGB images. It handles a wide variety of scenes and objects, from indoor environments to outdoor landscapes, and its zero-shot transfer to unseen data makes it a versatile solution for many depth estimation applications.

What can I use it for?

The marigold-depth-v1-0 model can be used in a variety of applications that require depth information, such as:

- Augmented reality and virtual reality experiences
- Autonomous navigation for robots and drones
- 3D reconstruction from single images
- Improved image segmentation and scene understanding

By leveraging the model's capabilities, developers can create innovative solutions that use depth data to enhance their products and services.

Things to try

One interesting aspect of marigold-depth-v1-0 is its ability to generate depth maps from a wide range of image types, including natural scenes, indoor environments, and even abstract or artistic compositions. Experimenting with different input images reveals the model's flexibility and versatility. Additionally, you can explore the impact of different fine-tuning strategies or data augmentation techniques on the model's performance, potentially leading to further improvements in depth estimation accuracy.
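Recent diffusers releases ship dedicated Marigold pipelines, which makes trying the model on your own images straightforward. The sketch below follows the diffusers documentation for the depth pipeline; the checkpoint name and API details should be treated as assumptions to verify against your installed version, and the image URL is a placeholder.

```python
# Minimal sketch using the Marigold depth pipeline in recent diffusers releases.
# Checkpoint name and pipeline API per the diffusers docs; image URL is a placeholder.
import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-0", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://example.com/room.jpg")
depth = pipe(image)  # runs the diffusion-based depth estimator

# Colorize the raw prediction into a PIL image for quick inspection
vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("depth_colored.png")
```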


Updated 6/20/2024

📶

New! multi-token-prediction

facebook

Total Score

95

The multi-token-prediction model, developed by Facebook, is a 7B parameter language model trained on code, released alongside baseline models trained on 200 billion and 1 trillion tokens of code. Unlike the baselines, the multi-token prediction model is trained to predict multiple tokens at once rather than just the next single token, which can lead to faster generation of code-like text. The model is compatible with the standard LLaMA 2 SentencePiece tokenizer, which is included in the repository, and the implementation of its forward pass can return either the standard next-token logits or the logits for multiple future tokens.

Model inputs and outputs

Inputs

- **Text prompts**: The model takes text prompts as input, similar to other autoregressive language models.
- **return_all_heads flag**: An optional flag that makes the forward pass return the logits for multiple future tokens rather than just the next token.

Outputs

- **Next-token logits**: The standard output is the logits for the next token in the sequence.
- **Multi-token logits**: If the return_all_heads flag is set, the model returns the logits for multiple future tokens, with shape (batch_size, seq_len, n_future_tokens, vocab_size).

Capabilities

The multi-token-prediction model is designed to generate code-like text more efficiently than a standard single-token prediction model. By predicting multiple tokens at once, it can produce longer stretches of coherent code-like output with fewer model evaluations, which is useful for applications that generate code snippets or other structured text.

What can I use it for?

The multi-token-prediction model could be used for a variety of applications that involve generating code-like text, such as:

- **Automated code completion**: Suggesting or generating the next few tokens in a code snippet, helping programmers write code more quickly.
- **Code generation**: Generating entire functions, classes, or even full programs based on a high-level prompt.
- **Text summarization**: Leveraging multi-token prediction for efficient summarization, particularly of technical or code-heavy documents.

Things to try

One interesting aspect of the multi-token-prediction model is its ability to return the logits for multiple future tokens, which is useful for exploring the model's understanding of code structure and semantics. For example, you could (see the sketch below):

- Provide a partial code snippet as a prompt and see how the model's predictions for the next few tokens evolve.
- Experiment with different values for the n_future_tokens parameter to see how the model's uncertainty and confidence change as it looks further into the future.
- Analyze the patterns in the model's multi-token predictions to gain insight into its understanding of common code structures and idioms.

Overall, the multi-token-prediction model provides an interesting approach to language modeling that could have applications in a variety of code-related tasks.
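The model's loading code lives in Facebook's research repository, so as a stand-in the sketch below only shows how logits with the multi-head shape described above could be turned into a greedy multi-token proposal. The tensor shapes follow the description; everything else (random logits, vocabulary size) is purely illustrative and does not load the actual checkpoint.

```python
# Illustrative only: given multi-token logits shaped
# (batch_size, seq_len, n_future_tokens, vocab_size), propose the greedy
# future tokens after the last input position. No real checkpoint is loaded.
import torch

def greedy_future_tokens(logits: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, n_future_tokens, vocab) -> (batch, n_future_tokens)."""
    last_position = logits[:, -1, :, :]   # logits at the final input position
    return last_position.argmax(dim=-1)   # greedy pick for each future head

# Toy example with random logits standing in for a real forward pass.
batch, seq_len, n_future, vocab = 2, 10, 4, 32000
fake_logits = torch.randn(batch, seq_len, n_future, vocab)
proposed = greedy_future_tokens(fake_logits)
print(proposed.shape)  # torch.Size([2, 4])
```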


Updated 6/20/2024

⛏️

Nemotron-4-340B-Reward

nvidia

Total Score

76

The Nemotron-4-340B-Reward is a multi-dimensional reward model developed by NVIDIA. It is based on the larger Nemotron-4-340B-Base model, a 340 billion parameter language model trained on a diverse corpus of English and multilingual text as well as code. The reward model takes a conversation between a user and an assistant and rates the assistant's responses across five attributes: helpfulness, correctness, coherence, complexity, and verbosity, outputting a scalar value for each. This provides a nuanced evaluation of response quality. The model can be used as part of a synthetic data generation pipeline to create training data for other language models, or as a standalone reward model for reinforcement learning from AI feedback. It is compatible with the NVIDIA NeMo Framework, which provides tools for customizing and deploying large language models. Similar models in the Nemotron family include the Nemotron-4-340B-Base and Nemotron-3-8B-Base-4k, large language models that can serve as foundations for building custom AI applications.

Model Inputs and Outputs

Inputs

- A conversation with multiple turns between a user and an assistant.

Outputs

A scalar value (typically between 0 and 4) for each of the following attributes:

- **Helpfulness**: Overall helpfulness of the assistant's response to the prompt.
- **Correctness**: Inclusion of all pertinent facts without errors.
- **Coherence**: Consistency and clarity of expression.
- **Complexity**: Intellectual depth required to write the response.
- **Verbosity**: Amount of detail included in the response, relative to what is asked for in the prompt.

Capabilities

The Nemotron-4-340B-Reward model evaluates the quality of assistant responses in a nuanced way, providing insight into several aspects of each response. This is useful for building AI systems that provide helpful and coherent responses, as well as for generating high-quality synthetic training data for other language models.

What Can I Use It For?

The Nemotron-4-340B-Reward model can be used in a variety of applications that require evaluating the quality of language model outputs. Some potential use cases include:

- **Synthetic data generation**: As part of a pipeline that generates training data for other language models, providing a reward signal to guide or filter the generation process.
- **Reinforcement learning from AI feedback (RLAIF)**: As the reward model when fine-tuning a language model to optimize for the target attributes (helpfulness, correctness, and so on).
- **Reward-model-as-a-judge**: To evaluate the outputs of other language models, providing a more nuanced assessment than a simple binary pass/fail.

Things to Try

One interesting aspect of the Nemotron-4-340B-Reward model is its multi-dimensional evaluation of language model outputs, which can reveal the strengths and weaknesses of different models and identify areas for improvement. For example, you could score the responses of several models on a common set of prompts and compare the attribute scores: a model might produce coherent, helpful responses but struggle with factual correctness, pointing you toward improving its knowledge base or fact-checking capabilities. Additionally, you could use the reward model's output as the reward signal in a reinforcement learning pipeline to fine-tune a language model, potentially producing models better aligned with human preferences and priorities as defined by the reward attributes.
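How the reward model is served (NeMo, a hosted endpoint) varies by deployment, so the sketch below skips the scoring call itself and only illustrates how the five attribute scores described above might gate examples in a synthetic-data pipeline. The threshold values and example scores are illustrative assumptions based on the roughly 0-4 score range.

```python
# Sketch of using the five attribute scores to filter synthetic training data.
# Scores and thresholds are illustrative; wire this up to however the
# Nemotron-4-340B-Reward model is actually deployed in your pipeline.
from typing import Dict

def keep_example(scores: Dict[str, float]) -> bool:
    # Keep only samples the reward model rates as helpful and correct,
    # without demanding high complexity or verbosity.
    return scores["helpfulness"] >= 3.0 and scores["correctness"] >= 3.0

# Example attribute scores for one candidate (user, assistant) exchange.
scores = {
    "helpfulness": 3.6,
    "correctness": 3.9,
    "coherence": 3.8,
    "complexity": 1.2,
    "verbosity": 1.5,
}
print(keep_example(scores))  # True -> keep this sample in the training set
```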


Updated 6/20/2024

🌐

littletinies

alvdansen

Total Score

75

littletinies is a model developed by alvdansen that generates hand-drawn, cartoon-style images based on text prompts. It produces whimsical, stylized illustrations of people, animals, and nature scenes, and excels at a classic, vintage feel with its distinctive visual style. Compared to similar models like BandW-Manga, which creates bold, black-and-white manga-inspired art, littletinies has a softer, more delicate aesthetic: the images it generates have a hand-drawn quality with fluid brushstrokes and a muted color palette. The model can depict a wide range of subjects, from a girl wandering through a forest to a toad or an artist sketching. While the results are imaginative and charming, the model may struggle with highly realistic representations, especially of human faces and complex scenes.

Model inputs and outputs

Inputs

- **Text prompt**: A short description of the desired image, such as "a girl wandering through the forest" or "a tiny witch child".

Outputs

- **Generated image**: A unique, hand-drawn illustration based on the input prompt, in the distinctive littletinies visual style.

Capabilities

The littletinies model excels at generating whimsical, stylized illustrations in a classic cartoon aesthetic. It can depict a wide variety of subjects, from people and animals to fantastical scenes, with a delicate, hand-drawn quality. The model's strength lies in its ability to capture a sense of imagination and wonder through its illustrations.

What can I use it for?

The littletinies model could be used for a variety of creative and artistic applications. Its charming, vintage-inspired style makes it well suited to illustrating children's books, fairy tales, or fantasy stories, and it can produce unique artwork, concept designs, or visual assets for games, animations, and other multimedia projects. Beyond creative uses, littletinies could be used in educational settings to spark imagination and inspire students' own artistic expression, or in therapeutic and reflective applications such as art therapy.

Things to try

One interesting aspect of the littletinies model is its ability to capture a sense of whimsy and wonder. Try prompts that evoke magic, mystery, or imagination, such as "a fairy in a meadow" or "a wizard's study," and see how the model interprets these fantastical concepts and the visual worlds it creates. Another intriguing direction is stylistic adaptation: while the littletinies style is already distinctive, try combining it with other visual styles or artistic influences, such as impressionism, expressionism, or even anime, and observe how the model blends different aesthetics into new and unexpected creations.


Updated 6/20/2024
