MosaicML

Rank:

Average Model Cost: $0.0000

Number of Runs: 3,742,411

Models by this creator

mpt-7b-instruct

The mpt-7b-instruct model is a text generation model from MosaicML, fine-tuned from MPT-7B for short-form instruction following. Given an instruction prompt, it produces clear, concise, and complete responses, which makes it a useful tool for developers, engineers, and other technical professionals who want to automate instruction-style tasks. A minimal usage sketch follows this entry.

$-/run

3.4M

Huggingface
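The listing above does not include usage code, so here is a minimal, hedged sketch of querying mpt-7b-instruct with the Hugging Face transformers library. The dolly-style prompt template and the generation settings are illustrative assumptions, not taken from the listing; check the model card for the exact format.

```python
import torch
import transformers

MODEL = "mosaicml/mpt-7b-instruct"

# MPT repos ship custom modeling code, so trust_remote_code=True is required.
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL)
model = transformers.AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,  # reduce memory; use float32 on hardware without bf16
    trust_remote_code=True,
)

# Assumed dolly-style prompt template; verify against the model card.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nExplain what a Dockerfile is in two sentences.\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Because the MPT repositories ship their own modeling code, loading without trust_remote_code=True will fail; everything else is standard transformers usage.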

mpt-30b-instruct

The mpt-30b-instruct model is a text generation model from MosaicML, fine-tuned from MPT-30B for short-form instruction following. Trained on a large body of instructional text, it can provide step-by-step guidance and explanations for a wide range of tasks. It is useful for developers and researchers working on natural language processing tasks, as well as for applications that need to respond to user instructions. A sketch of loading this larger checkpoint follows this entry.

$-/run

37.5K

Huggingface
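As noted in the description above, loading the 30B checkpoint mostly differs from the 7B case in memory handling. The sketch below is an assumption, not an official recipe: it uses bfloat16 weights and accelerate's device_map="auto" to spread the model across available devices, and the short prompt template is likewise illustrative only.

```python
import torch
import transformers

MODEL = "mosaicml/mpt-30b-instruct"

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL)
model = transformers.AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,   # halves memory vs. float32
    device_map="auto",            # requires the accelerate package
    trust_remote_code=True,       # MPT ships custom modeling code
)

# Assumed instruction/response prompt format; verify against the model card.
prompt = "### Instruction:\nSummarize what ALiBi does in one sentence.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```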

mpt-1b-redpajama-200b-dolly

The mpt-1b-redpajama-200b-dolly model is a roughly 1-billion-parameter, GPT-style decoder-only transformer pre-trained on 200 billion tokens of the RedPajama dataset and then fine-tuned on the Databricks Dolly instruction dataset. It generates detailed, coherent, and contextually relevant text in response to given prompts, and can be useful in a variety of natural language processing (NLP) applications such as chatbots, question answering systems, and content generation.

$-/run

8.3K

Huggingface

mosaic-bert-base-seqlen-1024

MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining. MosaicBERT trains faster and achieves higher pretraining and finetuning accuracy when benchmarked against Hugging Face's bert-base-uncased, incorporating efficiency insights from the past half-decade of transformer research, from RoBERTa to T5 and GPT.

This checkpoint was trained with ALiBi on a sequence length of 1024 tokens. ALiBi allows a model trained with a sequence length n to extrapolate to sequence lengths greater than 2n during finetuning; for details, see Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (Press et al. 2022). It is part of a family of MosaicBERT-Base models trained with ALiBi on different sequence lengths. The primary use case of these models is research on efficient pretraining and finetuning for long-context embeddings. Model date: April 2023.

The tokenizer for this model is simply the Hugging Face bert-base-uncased tokenizer. To use the model directly for masked language modeling, use the fill-mask pipeline (a hedged sketch follows this entry). To continue MLM pretraining, follow the MLM pre-training section of the mosaicml/examples/bert repo; to fine-tune the model for classification, follow the Single-task fine-tuning section of the same repo.

This model requires that trust_remote_code=True be passed to the from_pretrained method, because it is trained with FlashAttention (Dao et al. 2022), which is not part of the transformers library and depends on Triton and some custom PyTorch code. Since this involves executing arbitrary code, you should consider passing a git revision argument that pins the exact commit of that code. If the model or code is later updated while you have pinned a revision, you will need to check for the changes yourself and update the commit hash accordingly.

To build MosaicBERT, MosaicML adopted architectural choices from the recent transformer literature, including FlashAttention (Dao et al. 2022), ALiBi (Press et al. 2021), and Gated Linear Units (Shazeer 2020). In addition, padding is removed inside the transformer block and LayerNorm is applied in low precision.

MosaicBERT is pretrained using a standard masked language modeling (MLM) objective: the model is given a sequence of text with some tokens hidden, and it has to predict the masked tokens. It is trained on the English "Colossal, Cleaned, Common Crawl" (C4) dataset, which contains roughly 365 million curated text documents scraped from the internet (equivalent to 156 billion tokens), used in place of traditional BERT pretraining corpora such as English Wikipedia and BooksCorpus. Many of the pretraining optimizations were informed by MosaicML's BERT results for the MLPerf v2.1 speed benchmark.

This model is intended to be finetuned on downstream tasks. Please cite it using the format given on the model card.

$-/run

3.0K

Huggingface
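The description above mentions using the fill-mask pipeline and pinning a git revision when passing trust_remote_code=True, but the original code snippets are not included in this listing. The following is a hedged sketch of both patterns; the commented-out revision line is a placeholder for a commit hash you would look up and audit yourself.

```python
import transformers

MODEL = "mosaicml/mosaic-bert-base-seqlen-1024"

# The tokenizer is simply the standard bert-base-uncased tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")

model = transformers.AutoModelForMaskedLM.from_pretrained(
    MODEL,
    trust_remote_code=True,   # executes custom FlashAttention/Triton code from the repo
    # revision="...",         # optionally pin the exact commit you have audited
)

# Masked language modeling via the fill-mask pipeline.
fill_mask = transformers.pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer,
)

for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], candidate["score"])
```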
