`Stable LM 2 12B`
=================

Model Description
-----------------

`Stable LM 2 12B` is a 12.1 billion parameter decoder-only language model pre-trained on 2 trillion tokens of diverse multilingual and code datasets for two epochs.

Note: For commercial use, please refer to [https://stability.ai/membership](https://stability.ai/membership).

Usage
-----

**NOTE**: This model requires `transformers>=4.40.0`
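
The version requirement can be checked before loading the model. The following is a minimal sketch using only the standard library; `meets_minimum` is a hypothetical helper introduced here, not part of `transformers`. It compares only the leading numeric components of the version string, so a pre-release such as `4.40.0.dev0` is treated as `4.40.0`.

```python
import re
from importlib import metadata

MINIMUM = "4.40.0"  # minimum transformers version stated in the note above


def _numeric_parts(version: str) -> tuple:
    """Extract the leading numeric components, e.g. '4.41.0.dev0' -> (4, 41, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str = MINIMUM) -> bool:
    """Hypothetical helper: True if `installed` is at least `minimum`."""
    a, b = _numeric_parts(installed), _numeric_parts(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that '4.40' compares equal to '4.40.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else f"older than {MINIMUM}"
        print(f"transformers {installed}: {status}")
```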

Get started generating text with `Stable LM 2 12B` by using the following code snippet:

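    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the tokenizer and the model weights from the Hugging Face Hub.
    tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-2-12b")
    model = AutoModelForCausalLM.from_pretrained(
      "stabilityai/stablelm-2-12b",
      torch_dtype="auto",  # use the dtype specified by the checkpoint
    )
    model.cuda()  # move the model to the GPU

    # Tokenize the prompt and place the tensors on the model's device.
    inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to(model.device)

    # Sample up to 64 new tokens with temperature and nucleus (top-p) sampling.
    tokens = model.generate(
      **inputs,
      max_new_tokens=64,
      temperature=0.70,
      top_p=0.95,
      do_sample=True,
    )
    print(tokenizer.decode(tokens[0], skip_special_tokens=True))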

### Run with Flash Attention 2

    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-2-12b")
    model = AutoModelForCausalLM.from_pretrained(
      "stabilityai/stablelm-2-12b",
      torch_dtype="auto",
      attn_implementation="flash_attention_2",
    )
    model.cuda()
    inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to(model.device)
    tokens = model.generate(
      **inputs,
      max_new_tokens=64,
      temperature=0.70,
      top_p=0.95,
      do_sample=True,
    )
    print(tokenizer.decode(tokens[0], skip_special_tokens=True))
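
If the `flash_attn` package may not be installed on every machine, the attention backend can be chosen at load time. The sketch below is one possible approach, not part of the original instructions: `pick_attn_implementation` and `load_model` are hypothetical helpers, and the fallback value `"eager"` selects the plain PyTorch attention path in `transformers`. An importable package does not guarantee a GPU that Flash Attention 2 supports, so treat this as a starting point.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Hypothetical helper: prefer Flash Attention 2 when `flash_attn` is importable."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"  # plain PyTorch attention, always available in transformers


def load_model(model_id: str = "stabilityai/stablelm-2-12b"):
    """Load the model with the selected attention backend (downloads the weights)."""
    # Imported lazily so that the helper above has no heavy dependencies.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",
        attn_implementation=pick_attn_implementation(),
    )


if __name__ == "__main__":
    print("attention backend:", pick_attn_implementation())
```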

Model Details
-------------

*   **Developed by**: [Stability AI](https://stability.ai/)
*   **Model type**: `Stable LM 2 12B` models are auto-regressive language models based on the transformer decoder architecture.
*   **Language(s)**: English
*   **Paper**: [Stable LM 2 Technical Report](https://arxiv.org/abs/2402.17834)
*   **Library**: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
*   **License**: [Stability AI Non-Commercial Research Community License](https://huggingface.co/stabilityai/stablelm-2-12b/blob/main/LICENSE).
*   **Commercial License**: to use this model commercially, please refer to [https://stability.ai/membership](https://stability.ai/membership)
*   **Contact**: For questions and comments about the model, please email `lm@stability.ai`

### Model Architecture

The model is a decoder-only transformer with the following architecture:

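| Parameters     | Hidden Size | Layers | Heads | KV Heads | Sequence Length |
|----------------|-------------|--------|-------|----------|-----------------|
| 12,143,605,760 | 5120        | 40     | 32    | 8        | 4096            |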

*   **Position Embeddings**: Rotary Position Embeddings ([Su et al., 2021](https://arxiv.org/abs/2104.09864)) applied to the first 25% of head embedding dimensions for improved throughput following [Black et al. (2022)](https://arxiv.org/pdf/2204.06745.pdf).
*   **Parallel Layers**: Parallel attention and feed-forward residual layers with a single input LayerNorm ([Wang, 2021](https://github.com/kingoflolz/mesh-transformer-jax)).
*   **Normalization**: LayerNorm ([Ba et al., 2016](https://arxiv.org/abs/1607.06450)) without biases. Furthermore, we apply per-head QK normalization ([Dehghani et al., 2023](https://arxiv.org/abs/2302.05442), [Wortsman et al., 2023](https://arxiv.org/abs/2309.14322)).
*   **Biases**: We remove all bias terms from the feed-forward networks and grouped-query self-attention layers.
*   **Tokenizer**: We use Arcade100k, a BPE tokenizer extended from OpenAI's [`tiktoken.cl100k_base`](https://github.com/openai/tiktoken). We split digits into individual tokens following findings by [Liu & Low (2023)](https://arxiv.org/abs/2305.14201).
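
The figures above imply a few derived quantities, worked out in the sketch below. It assumes that the per-head dimension equals hidden size divided by the number of heads and that the key/value cache is stored in `bfloat16`; neither is stated explicitly in this card. The comparison with 32 KV heads is a hypothetical baseline that shows what grouped-query attention saves.

```python
# Values taken from the architecture table and bullet list above.
HIDDEN_SIZE = 5120
NUM_LAYERS = 40
NUM_HEADS = 32
NUM_KV_HEADS = 8
SEQ_LEN = 4096
ROTARY_FRACTION = 0.25
BYTES_PER_VALUE = 2  # assumption: bfloat16 cache entries

head_dim = HIDDEN_SIZE // NUM_HEADS  # assumption: head_dim = hidden size / heads
rotary_dims = int(head_dim * ROTARY_FRACTION)  # dimensions per head that receive RoPE
queries_per_kv_head = NUM_HEADS // NUM_KV_HEADS  # grouped-query attention group size


def kv_cache_bytes(num_kv_heads: int, seq_len: int = SEQ_LEN) -> int:
    """Bytes needed to cache keys and values for one sequence across all layers."""
    keys_and_values = 2
    return keys_and_values * NUM_LAYERS * num_kv_heads * head_dim * seq_len * BYTES_PER_VALUE


gqa_bytes = kv_cache_bytes(NUM_KV_HEADS)
mha_bytes = kv_cache_bytes(NUM_HEADS)  # hypothetical: one KV head per query head

print(f"head_dim={head_dim}, rotary_dims={rotary_dims}, queries per KV head={queries_per_kv_head}")
print(f"KV cache at {SEQ_LEN} tokens: {gqa_bytes / 2**20:.0f} MiB (8 KV heads) "
      f"vs {mha_bytes / 2**20:.0f} MiB (32 KV heads)")
```

Under these assumptions each head has 160 dimensions, of which 40 are rotated, and each KV head serves 4 query heads, which cuts the cache for a full 4096-token sequence from 3200 MiB to 800 MiB.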

Training
--------

### Training Dataset

The dataset comprises a filtered mixture of open-source large-scale datasets available on the [HuggingFace Hub](https://huggingface.co/datasets): Falcon RefinedWeb extract ([Penedo et al., 2023](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)), RedPajama-Data ([Together Computer, 2023](https://github.com/togethercomputer/RedPajama-Data)) and The Pile ([Gao et al., 2020](https://arxiv.org/abs/2101.00027)), both without the _Books3_ subset, and StarCoder ([Li et al., 2023](https://arxiv.org/abs/2305.06161)). We further supplement our training with multilingual data from CulturaX ([Nguyen et al., 2023](https://arxiv.org/abs/2309.09400)) and, in particular, from its OSCAR corpora, as well as restructured data in the style of [Yuan & Liu (2022)](https://arxiv.org/abs/2206.11147).

*   Given the large amount of web data, we recommend fine-tuning the base `Stable LM 2 12B` for your downstream tasks.

### Training Procedure

The model is pre-trained on the aforementioned datasets in `bfloat16` precision, optimized with AdamW, and trained using the Arcade100k tokenizer with a vocabulary size of 100,352. We outline the complete hyperparameter choices in the project's [GitHub repository - config\*](https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-2-12b.yml).
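
As a rough guide to what `bfloat16` storage means in practice, the sketch below estimates the memory taken by the weights alone, using the parameter count from the architecture table. It deliberately ignores activations, the key/value cache, and optimizer state, so actual memory use during inference or fine-tuning will be higher.

```python
NUM_PARAMETERS = 12_143_605_760  # from the architecture table
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_bytes(dtype: str) -> int:
    """Bytes needed to store every parameter once in the given dtype."""
    return NUM_PARAMETERS * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_bytes(dtype) / 2**30:.1f} GiB")
```

In `bfloat16` the weights occupy about 22.6 GiB, half of what `float32` would need.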

### Training Infrastructure

*   **Hardware**: `Stable LM 2 12B` was trained on the Stability AI cluster across 384 NVIDIA H100 GPUs (AWS P5 instances).
*   **Software**: We use a fork of `gpt-neox` ([EleutherAI, 2021](https://github.com/EleutherAI/gpt-neox)), train under 2D parallelism (Data and Tensor Parallel) with ZeRO-1 ([Rajbhandari et al., 2019](https://arxiv.org/abs/1910.02054v3)), and rely on flash-attention as well as SwiGLU and Rotary Embedding kernels from FlashAttention-2 ([Dao et al., 2023](https://tridao.me/publications/flash2/flash2.pdf)).

Use and Limitations
-------------------

### Intended Use

The model is intended to be used as a foundational base model for application-specific fine-tuning. Developers must evaluate and fine-tune the model for safe performance in downstream applications. For commercial use, please refer to [https://stability.ai/membership](https://stability.ai/membership).

### Limitations and Bias

As a base model, this model may exhibit unreliable, unsafe, or other undesirable behaviors that must be corrected through evaluation and fine-tuning prior to deployment. The pre-training dataset may have contained offensive or inappropriate content, even after applying data cleansing filters, which can be reflected in the model-generated text. We recommend that users exercise caution when using this model in production systems. Do not use the model if it is unsuitable for your application, or for any application that may cause deliberate or unintentional harm to others.

How to Cite
-----------

    @article{bellagente2024stable,
      title={Stable LM 2 1.6B Technical Report},
      author={Bellagente, Marco and Tow, Jonathan and Mahan, Dakota and Phung, Duy and Zhuravinskyi, Maksym and Adithyan, Reshinth and Baicoianu, James and Brooks, Ben and Cooper, Nathan and Datta, Ashish and others},
      journal={arXiv preprint arXiv:2402.17834},
      year={2024}
    }

## Model overview

`Stable LM 2 12B` is a 12.1 billion parameter decoder-only language model developed by [Stability AI](https://stability.ai/). It was pre-trained on 2 trillion tokens of diverse multilingual and code datasets for two epochs. The model is part of the Stable LM 2 series, which also includes the [Stable LM 2 1.6B](https://aimodels.fyi/models/huggingFace/stablelm-2-16b-stabilityai) and [Stable Code 3B](https://aimodels.fyi/models/huggingFace/stable-code-3b-stabilityai) models. Compared to the smaller 1.6B version, the 12B model has significantly more parameters and demonstrates improved performance on various benchmarks.

## Model inputs and outputs

The `Stable LM 2 12B` model is a text generation model that takes natural language prompts as input and generates coherent, contextual text output. The model can be used for a variety of natural language tasks, such as summarization, translation, and open-ended generation.

### Inputs
- Natural language prompts in various languages, with a focus on English

### Outputs
- Coherent, context-aware text generated in response to the input prompts
- The model can generate text of varying lengths, from short phrases to multi-paragraph passages

## Capabilities

The `Stable LM 2 12B` model demonstrates strong performance on a range of natural language tasks, including open-ended generation, summarization, and translation. It can be used to generate human-like text on a variety of topics, from creative writing to technical documentation. The model's large size and diverse training data allow it to capture a wide range of linguistic patterns and knowledge.

## What can I use it for?

`Stable LM 2 12B` can be a powerful tool for developers and researchers working on natural language processing applications. Some potential use cases include:

- Content generation: The model can be used to generate original text for applications like creative writing, article generation, and chatbots.
- Summarization: The model can be fine-tuned to summarize longer passages of text, making it useful for tasks like document summarization.
- Translation: The multilingual capabilities of the model can be leveraged for machine translation between supported languages.
- Knowledge-based applications: The model's broad training data can be leveraged to build applications that require access to a wide range of information, such as question-answering systems.

However, as a large language model, `Stable LM 2 12B` may exhibit biases or generate unsafe content. Users should carefully evaluate the model's outputs and consider potential risks before deploying it in production systems.

## Things to try

Some interesting things to try with `Stable LM 2 12B` include:

- Experimenting with different prompting and generation strategies to explore the model's capabilities in areas like creative writing, task completion, and open-ended dialogue.
- Fine-tuning the model on domain-specific datasets to adapt it for specialized applications, such as technical writing or customer service chatbots.
- Combining the model with other AI components, such as vision models or recommender systems, to build more complex, multimodal applications.
- Investigating the model's reasoning and knowledge capabilities by probing it with a variety of questions and tasks.
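
When experimenting with generation strategies, it helps to see what the `temperature` and `top_p` settings from the Usage snippet do to a next-token distribution. The following is a simplified, pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is an illustration under stated assumptions, not the exact code path inside `transformers`, and `top_p_filter` is a hypothetical name.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.95):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize over the surviving tokens so the result is a distribution again.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -5.0]  # made-up scores for a 4-token vocabulary
    for temperature in (0.7, 1.0, 1.5):
        kept = top_p_filter(toy_logits, temperature=temperature, top_p=0.9)
        print(temperature, {i: round(p, 3) for i, p in kept.items()})
```

Lower temperatures concentrate probability on the top tokens, so fewer candidates survive the top-p cut and the output becomes more deterministic.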

As with any powerful AI system, it's important to use `Stable LM 2 12B` responsibly and with appropriate safeguards in place. Continuous evaluation and refinement will be crucial to ensuring the model's outputs are safe, ethical, and aligned with user needs.