AI21 Labs

Models by this creator


Jamba-v0.1

ai21labs

Total Score: 1.1K

Jamba-v0.1 is a state-of-the-art, hybrid SSM-Transformer large language model (LLM) developed by AI21 Labs. It delivers throughput gains over traditional Transformer-based models while outperforming or matching the leading models in its size class on most common benchmarks. Jamba is the first production-scale Mamba implementation, which opens up interesting research and application opportunities. Similar models like mamba-2.8b-instruct-openhermes, mamba-2.8b-hf, and mamba-2.8b-slimpj also use the Mamba architecture, with varying parameter sizes and training datasets.

Model Inputs and Outputs

Jamba-v0.1 is a pretrained, mixture-of-experts (MoE) generative text model. It supports a 256K-token context length and can fit up to 140K tokens on a single 80GB GPU.

Inputs

- Text prompts of up to 256K tokens

Outputs

- A continuation of the input text, generating new tokens based on the provided context

Capabilities

Jamba-v0.1 is a powerful language model suited to a wide variety of text-generation tasks. It has demonstrated strong performance on common benchmarks, outperforming or matching leading models of similar size, and its hybrid SSM-Transformer architecture allows for higher throughput than traditional Transformer-based models.

What Can I Use It For?

The capabilities of Jamba-v0.1 make it a versatile model for many text-to-text tasks, such as:

- Content Generation: Write articles, stories, scripts, and other long-form text with high quality and coherence.
- Dialogue Systems: Build chatbots and virtual assistants that can engage in natural, contextual conversations.
- Question Answering: Answer questions on a wide range of topics by leveraging the model's broad knowledge base.
- Summarization: Condense long passages of text into concise, informative summaries.

Given its strong performance, Jamba-v0.1 can be a valuable tool for businesses, researchers, and developers looking to push the boundaries of what's possible with large language models.

Things to Try

One interesting aspect of Jamba-v0.1 is its hybrid SSM-Transformer architecture, which combines the strengths of structured state space models and traditional Transformers. Exploring how this architectural choice affects the model's performance, especially on tasks that require long-range dependencies or efficient processing, could yield valuable insights.

Additionally, the Mamba implementation used in Jamba-v0.1 opens up new research opportunities. Investigating how this subquadratic model compares to other state-of-the-art language models, both in raw performance and computational efficiency, could help advance the field of large language models.
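As a minimal sketch of how the base model might be used for plain text continuation, the snippet below loads it through the Hugging Face transformers library. It assumes a recent transformers release with Jamba support, the ai21labs/Jamba-v0.1 checkpoint, and enough GPU memory for the weights; the prompt string and generation settings are arbitrary examples, not recommendations from AI21 Labs.

```python
# Minimal text-continuation sketch, assuming a transformers version with
# Jamba support and sufficient GPU memory for the checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"  # Hugging Face checkpoint (assumed available)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory use
    device_map="auto",           # spread layers across available devices
)

prompt = "In the recent Super Bowl LVIII,"  # example prompt only
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Generate a continuation of the prompt; Jamba-v0.1 is a base model,
# so it continues text rather than following chat-style instructions.
output_ids = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The device_map="auto" and bfloat16 settings are only one possible configuration; working anywhere near the full 256K context would generally call for more careful memory planning (quantization or multiple GPUs) than this sketch shows.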


Updated 5/17/2024