[](#-falcon-40b-instruct) Falcon-40B-Instruct
===============================================

**Falcon-40B-Instruct is a 40B-parameter causal decoder-only model built by [TII](https://www.tii.ae) based on [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b) and finetuned on a mixture of [Baize](https://github.com/project-baize/baize-chatbot). It is made available under the Apache 2.0 license.**

_Paper coming soon._

To get started with Falcon (inference, finetuning, quantization, etc.), we recommend reading [this great blogpost from HF](https://huggingface.co/blog/falcon)!

[](#why-use-falcon-40b-instruct)Why use Falcon-40B-Instruct?
------------------------------------------------------------

*   **You are looking for a ready-to-use chat/instruct model based on [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b).**
*   **Falcon-40B is the best open-source model available.** It outperforms [LLaMA](https://github.com/facebookresearch/llama), [StableLM](https://github.com/Stability-AI/StableLM), [RedPajama](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1), [MPT](https://huggingface.co/mosaicml/mpt-7b), etc. See the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
*   **It features an architecture optimized for inference**, with FlashAttention ([Dao et al., 2022](https://arxiv.org/abs/2205.14135)) and multiquery ([Shazeer et al., 2019](https://arxiv.org/abs/1911.02150)).

 **This is an instruct model, which may not be ideal for further finetuning.** If you are interested in building your own instruct/chat model, we recommend starting from [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b).

 **Looking for a smaller, less expensive model?** [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) is Falcon-40B-Instruct's little brother!

    import torch
    import transformers
    from transformers import AutoTokenizer
    
    model = "tiiuae/falcon-40b-instruct"
    
    tokenizer = AutoTokenizer.from_pretrained(model)
    # bfloat16 halves weight memory versus float32; device_map="auto" places layers on the available devices.
    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
    # Sample one continuation of the dialogue prompt, capped at 200 tokens (prompt included).
    sequences = pipeline(
        "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
        max_length=200,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
    

For fast inference with Falcon, check out [Text Generation Inference](https://github.com/huggingface/text-generation-inference)! Read more in this [blogpost](https://huggingface.co/blog/falcon).

You will need **at least 85-100GB of memory** to swiftly run inference with Falcon-40B.
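As a rough sanity check on that figure, the weights alone can be estimated from the parameter count and the bytes per parameter. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes the nominal 40B parameter count from the model name, and the helper `weight_memory_gb` is a hypothetical name introduced here. Activations, the KV cache, and framework overhead come on top of the weights.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: the nominal "40B" from the model name; the exact count may differ.
N_PARAMS = 40e9

for dtype, nbytes in {"float32": 4, "bfloat16": 2, "int8": 1}.items():
    print(f"{dtype:>8}: ~{weight_memory_gb(N_PARAMS, nbytes):.0f} GB for weights only")
```

In bfloat16, the dtype used in the example above, the weights account for most of the quoted range, with the remainder left for activations and cache.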

[](#model-card-for-falcon-40b-instruct)Model Card for Falcon-40B-Instruct
=========================================================================

[](#model-details)Model Details
-------------------------------

### [](#model-description)Model Description

*   **Developed by:** [https://www.tii.ae](https://www.tii.ae);
*   **Model type:** Causal decoder-only;
*   **Language(s) (NLP):** English and French;
*   **License:** Apache 2.0;
*   **Finetuned from model:** [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b).

### [](#model-source)Model Source

*   **Paper:** _coming soon_.

[](#uses)Uses
-------------

### [](#direct-use)Direct Use

Falcon-40B-Instruct has been finetuned on a chat dataset.

### [](#out-of-scope-use)Out-of-Scope Use

Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.

[](#bias-risks-and-limitations)Bias, Risks, and Limitations
-----------------------------------------------------------

Falcon-40B-Instruct is mostly trained on English data, and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.

### [](#recommendations)Recommendations

We recommend that users of Falcon-40B-Instruct develop guardrails and take appropriate precautions for any production use.

[](#how-to-get-started-with-the-model)How to Get Started with the Model
-----------------------------------------------------------------------

    import torch
    import transformers
    from transformers import AutoTokenizer
    
    model = "tiiuae/falcon-40b-instruct"
    
    tokenizer = AutoTokenizer.from_pretrained(model)
    # bfloat16 halves weight memory versus float32; device_map="auto" places layers on the available devices.
    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
    # Sample one continuation of the dialogue prompt, capped at 200 tokens (prompt included).
    sequences = pipeline(
        "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
        max_length=200,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
    

[](#training-details)Training Details
-------------------------------------

### [](#training-data)Training Data

Falcon-40B-Instruct was finetuned on 150M tokens from [Baize](https://github.com/project-baize/baize-chatbot) mixed with 5% of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) data.

The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7b)/[40B](https://huggingface.co/tiiuae/falcon-40b) tokenizer.

[](#evaluation)Evaluation
-------------------------

_Paper coming soon._

See the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) for early results.

[](#technical-specifications)Technical Specifications
-----------------------------------------------------

For more information about pretraining, see [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b).

### [](#model-architecture-and-objective)Model Architecture and Objective

Falcon-40B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).

The architecture is broadly adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), with the following differences:

*   **Positional embeddings:** rotary ([Su et al., 2021](https://arxiv.org/abs/2104.09864));
*   **Attention:** multiquery ([Shazeer et al., 2019](https://arxiv.org/abs/1911.02150)) and FlashAttention ([Dao et al., 2022](https://arxiv.org/abs/2205.14135));
*   **Decoder-block:** parallel attention/MLP with a single layer norm.

For multiquery, we are using an internal variant which uses independent key and values per tensor parallel degree.
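To make the multiquery idea concrete, here is a minimal pure-Python sketch of causal multiquery attention as described by Shazeer et al.: every query head has its own queries, but all heads share a single set of keys and values. This is the plain textbook form, not TII's internal variant with independent keys and values per tensor-parallel degree, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multiquery_attention(queries, keys, values):
    """Causal multiquery attention.

    queries: [n_heads][seq_len][head_dim], one set of queries per head
    keys, values: [seq_len][head_dim], shared by all heads
    returns: [n_heads][seq_len][head_dim]
    """
    head_dim = len(keys[0])
    scale = 1.0 / math.sqrt(head_dim)
    outputs = []
    for head_queries in queries:
        head_out = []
        for t, q in enumerate(head_queries):
            # Causal mask: position t only attends to positions 0..t.
            scores = [
                scale * sum(a * b for a, b in zip(q, keys[s]))
                for s in range(t + 1)
            ]
            weights = softmax(scores)
            head_out.append([
                sum(w * values[s][d] for s, w in enumerate(weights))
                for d in range(head_dim)
            ])
        outputs.append(head_out)
    return outputs
```

Because `keys` and `values` carry no head dimension, the cache kept during generation is smaller than in standard multi-head attention by a factor equal to the number of query heads.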

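| **Hyperparameter** | **Value** | **Comment**                            |
|--------------------|-----------|----------------------------------------|
| Layers             | 60        |                                        |
| `d_model`          | 8192      |                                        |
| `head_dim`         | 64        | Reduced to optimise for FlashAttention |
| Vocabulary         | 65024     |                                        |
| Sequence length    | 2048      |                                        |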
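These hyperparameters are enough for a rough estimate of how much the choice of attention variant affects the key/value cache at inference time. The sketch below assumes `d_model = n_query_heads * head_dim` (a common convention, not stated in this card) and bfloat16 storage. The value 8 is an illustrative number of key/value groups, not a figure taken from this card, and `kv_cache_bytes` is a hypothetical helper.

```python
LAYERS = 60
D_MODEL = 8192
HEAD_DIM = 64
SEQ_LEN = 2048
BYTES_PER_VALUE = 2  # bfloat16

# Assumption: d_model is split evenly across query heads.
n_query_heads = D_MODEL // HEAD_DIM  # 128


def kv_cache_bytes(n_kv_heads: int, seq_len: int = SEQ_LEN) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Leading factor 2: one key tensor and one value tensor per layer.
    return 2 * LAYERS * n_kv_heads * HEAD_DIM * seq_len * BYTES_PER_VALUE


# n_query_heads -> multi-head, 8 -> illustrative grouped variant, 1 -> multiquery.
for n_kv_heads in (n_query_heads, 8, 1):
    mib = kv_cache_bytes(n_kv_heads) / 2**20
    print(f"{n_kv_heads:>3} KV heads: {mib:,.0f} MiB per 2048-token sequence")
```

The cache scales linearly with the number of key/value heads, which is why sharing them across query heads reduces memory use at inference time.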

### [](#compute-infrastructure)Compute Infrastructure

#### [](#hardware)Hardware

Falcon-40B-Instruct was trained on AWS SageMaker, on 64 A100 40GB GPUs in P4d instances.

#### [](#software)Software

Falcon-40B-Instruct was trained on a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.).

[](#citation)Citation
---------------------

_Paper coming soon._ In the meantime, you can use the following information to cite:

    @article{falcon40b,
      title={{Falcon-40B}: an open large language model with state-of-the-art performance},
      author={Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Hesslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme},
      year={2023}
    }
    

To learn more about the pretraining dataset, see the [RefinedWeb paper](https://arxiv.org/abs/2306.01116).

    @article{refinedweb,
      title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
      author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
      journal={arXiv preprint arXiv:2306.01116},
      eprint={2306.01116},
      eprinttype = {arXiv},
      url={https://arxiv.org/abs/2306.01116},
      year={2023}
    }
    

To cite the [Baize](https://github.com/project-baize/baize-chatbot) instruction dataset used for this model:

    @article{xu2023baize,
      title={Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data},
      author={Xu, Canwen and Guo, Daya and Duan, Nan and McAuley, Julian},
      journal={arXiv preprint arXiv:2304.01196},
      year={2023}
    }
    

[](#license)License
-------------------

Falcon-40B-Instruct is made available under the Apache 2.0 license.

[](#contact)Contact
-------------------

[falconllm@tii.ae](mailto:falconllm@tii.ae)

## Model overview

`Falcon-40B-Instruct` is a 40-billion-parameter causal decoder-only model built by [TII](https://www.tii.ae) that has been finetuned on a mixture of [Baize](https://github.com/project-baize/baize-chatbot) to make it more suitable for taking instructions in a chat format. It builds on the base [Falcon-40B](https://aimodels.fyi/models/huggingFace/falcon-40b-tiiuae) model, which its model card describes as the best open-source model available, outperforming [LLaMA](https://github.com/facebookresearch/llama), [StableLM](https://github.com/Stability-AI/StableLM), and [MPT](https://huggingface.co/mosaicml/mpt-7b) on the OpenLLM Leaderboard.

## Model inputs and outputs

Falcon-40B-Instruct is a large language model that generates text from a provided prompt. It is autoregressive, meaning it predicts the next token in a sequence based on the tokens that precede it.

### Inputs
- **Text prompts**: The model takes natural language text prompts as input, which can range from a single sentence to multiple paragraphs.

### Outputs
- **Generated text**: The model outputs human-like text continuations based on the provided prompts. The generated text can be used for a variety of applications such as chatbots, content generation, and creative writing assistance.

## Capabilities

Falcon-40B-Instruct demonstrates strong performance on a range of language tasks, including open-ended conversation, question answering, summarization, and task completion. It can engage in contextual back-and-forth exchanges, understand nuanced language, and generate coherent and relevant responses. The model's large size and specialized finetuning allow it to draw upon a vast knowledge base to reason about complex topics and provide substantive, informative outputs.

## What can I use it for?

The Falcon-40B-Instruct model is well-suited for applications that require a capable, open-domain language model with strong instruction-following abilities. Potential use cases include:

- **Chatbots and virtual assistants**: Falcon-40B-Instruct can power conversational AI agents that can engage in natural, open-ended dialogue and assist users with a variety of tasks.
- **Content generation**: The model can be used to generate text for creative writing, article summaries, product descriptions, and other applications where high-quality, human-like text is needed.
- **Task completion**: Falcon-40B-Instruct can understand and execute a wide range of instructions, making it useful for applications that involve following complex multi-step commands.

## Things to try

One interesting aspect of Falcon-40B-Instruct is its ability to engage in extended, contextual exchanges. Try prompting the model with a series of related questions or instructions, and see how it maintains coherence and builds upon the previous context. You can also experiment with prompts that require nuanced reasoning or creativity, as the model's specialized finetuning may allow it to provide more insightful and engaging responses compared to a base language model.
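One way to experiment with multi-turn exchanges is to keep the conversation history and rebuild the prompt before each generation call. The helper below is a minimal sketch, not an official prompt template: it assumes the `Speaker: text` line format seen in the Girafatron example, and the name `build_prompt` and its parameters are hypothetical.

```python
def build_prompt(persona: str, turns: list[tuple[str, str]], bot_name: str) -> str:
    """Assemble a dialogue prompt that ends with the bot's name and a colon.

    persona: free-text description placed before the dialogue (may be empty)
    turns: (speaker, text) pairs in chronological order
    bot_name: speaker label the model is expected to continue after
    """
    lines = [persona] if persona else []
    for speaker, text in turns:
        lines.append(f"{speaker}: {text}")
    # Trailing "Name:" cues the model to answer as that speaker.
    lines.append(f"{bot_name}:")
    return "\n".join(lines)


history = [("Daniel", "Hello, Girafatron!")]
prompt = build_prompt(
    "Girafatron is obsessed with giraffes.", history, bot_name="Girafatron"
)
print(prompt)

# After each generation, append the model's reply and the next user message,
# then rebuild the prompt so the model sees the full context.
history.append(("Girafatron", "<model reply goes here>"))
history.append(("Daniel", "What do you think of zebras?"))
prompt = build_prompt(
    "Girafatron is obsessed with giraffes.", history, bot_name="Girafatron"
)
```

The whole prompt has to fit within the model's 2048-token sequence length, so long conversations need their oldest turns trimmed or summarized.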