tiiuae (Technology Innovation Institute)

Rank:

Average Model Cost: $0.0000

Number of Runs: 1,386,559

Models by this creator

falcon-7b

tiiuae

Falcon-7B is a 7-billion-parameter causal decoder-only language model trained on 1,500 billion tokens of RefinedWeb data enhanced with curated corpora. At release it outperformed comparable open-source models, and its architecture is optimized for inference. It is available under the Apache 2.0 license and is suited to research and to further specialization through fine-tuning. Note that Falcon-7B was trained on English and French data only and may carry the biases and stereotypes commonly found online; fine-tuning for the target task and appropriate precautions are recommended for any production use. The model was trained on a distributed training infrastructure and can be used with PyTorch 2.0 and the transformers library, as sketched below.
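
Since the description points to PyTorch 2.0 and the transformers library as the entry point, here is a minimal usage sketch. The prompt, dtype, and device placement are illustrative assumptions, not part of the listing.

```python
import torch
from transformers import pipeline

# Load falcon-7b through the text-generation pipeline.
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b",
    torch_dtype=torch.bfloat16,  # ~14 GB of weights instead of ~28 GB in float32
    device_map="auto",           # let accelerate place layers on available devices
)

result = generator("The Technology Innovation Institute is", max_new_tokens=50)
print(result[0]["generated_text"])
```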

$-/run · 409.4K runs · Hugging Face

falcon-7b-instruct

Falcon-7B-Instruct is a 7B-parameter causal decoder-only model fine-tuned on a mixture of chat and instruct datasets. It is based on Falcon-7B, a high-performing model trained on a large corpus, and its architecture is optimized for inference with FlashAttention and multi-query attention. Falcon-7B-Instruct is available under the Apache 2.0 license and is recommended for direct use in chat and instruct applications. It has limitations: it was trained mostly on English data and may carry biases commonly found online, so users should take appropriate precautions for production use. The model was trained on AWS SageMaker using distributed training methods; the model card provides further details on the architecture, training data, technical specifications, citation, and contact information.
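
For the chat/instruct use the description recommends, a plausible invocation looks like the sketch below. The sampling settings (top_k, temperature) are illustrative choices, not the card's exact recipe.

```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

out = generator(
    "Write a two-sentence summary of what a causal decoder-only model is.",
    max_new_tokens=200,
    do_sample=True,   # sampling rather than greedy decoding suits chat-style replies
    top_k=10,
    temperature=0.7,
)
print(out[0]["generated_text"])
```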

$-/run · 401.1K runs · Hugging Face

falcon-40b-instruct

Falcon-40B-Instruct is a 40B-parameter causal decoder-only model based on Falcon-40B and fine-tuned on a mixture of chat data built around the Baize dataset. It is intended for chat/instruct tasks, with an architecture optimized for inference. The model was trained on English and French data and is available under the Apache 2.0 license. It is recommended for direct use but may carry biases and limitations, so users should take appropriate precautions for any production use. The model was trained on AWS SageMaker using a custom distributed training codebase and requires at least 85-100 GB of memory for inference (see the sketch below). Paper and evaluation details are coming soon.
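
Given the 85-100 GB memory requirement, a single GPU usually will not suffice; one common approach is sharding bfloat16 weights across devices. The device indices and memory caps below are hypothetical budgets for the example, not values from the listing.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~2 bytes per parameter, ~75 GB of weights
    device_map="auto",           # shard layers across the devices listed below
    max_memory={0: "40GiB", 1: "40GiB", "cpu": "96GiB"},  # hypothetical budget
)

inputs = tokenizer("Summarize the Apache 2.0 license in one sentence.", return_tensors="pt")
output = model.generate(**inputs.to(model.device), max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```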

$-/run · 288.5K runs · Hugging Face

falcon-40b

Falcon-40B is a large language model developed by the Technology Innovation Institute (TII). It has 40 billion parameters and was trained on one trillion tokens of data. Falcon-40B is a causal decoder-only model optimized for inference, with an architecture that uses FlashAttention and multi-query attention. It is available under the Apache 2.0 license and is suited to research on large language models and to further specialization and fine-tuning for specific use cases. Note that Falcon-40B was trained primarily on English, German, Spanish, and French, with limited capabilities in other languages, and it may carry biases and stereotypes commonly found online. TII is calling for proposals from users worldwide to submit their ideas for Falcon-40B's deployment. Running Falcon-40B requires at least 85-100 GB of memory, as the estimate below illustrates.
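
As a sanity check on the 85-100 GB figure, a back-of-envelope estimate (assuming bfloat16 weights and a rough 20% margin for the KV cache and activations; both assumptions are illustrative, not from the model card) lands in the same range:

```python
# Rough memory arithmetic for a 40B-parameter model at inference time.
params = 40e9                        # 40 billion parameters
weights_gb = params * 2 / 1024**3    # 2 bytes per bfloat16 parameter

print(f"weights: ~{weights_gb:.0f} GB")                   # ~75 GB
print(f"with ~20% overhead: ~{weights_gb * 1.2:.0f} GB")  # ~89 GB
```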

$-/run · 266.4K runs · Hugging Face
