## Model overview

`gemma-7b` is a 7B parameter version of the Gemma family of lightweight, state-of-the-art open models from Google. Gemma models are text-to-text, decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. These models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. The relatively small size of Gemma models makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure, democratizing access to state-of-the-art AI models.

The Gemma family also includes the [gemma-2b](https://aimodels.fyi/models/huggingFace/gemma-2b-google), [gemma-7b-it](https://aimodels.fyi/models/huggingFace/gemma-7b-it-google), and [gemma-2b-it](https://aimodels.fyi/models/huggingFace/gemma-2b-it-google) models, which offer different parameter sizes and instruction-tuning options.

## Model inputs and outputs

### Inputs
- **Text string**: The model takes a text string as input, such as a question, a prompt, or a document to be summarized.

### Outputs
- **Generated text**: The model generates English-language text in response to the input, such as an answer to a question or a summary of a document.

## Capabilities

The `gemma-7b` model is capable of a wide range of text generation tasks, including question answering, summarization, and reasoning. It can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts. The model can also power conversational interfaces for chatbots and virtual assistants, as well as support interactive language learning experiences.

## What can I use it for?

The `gemma-7b` model can be used for a variety of applications across different industries and domains. For example, you could use it to:

- Generate personalized content for marketing campaigns
- Build conversational AI assistants to help with customer service
- Summarize long documents or research papers
- Assist language learners by providing feedback and writing practice

The model's relatively small size and open availability make it accessible for a wide range of developers and researchers, helping to democratize access to state-of-the-art AI capabilities.

## Things to try

One interesting aspect of the `gemma-7b` model is its ability to handle long-form text generation. Unlike some language models that struggle with coherence and consistency over long sequences, the Gemma models are designed to maintain high-quality output even when generating lengthy passages of text. 

You could try using the model to generate extended narratives, such as short stories or creative writing pieces, and see how it performs in terms of maintaining a cohesive plot, character development, and logical flow. Additionally, the model's strong performance on tasks like summarization and question answering could make it a valuable tool for academic and research applications, such as helping to synthesize insights from large bodies of technical literature.

[](#model-card-for-flan-t5-xxl)Model Card for FLAN-T5 XXL
=========================================================

![drawing](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/flan2_architecture.jpg)

[](#table-of-contents)Table of Contents
=======================================

0.  [TL;DR](#TL;DR)
1.  [Model Details](#model-details)
2.  [Usage](#usage)
3.  [Uses](#uses)
4.  [Bias, Risks, and Limitations](#bias-risks-and-limitations)
5.  [Training Details](#training-details)
6.  [Evaluation](#evaluation)
7.  [Environmental Impact](#environmental-impact)
8.  [Citation](#citation)

[](#tldr)TL;DR
==============

If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks covering also more languages. As mentioned in the first few lines of the abstract :

> Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints,1 which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

**Disclaimer**: Content from **this** model card has been written by the Hugging Face team, and parts of it were copy pasted from the [T5 model card](https://huggingface.co/t5-large).

[](#model-details)Model Details
===============================

[](#model-description)Model Description
---------------------------------------

*   **Model type:** Language model
*   **Language(s) (NLP):** English, German, French
*   **License:** Apache 2.0
*   **Related Models:** [All FLAN-T5 Checkpoints](https://huggingface.co/models?search=flan-t5)
*   **Original Checkpoints:** [All Original FLAN-T5 Checkpoints](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints)
*   **Resources for more information:**
    *   [Research paper](https://arxiv.org/pdf/2210.11416.pdf)
    *   [GitHub Repo](https://github.com/google-research/t5x)
    *   [Hugging Face FLAN-T5 Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/t5)

[](#usage)Usage
===============

Find below some example scripts on how to use the model in `transformers`:

[](#using-the-pytorch-model)Using the Pytorch model
---------------------------------------------------

### [](#running-the-model-on-a-cpu)Running the model on a CPU

Click to expand

    
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl")
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

### [](#running-the-model-on-a-gpu)Running the model on a GPU

Click to expand

    # pip install accelerate
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto")
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

### [](#running-the-model-on-a-gpu-using-different-precisions)Running the model on a GPU using different precisions

#### [](#fp16)FP16

Click to expand

    # pip install accelerate
    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", torch_dtype=torch.float16)
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

#### [](#int8)INT8

Click to expand

    # pip install bitsandbytes accelerate
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", load_in_8bit=True)
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

[](#uses)Uses
=============

[](#direct-use-and-downstream-use)Direct Use and Downstream Use
---------------------------------------------------------------

The authors write in [the original paper's model card](https://arxiv.org/pdf/2210.11416.pdf) that:

> The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models

See the [research paper](https://arxiv.org/pdf/2210.11416.pdf) for further details.

[](#out-of-scope-use)Out-of-Scope Use
-------------------------------------

More information needed.

[](#bias-risks-and-limitations)Bias, Risks, and Limitations
===========================================================

The information below in this section are copied from the model's [official model card](https://arxiv.org/pdf/2210.11416.pdf):

> Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.

[](#ethical-considerations-and-risks)Ethical considerations and risks
---------------------------------------------------------------------

> Flan-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.

[](#known-limitations)Known Limitations
---------------------------------------

> Flan-T5 has not been tested in real world applications.

[](#sensitive-use)Sensitive Use:
--------------------------------

> Flan-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.

[](#training-details)Training Details
=====================================

[](#training-data)Training Data
-------------------------------

The model was trained on a mixture of tasks, that includes the tasks described in the table below (from the original paper, figure 2):

[![table.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/flan_t5_tasks.png)](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/flan_t5_tasks.png)

[](#training-procedure)Training Procedure
-----------------------------------------

According to the model card from the [original paper](https://arxiv.org/pdf/2210.11416.pdf):

> These models are based on pretrained T5 (Raffel et al., 2020) and fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned Flan model per T5 model size.

The model has been trained on TPU v3 or TPU v4 pods, using [`t5x`](https://github.com/google-research/t5x) codebase together with [`jax`](https://github.com/google/jax).

[](#evaluation)Evaluation
=========================

[](#testing-data-factors--metrics)Testing Data, Factors & Metrics
-----------------------------------------------------------------

The authors evaluated the model on various tasks covering several languages (1836 in total). See the table below for some quantitative evaluation: [![image.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/flan_t5_evals_lang.png)](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/flan_t5_evals_lang.png) For full details, please check the [research paper](https://arxiv.org/pdf/2210.11416.pdf).

[](#results)Results
-------------------

For full results for FLAN-T5-XXL, see the [research paper](https://arxiv.org/pdf/2210.11416.pdf), Table 3.

[](#environmental-impact)Environmental Impact
=============================================

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

*   **Hardware Type:** Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips  4.
*   **Hours used:** More information needed
*   **Cloud Provider:** GCP
*   **Compute Region:** More information needed
*   **Carbon Emitted:** More information needed

[](#citation)Citation
=====================

**BibTeX:**

    @misc{https://doi.org/10.48550/arxiv.2210.11416,
      doi = {10.48550/ARXIV.2210.11416},
      
      url = {https://arxiv.org/abs/2210.11416},
      
      author = {Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Eric and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and Webson, Albert and Gu, Shixiang Shane and Dai, Zhuyun and Suzgun, Mirac and Chen, Xinyun and Chowdhery, Aakanksha and Narang, Sharan and Mishra, Gaurav and Yu, Adams and Zhao, Vincent and Huang, Yanping and Dai, Andrew and Yu, Hongkun and Petrov, Slav and Chi, Ed H. and Dean, Jeff and Devlin, Jacob and Roberts, Adam and Zhou, Denny and Le, Quoc V. and Wei, Jason},
      
      keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
      
      title = {Scaling Instruction-Finetuned Language Models},
      
      publisher = {arXiv},
      
      year = {2022},
      
      copyright = {Creative Commons Attribution 4.0 International}
    }

## Model overview

The `flan-t5-xxl` is a large language model developed by Google that builds upon the T5 transformer architecture. It is part of the FLAN family of models, which have been fine-tuned on over 1,000 additional tasks compared to the original T5 models, spanning a wide range of languages including English, German, French, and many others. As noted in the [research paper](https://arxiv.org/pdf/2210.11416.pdf), the FLAN-T5 models achieve strong few-shot performance, even compared to much larger models like PaLM 62B.

The `flan-t5-xxl` is the extra-extra-large variant of the FLAN-T5 model, with over 10 billion parameters. Compared to similar models like the [Falcon-40B](https://aimodels.fyi/models/huggingFace/falcon-40b-tiiuae) and [FalconLite](https://aimodels.fyi/models/huggingFace/falconlite-amazon), the FLAN-T5 models focus more on being a general-purpose language model that can excel at a wide variety of text-to-text tasks, rather than being optimized for specific use cases.

## Model inputs and outputs

### Inputs
- **Text**: The `flan-t5-xxl` model takes text inputs that can be used for a wide range of natural language processing tasks, such as translation, summarization, question answering, and more.

### Outputs
- **Text**: The model outputs generated text, with the length and content depending on the specific task. For example, it can generate translated text, summaries, or answers to questions.

## Capabilities

The `flan-t5-xxl` model is a powerful general-purpose language model that can be applied to a wide variety of text-to-text tasks. It has been fine-tuned on a massive amount of data and can perform well on tasks like question answering, summarization, and translation, even in a few-shot or zero-shot setting. The model's multilingual capabilities also make it useful for working with text in different languages.

## What can I use it for?

The `flan-t5-xxl` model can be used for a wide range of natural language processing applications, such as:

- **Translation**: Translate text between supported languages, such as English, German, and French.
- **Summarization**: Generate concise summaries of longer text passages.
- **Question Answering**: Answer questions based on provided context.
- **Dialogue Generation**: Generate human-like responses in a conversational setting.
- **Text Generation**: Produce coherent and contextually relevant text on a given topic.

These are just a few examples - the model's broad capabilities make it a versatile tool for working with text data in a variety of domains and applications.

## Things to try

One key aspect of the `flan-t5-xxl` model is its strong few-shot and zero-shot performance, as highlighted in the [research paper](https://arxiv.org/pdf/2210.11416.pdf). This means that the model can often perform well on new tasks with only a small amount of training data, or even without any task-specific fine-tuning. 

To explore this capability, you could try using the model for a range of text-to-text tasks, and see how it performs with just a few examples or no fine-tuning at all. This could help you identify areas where the model excels, as well as potential limitations or biases to be aware of.

Another interesting thing to try would be to compare the performance of the `flan-t5-xxl` model to other large language models, such as the [Falcon-40B](https://aimodels.fyi/models/huggingFace/falcon-40b-tiiuae) or [FalconLite](https://aimodels.fyi/models/huggingFace/falconlite-amazon), on specific tasks or benchmarks. This could provide insights into the relative strengths and weaknesses of each model, and help you choose the best tool for your particular use case.

## Model Overview

The `gemma-7b-it` model is a 7 billion parameter version of the Gemma language model, an open and lightweight model developed by Google. The Gemma model family is built on the same research and technology as Google's Gemini models, and is well-suited for a variety of text generation tasks like question answering, summarization, and reasoning. The 7B instruct version has been further tuned for instruction following, making it useful for applications that require natural language understanding and generation.

The Gemma models are available in different sizes, including a [2B base model](https://huggingface.co/google/gemma-2b), a [7B base model](https://huggingface.co/google/gemma-7b), and a [2B instruct model](https://huggingface.co/google/gemma-2b-it) in addition to the `gemma-7b-it` model. These models are designed to be deployable on resource-constrained environments like laptops and desktops, democratizing access to state-of-the-art language models.

## Model Inputs and Outputs

### Inputs
- Natural language text that the model will generate a response for

### Outputs
- Generated natural language text that responds to or continues the input

## Capabilities

The `gemma-7b-it` model is capable of a wide range of text generation tasks, including question answering, summarization, and open-ended dialogue. It has been trained to follow instructions and can assist with tasks like research, analysis, and creative writing. The model's relatively small size allows it to be deployed on local infrastructure, making it accessible for individual developers and smaller organizations.

## What Can I Use It For?

The `gemma-7b-it` model can be used for a variety of applications that require natural language understanding and generation, such as:

- Question answering systems to provide information and answers to user queries
- Summarization tools to condense long-form text into concise summaries
- Chatbots and virtual assistants for open-ended dialogue and task completion
- Writing assistants to help with research, analysis, and creative projects

The model's instruction-following capabilities also make it useful for building applications that allow users to interact with the AI through natural language commands.

## Things to Try

Here are some ideas for interesting things to try with the `gemma-7b-it` model:

- Use the model to generate creative writing prompts and short stories
- Experiment with the model's ability to follow complex instructions and break them down into actionable steps
- Finetune the model on domain-specific data to create a specialized assistant for your field of interest
- Explore the model's reasoning and analytical capabilities by asking it to summarize research papers or provide insights on data

Remember to check the [Responsible Generative AI Toolkit](https://ai.google.dev/responsible) for guidance on using the model ethically and safely.

[](#model-card-for-flan-t5-base)Model Card for FLAN-T5 base
===========================================================

![drawing](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/flan2_architecture.jpg)

[](#table-of-contents)Table of Contents
=======================================

0.  [TL;DR](#TL;DR)
1.  [Model Details](#model-details)
2.  [Usage](#usage)
3.  [Uses](#uses)
4.  [Bias, Risks, and Limitations](#bias-risks-and-limitations)
5.  [Training Details](#training-details)
6.  [Evaluation](#evaluation)
7.  [Environmental Impact](#environmental-impact)
8.  [Citation](#citation)
9.  [Model Card Authors](#model-card-authors)

[](#tldr)TL;DR
==============

If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks covering also more languages. As mentioned in the first few lines of the abstract :

> Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints,1 which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

**Disclaimer**: Content from **this** model card has been written by the Hugging Face team, and parts of it were copy pasted from the [T5 model card](https://huggingface.co/t5-large).

[](#model-details)Model Details
===============================

[](#model-description)Model Description
---------------------------------------

*   **Model type:** Language model
*   **Language(s) (NLP):** English, Spanish, Japanese, Persian, Hindi, French, Chinese, Bengali, Gujarati, German, Telugu, Italian, Arabic, Polish, Tamil, Marathi, Malayalam, Oriya, Panjabi, Portuguese, Urdu, Galician, Hebrew, Korean, Catalan, Thai, Dutch, Indonesian, Vietnamese, Bulgarian, Filipino, Central Khmer, Lao, Turkish, Russian, Croatian, Swedish, Yoruba, Kurdish, Burmese, Malay, Czech, Finnish, Somali, Tagalog, Swahili, Sinhala, Kannada, Zhuang, Igbo, Xhosa, Romanian, Haitian, Estonian, Slovak, Lithuanian, Greek, Nepali, Assamese, Norwegian
*   **License:** Apache 2.0
*   **Related Models:** [All FLAN-T5 Checkpoints](https://huggingface.co/models?search=flan-t5)
*   **Original Checkpoints:** [All Original FLAN-T5 Checkpoints](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints)
*   **Resources for more information:**
    *   [Research paper](https://arxiv.org/pdf/2210.11416.pdf)
    *   [GitHub Repo](https://github.com/google-research/t5x)
    *   [Hugging Face FLAN-T5 Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/t5)

[](#usage)Usage
===============

Find below some example scripts on how to use the model in `transformers`:

[](#using-the-pytorch-model)Using the Pytorch model
---------------------------------------------------

### [](#running-the-model-on-a-cpu)Running the model on a CPU

Click to expand

    
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

### [](#running-the-model-on-a-gpu)Running the model on a GPU

Click to expand

    # pip install accelerate
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base", device_map="auto")
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

### [](#running-the-model-on-a-gpu-using-different-precisions)Running the model on a GPU using different precisions

#### [](#fp16)FP16

Click to expand

    # pip install accelerate
    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base", device_map="auto", torch_dtype=torch.float16)
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

#### [](#int8)INT8

Click to expand

    # pip install bitsandbytes accelerate
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base", device_map="auto", load_in_8bit=True)
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

[](#uses)Uses
=============

[](#direct-use-and-downstream-use)Direct Use and Downstream Use
---------------------------------------------------------------

The authors write in [the original paper's model card](https://arxiv.org/pdf/2210.11416.pdf) that:

> The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models

See the [research paper](https://arxiv.org/pdf/2210.11416.pdf) for further details.

[](#out-of-scope-use)Out-of-Scope Use
-------------------------------------

More information needed.

[](#bias-risks-and-limitations)Bias, Risks, and Limitations
===========================================================

The information below in this section are copied from the model's [official model card](https://arxiv.org/pdf/2210.11416.pdf):

> Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.

[](#ethical-considerations-and-risks)Ethical considerations and risks
---------------------------------------------------------------------

> Flan-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.

[](#known-limitations)Known Limitations
---------------------------------------

> Flan-T5 has not been tested in real world applications.

[](#sensitive-use)Sensitive Use:
--------------------------------

> Flan-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.

[](#training-details)Training Details
=====================================

[](#training-data)Training Data
-------------------------------

The model was trained on a mixture of tasks, that includes the tasks described in the table below (from the original paper, figure 2):

[![table.png](https://s3.amazonaws.com/moonup/production/uploads/1666363265279-62441d1d9fdefb55a0b7d12c.png)](https://s3.amazonaws.com/moonup/production/uploads/1666363265279-62441d1d9fdefb55a0b7d12c.png)

[](#training-procedure)Training Procedure
-----------------------------------------

According to the model card from the [original paper](https://arxiv.org/pdf/2210.11416.pdf):

> These models are based on pretrained T5 (Raffel et al., 2020) and fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned Flan model per T5 model size.

The model has been trained on TPU v3 or TPU v4 pods, using [`t5x`](https://github.com/google-research/t5x) codebase together with [`jax`](https://github.com/google/jax).

[](#evaluation)Evaluation
=========================

[](#testing-data-factors--metrics)Testing Data, Factors & Metrics
-----------------------------------------------------------------

The authors evaluated the model on various tasks covering several languages (1836 in total). See the table below for some quantitative evaluation: [![image.png](https://s3.amazonaws.com/moonup/production/uploads/1668072995230-62441d1d9fdefb55a0b7d12c.png)](https://s3.amazonaws.com/moonup/production/uploads/1668072995230-62441d1d9fdefb55a0b7d12c.png) For full details, please check the [research paper](https://arxiv.org/pdf/2210.11416.pdf).

[](#results)Results
-------------------

For full results for FLAN-T5-Base, see the [research paper](https://arxiv.org/pdf/2210.11416.pdf), Table 3.

[](#environmental-impact)Environmental Impact
=============================================

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

*   **Hardware Type:** Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips  4.
*   **Hours used:** More information needed
*   **Cloud Provider:** GCP
*   **Compute Region:** More information needed
*   **Carbon Emitted:** More information needed

[](#citation)Citation
=====================

**BibTeX:**

    @misc{https://doi.org/10.48550/arxiv.2210.11416,
      doi = {10.48550/ARXIV.2210.11416},
      
      url = {https://arxiv.org/abs/2210.11416},
      
      author = {Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Eric and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and Webson, Albert and Gu, Shixiang Shane and Dai, Zhuyun and Suzgun, Mirac and Chen, Xinyun and Chowdhery, Aakanksha and Narang, Sharan and Mishra, Gaurav and Yu, Adams and Zhao, Vincent and Huang, Yanping and Dai, Andrew and Yu, Hongkun and Petrov, Slav and Chi, Ed H. and Dean, Jeff and Devlin, Jacob and Roberts, Adam and Zhou, Denny and Le, Quoc V. and Wei, Jason},
      
      keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
      
      title = {Scaling Instruction-Finetuned Language Models},
      
      publisher = {arXiv},
      
      year = {2022},
      
      copyright = {Creative Commons Attribution 4.0 International}
    }
    

[](#model-recycling)Model Recycling
-----------------------------------

[Evaluation on 36 datasets](https://ibm.github.io/model-recycling/model_gain_chart?avg=9.16&mnli_lp=nan&20_newsgroup=3.34&ag_news=1.49&amazon_reviews_multi=0.21&anli=13.91&boolq=16.75&cb=23.12&cola=9.97&copa=34.50&dbpedia=6.90&esnli=5.37&financial_phrasebank=18.66&imdb=0.33&isear=1.37&mnli=11.74&mrpc=16.63&multirc=6.24&poem_sentiment=14.62&qnli=3.41&qqp=6.18&rotten_tomatoes=2.98&rte=24.26&sst2=0.67&sst_5bins=5.44&stsb=20.68&trec_coarse=3.95&trec_fine=10.73&tweet_ev_emoji=13.39&tweet_ev_emotion=4.62&tweet_ev_hate=3.46&tweet_ev_irony=9.04&tweet_ev_offensive=1.69&tweet_ev_sentiment=0.75&wic=14.22&wnli=9.44&wsc=5.53&yahoo_answers=4.14&model_name=google%2Fflan-t5-base&base_name=google%2Ft5-v1_1-base) using google/flan-t5-base as a base model yields average score of 77.98 in comparison to 68.82 by google/t5-v1\_1-base.

The model is ranked 1st among all tested models for the google/t5-v1\_1-base architecture as of 06/02/2023 Results:

20\_newsgroup

ag\_news

amazon\_reviews\_multi

anli

boolq

cb

cola

copa

dbpedia

esnli

financial\_phrasebank

imdb

isear

mnli

mrpc

multirc

poem\_sentiment

qnli

qqp

rotten\_tomatoes

rte

sst2

sst\_5bins

stsb

trec\_coarse

trec\_fine

tweet\_ev\_emoji

tweet\_ev\_emotion

tweet\_ev\_hate

tweet\_ev\_irony

tweet\_ev\_offensive

tweet\_ev\_sentiment

wic

wnli

wsc

yahoo\_answers

86.2188

89.6667

67.12

51.9688

82.3242

78.5714

80.1534

75

77.6667

90.9507

85.4

93.324

72.425

87.2457

89.4608

62.3762

82.6923

92.7878

89.7724

89.0244

84.8375

94.3807

57.2851

89.4759

97.2

92.8

46.848

80.2252

54.9832

76.6582

84.3023

70.6366

70.0627

56.338

53.8462

73.4

For more information, see: [Model Recycling](https://ibm.github.io/model-recycling/)

## Model overview

`flan-t5-base` is a language model developed by Google that is part of the FLAN-T5 family. It is an improved version of the original T5 model, with additional fine-tuning on over 1,000 tasks covering a variety of languages. Compared to the original T5 model, FLAN-T5 models like `flan-t5-base` are better at a wide range of tasks, including question answering, reasoning, and few-shot learning. The model is available in a range of sizes, from the base `flan-t5-base` to the much larger `flan-t5-xxl`. 

Similar FLAN-T5 models include [flan-t5-xxl](https://aimodels.fyi/models/huggingFace/flan-t5-xxl-google), which is a larger version of the model with better performance on some benchmarks. The Falcon series of models from TII, like [Falcon-40B](https://aimodels.fyi/models/huggingFace/falcon-40b-tiiuae) and [Falcon-180B](https://aimodels.fyi/models/huggingFace/falcon-180b-tiiuae), are also strong open-source language models that can be used for similar tasks.

## Model inputs and outputs

### Inputs
- **Text**: The `flan-t5-base` model takes text input, which can be in the form of a single sentence, a paragraph, or even longer documents.

### Outputs
- **Text**: The model generates text output, which can be used for a variety of tasks such as translation, summarization, question answering, and more.

## Capabilities

The `flan-t5-base` model is a powerful text-to-text transformer that can be used for a wide range of natural language processing tasks. It has shown strong performance on benchmarks like MMLU, HellaSwag, PIQA, and others, often outperforming even much larger language models. The model's versatility and few-shot learning capabilities make it a valuable tool for researchers and developers working on a variety of NLP applications.

## What can I use it for?

The `flan-t5-base` model can be used for a variety of natural language processing tasks, including:

- **Content Creation and Communication**: The model can be used to generate creative text, power chatbots and virtual assistants, and produce text summaries.
- **Research and Education**: Researchers can use the model as a foundation for experimenting with NLP techniques, developing new algorithms, and contributing to the advancement of the field. Educators can also leverage the model to create interactive language learning experiences.

## Things to try

One interesting aspect of the `flan-t5-base` model is its strong few-shot learning capabilities. This means that the model can often perform well on new tasks with just a few examples, without requiring extensive fine-tuning. Developers and researchers can experiment with prompting the model with different task descriptions and a small number of examples to see how it performs on a variety of downstream applications.

Another area to explore is the model's multilingual capabilities. The `flan-t5-base` model is trained on over 100 languages, which opens up opportunities to use it for cross-lingual tasks like machine translation, multilingual question answering, and more.

## Model overview

The `gemma-2b` model is a lightweight, state-of-the-art open model from Google, built from the same research and technology used to create the Gemini models. It is part of the Gemma family of text-to-text, decoder-only large language models available in English, with open weights, pre-trained variants, and instruction-tuned variants. The [Gemma 7B base model](https://aimodels.fyi/models/huggingFace/gemma-7b-google-deepmind), [Gemma 7B instruct model](https://aimodels.fyi/models/huggingFace/gemma-7b-it-google-deepmind), and [Gemma 2B instruct model](https://aimodels.fyi/models/huggingFace/gemma-2b-it-google-deepmind) are other variants in the Gemma family. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation.

## Model inputs and outputs

The `gemma-2b` model is a text-to-text, decoder-only large language model. It takes text as input and generates English-language text in response, such as answers to questions, summaries of documents, or other types of generated content.

### Inputs
- Text strings, such as questions, prompts, or documents to be summarized

### Outputs 
- Generated English-language text in response to the input, such as answers, summaries, or other types of generated content

## Capabilities

The `gemma-2b` model excels at a variety of text generation tasks. It can be used to generate creative content like poems, scripts, and marketing copy. It can also power conversational interfaces for chatbots and virtual assistants, or provide text summarization capabilities. The model has demonstrated strong performance on benchmarks evaluating tasks like question answering, common sense reasoning, and code generation.

## What can I use it for?

The `gemma-2b` model can be leveraged for a wide range of natural language processing applications. For content creation, you could use it to draft blog posts, emails, or other written materials. In the education and research domains, it could assist with language learning tools, knowledge exploration, and advancing natural language processing research. Developers could integrate the model into chatbots, virtual assistants, and other conversational AI applications.

## Things to try

One interesting aspect of the `gemma-2b` model is its relatively small size compared to larger language models, yet it still maintains state-of-the-art performance on many benchmarks. This makes it well-suited for deployment in resource-constrained environments like edge devices or personal computers. You could experiment with using the model to generate content on your local machine or explore its capabilities for tasks like code generation or common sense reasoning. The model's open weights and [well-documented usage examples](https://huggingface.co/google/gemma-7b/tree/main/examples) also make it an appealing choice for researchers and developers looking to experiment with and build upon large language model technologies.

Platform did not provide a description for this model.

## Model overview

The `timesfm-1.0-200m` is an AI model developed by [Google](https://aimodels.fyi/creators/huggingFace/google). It is a text-to-text model, meaning it can be used for a variety of natural language processing tasks. The model is similar to other text-to-text models like [evo-1-131k-base](https://aimodels.fyi/models/huggingFace/evo-1-131k-base-togethercomputer), [longchat-7b-v1.5-32k](https://aimodels.fyi/models/huggingFace/longchat-7b-v15-32k-lmsys), and [h2ogpt-gm-oasst1-en-2048-falcon-7b-v2](https://aimodels.fyi/models/huggingFace/h2ogpt-gm-oasst1-en-2048-falcon-7b-v2-h2oai).

## Model inputs and outputs

The `timesfm-1.0-200m` model takes in text as input and generates text as output. The input can be any kind of natural language text, such as sentences, paragraphs, or entire documents. The output can be used for a variety of tasks, such as text generation, text summarization, and language translation.

### Inputs
- Natural language text

### Outputs
- Natural language text

## Capabilities

The `timesfm-1.0-200m` model has a range of capabilities, including text generation, text summarization, and language translation. It can be used to generate coherent and fluent text on a variety of topics, and can also be used to summarize longer documents or translate between different languages.

## What can I use it for?

The `timesfm-1.0-200m` model can be used for a variety of applications, such as chatbots, content creation, and language learning. For example, a company could use the model to generate product descriptions or marketing content, or an individual could use it to practice a foreign language. The model could also be fine-tuned on specific datasets to perform specialized tasks, such as legal document summarization or medical text generation.

## Things to try

Some interesting things to try with the `timesfm-1.0-200m` model include generating creative short stories, summarizing academic papers, and translating between different languages. The model's versatility makes it a useful tool for a wide range of natural language processing tasks.

[](#vision-transformer-base-sized-model)Vision Transformer (base-sized model)
=============================================================================

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.

Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team.

[](#model-description)Model description
---------------------------------------

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a \[CLS\] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.

By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the \[CLS\] token, as the last hidden state of this token can be seen as a representation of an entire image.

[](#intended-uses--limitations)Intended uses & limitations
----------------------------------------------------------

You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=google/vit) to look for fine-tuned versions on a task that interests you.

### [](#how-to-use)How to use

Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes:

    from transformers import ViTImageProcessor, ViTForImageClassification
    from PIL import Image
    import requests
    
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)
    
    processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
    model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
    
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits
    # model predicts one of the 1000 ImageNet classes
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    

For more code examples, we refer to the [documentation](https://huggingface.co/transformers/model_doc/vit.html#).

[](#training-data)Training data
-------------------------------

The ViT model was pretrained on [ImageNet-21k](http://www.image-net.org/), a dataset consisting of 14 million images and 21k classes, and fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/), a dataset consisting of 1 million images and 1k classes.

[](#training-procedure)Training procedure
-----------------------------------------

### [](#preprocessing)Preprocessing

The exact details of preprocessing of images during training/validation can be found [here](https://github.com/google-research/vision_transformer/blob/master/vit_jax/input_pipeline.py).

Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).

### [](#pretraining)Pretraining

The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Training resolution is 224.

[](#evaluation-results)Evaluation results
-----------------------------------------

For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Of course, increasing the model size will result in better performance.

### [](#bibtex-entry-and-citation-info)BibTeX entry and citation info

    @misc{wu2020visual,
          title={Visual Transformers: Token-based Image Representation and Processing for Computer Vision}, 
          author={Bichen Wu and Chenfeng Xu and Xiaoliang Dai and Alvin Wan and Peizhao Zhang and Zhicheng Yan and Masayoshi Tomizuka and Joseph Gonzalez and Kurt Keutzer and Peter Vajda},
          year={2020},
          eprint={2006.03677},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
    }
    

    @inproceedings{deng2009imagenet,
      title={Imagenet: A large-scale hierarchical image database},
      author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
      booktitle={2009 IEEE conference on computer vision and pattern recognition},
      pages={248--255},
      year={2009},
      organization={Ieee}
    }

## Model overview

The `vit-base-patch16-224` is a Vision Transformer (ViT) model pre-trained on ImageNet-21k, a large dataset of 14 million images across 21,843 classes. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). The weights were later converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman.

The `vit-base-patch16-224-in21k` model is another ViT model pre-trained on the larger ImageNet-21k dataset, but not fine-tuned on the smaller ImageNet 2012 dataset like the `vit-base-patch16-224` model. Both models use a transformer encoder architecture to process images as sequences of fixed-size patches, with the addition of a [CLS] token for classification tasks.

The `all-mpnet-base-v2` is a sentence-transformer model that maps sentences and paragraphs to a 768-dimensional dense vector space, enabling tasks like clustering and semantic search. It was fine-tuned on over 1 billion sentence pairs using a self-supervised contrastive learning objective.

The `owlvit-base-patch32` model is designed for zero-shot and open-vocabulary object detection, allowing it to detect objects without relying on pre-defined class labels.

The `stable-diffusion-x4-upscaler` is a text-guided latent diffusion model trained for 1.25M steps on high-resolution images (>2048x2048) from the LAION dataset. It can be used to upscale low-resolution images by 4x while preserving semantic information.

## Model inputs and outputs

### Inputs

- **Images**: The `vit-base-patch16-224` and `vit-base-patch16-224-in21k` models take images as input, which are divided into fixed-size patches and linearly embedded.
- **Sentences/Paragraphs**: The `all-mpnet-base-v2` model takes sentences or paragraphs as input and encodes them into a dense vector representation.
- **Low-resolution images and text prompts**: The `stable-diffusion-x4-upscaler` model takes low-resolution images and text prompts as input, and generates a high-resolution upscaled image.

### Outputs

- **Image classification logits**: The `vit-base-patch16-224` and `vit-base-patch16-224-in21k` models output logits for each of the 1,000 ImageNet classes.
- **Sentence embeddings**: The `all-mpnet-base-v2` model outputs a 768-dimensional vector representation for each input sentence or paragraph.
- **High-resolution upscaled images**: The `stable-diffusion-x4-upscaler` model generates a high-resolution (4x) upscaled image based on the input low-resolution image and text prompt.

## Capabilities

The `vit-base-patch16-224` and `vit-base-patch16-224-in21k` models are capable of classifying images into 1,000 ImageNet classes with high accuracy. The `all-mpnet-base-v2` model can be used for a variety of sentence-level tasks, such as information retrieval, clustering, and semantic search. The `stable-diffusion-x4-upscaler` model can generate high-resolution images from low-resolution inputs while preserving semantic information.

## What can I use it for?

The `vit-base-patch16-224` and `vit-base-patch16-224-in21k` models can be used for image classification tasks, such as recognizing objects, scenes, or activities in images. The `all-mpnet-base-v2` model can be used to build applications that require semantic understanding of text, such as chatbots, search engines, or recommendation systems. The `stable-diffusion-x4-upscaler` model can be used to generate high-quality images for use in creative applications, design, or visualization.

## Things to try

With the `vit-base-patch16-224` and `vit-base-patch16-224-in21k` models, you can try fine-tuning them on your own image classification datasets to adapt them to your specific needs. The `all-mpnet-base-v2` model can be used as a starting point for training your own sentence embedding models, or to generate sentence-level features for downstream tasks. The `stable-diffusion-x4-upscaler` model can be combined with text-to-image generation models to create high-resolution images from text prompts, opening up new possibilities for creative applications.

[](#model-card-for-flan-ul2)Model card for Flan-UL2
===================================================

[![model image](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/ul2.png)](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/ul2.png)

[](#table-of-contents)Table of Contents
=======================================

0.  [TL;DR](#TL;DR)
1.  [Using the model](#using-the-model)
2.  [Results](#results)
3.  [Introduction to UL2](#introduction-to-ul2)
4.  [Training](#training)
5.  [Contribution](#contribution)
6.  [Citation](#citation)

[](#tldr)TL;DR
==============

Flan-UL2 is an encoder decoder model based on the `T5` architecture. It uses the same configuration as the [`UL2 model`](https://huggingface.co/google/ul2) released earlier last year. It was fine tuned using the "Flan" prompt tuning and dataset collection.

According to the original [blog](https://www.yitay.net/blog/flan-ul2-20b) here are the notable improvements:

*   The original UL2 model was only trained with receptive field of 512, which made it non-ideal for N-shot prompting where N is large.
*   The Flan-UL2 checkpoint uses a receptive field of 2048 which makes it more usable for few-shot in-context learning.
*   The original UL2 model also had mode switch tokens that was rather mandatory to get good performance. However, they were a little cumbersome as this requires often some changes during inference or finetuning. In this update/change, we continue training UL2 20B for an additional 100k steps (with small batch) to forget mode tokens before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore.

[](#using-the-model)Using the model
===================================

[](#converting-from-t5x-to-huggingface)Converting from T5x to huggingface
-------------------------------------------------------------------------

You can use the [`convert_t5x_checkpoint_to_pytorch.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/convert_t5x_checkpoint_to_pytorch.py) script and pass the argument `strict = False`. The final layer norm is missing from the original dictionnary, that is why we are passing the `strict = False` argument.

    python convert_t5x_checkpoint_to_pytorch.py --t5x_checkpoint_path PATH_TO_T5X_CHECKPOINTS --config_file PATH_TO_CONFIG --pytorch_dump_path PATH_TO_SAVE
    

We used the same config file as [`google/ul2`](https://huggingface.co/google/ul2/blob/main/config.json).

[](#running-the-model)Running the model
---------------------------------------

For more efficient memory usage, we advise you to load the model in `8bit` using `load_in_8bit` flag as follows (works only under GPU):

    # pip install accelerate transformers bitsandbytes
    from transformers import T5ForConditionalGeneration, AutoTokenizer
    import torch
    model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2", device_map="auto", load_in_8bit=True)                                                                 
    tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
    
    input_string = "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apple do they have?"                                               
    
    inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(inputs, max_length=200)
    
    print(tokenizer.decode(outputs[0]))
    # <pad> They have 23 - 20 = 3 apples left. They have 3 + 6 = 9 apples. Therefore, the answer is 9.</s>
    

Otherwise, you can load and run the model in `bfloat16` as follows:

    # pip install accelerate transformers
    from transformers import T5ForConditionalGeneration, AutoTokenizer
    import torch
    model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2", torch_dtype=torch.bfloat16, device_map="auto")                                                                 
    tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
    
    input_string = "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apple do they have?"                                               
    
    inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(inputs, max_length=200)
    
    print(tokenizer.decode(outputs[0]))
    # <pad> They have 23 - 20 = 3 apples left. They have 3 + 6 = 9 apples. Therefore, the answer is 9.</s>
    

[](#results)Results
===================

[](#performance-improvment)Performance improvment
-------------------------------------------------

The reported results are the following :

MMLU

BBH

MMLU-CoT

BBH-CoT

Avg

FLAN-PaLM 62B

59.6

47.5

56.9

44.9

49.9

FLAN-PaLM 540B

73.5

57.9

70.9

66.3

67.2

FLAN-T5-XXL 11B

55.1

45.3

48.6

41.4

47.6

FLAN-UL2 20B

55.7(+1.1%)

45.9(+1.3%)

52.2(+7.4%)

42.7(+3.1%)

49.1(+3.2%)

[](#introduction-to-ul2)Introduction to UL2
===========================================

This entire section has been copied from the [`google/ul2`](https://huggingface.co/google/ul2) model card and might be subject of change with respect to `flan-ul2`.

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), apre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.

[![model image](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/ul2.png)](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/ul2.png)

**Abstract**

Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieve strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.

For more information, please take a look at the original paper.

Paper: [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1)

Authors: _Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler_

[](#training)Training
---------------------

### [](#flan-ul2)Flan UL2

The Flan-UL2 model was initialized using the `UL2` checkpoints, and was then trained additionally using Flan Prompting. This means that the original training corpus is `C4`,

In Scaling Instruction-Finetuned language models (Chung et al.) (also referred to sometimes as the Flan2 paper), the key idea is to train a large language model on a collection of datasets. These datasets are phrased as instructions which enable generalization across diverse tasks. Flan has been primarily trained on academic tasks. In Flan2, we released a series of T5 models ranging from 200M to 11B parameters that have been instruction tuned with Flan.

The Flan datasets have also been open sourced in The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (Longpre et al.). See Google AI Blogpost: The Flan Collection: Advancing Open Source Methods for Instruction Tuning.

[](#ul2-pretraining)UL2 PreTraining
-----------------------------------

The model is pretrained on the C4 corpus. For pretraining, the model is trained on a total of 1 trillion tokens on C4 (2 million steps) with a batch size of 1024. The sequence length is set to 512/512 for inputs and targets. Dropout is set to 0 during pretraining. Pre-training took slightly more than one month for about 1 trillion tokens. The model has 32 encoder layers and 32 decoder layers, `dmodel` of 4096 and `df` of 16384. The dimension of each head is 256 for a total of 16 heads. Our model uses a model parallelism of 8. The same sentencepiece tokenizer as T5 of vocab size 32000 is used (click [here](https://huggingface.co/docs/transformers/v4.20.0/en/model_doc/t5#transformers.T5Tokenizer) for more information about the T5 tokenizer).

UL-20B can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs. UL-20B was trained using the [Jax](https://github.com/google/jax) and [T5X](https://github.com/google-research/t5x) infrastructure.

The training objective during pretraining is a mixture of different denoising strategies that are explained in the following:

### [](#mixture-of-denoisers)Mixture of Denoisers

To quote the paper:

> We conjecture that a strong universal model has to be exposed to solving diverse set of problems during pre-training. Given that pre-training is done using self-supervision, we argue that such diversity should be injected to the objective of the model, otherwise the model might suffer from lack a certain ability, like long-coherent text generation. Motivated by this, as well as current class of objective functions, we define three main paradigms that are used during pre-training:

*   **R-Denoiser**: The regular denoising is the standard span corruption introduced in [T5](https://huggingface.co/docs/transformers/v4.20.0/en/model_doc/t5) that uses a range of 2 to 5 tokens as the span length, which masks about 15% of input tokens. These spans are short and potentially useful to acquire knowledge instead of learning to generate fluent text.
    
*   **S-Denoiser**: A specific case of denoising where we observe a strict sequential order when framing the inputs-to-targets task, i.e., prefix language modeling. To do so, we simply partition the input sequence into two sub-sequences of tokens as context and target such that the targets do not rely on future information. This is unlike standard span corruption where there could be a target token with earlier position than a context token. Note that similar to the Prefix-LM setup, the context (prefix) retains a bidirectional receptive field. We note that S-Denoising with very short memory or no memory is in similar spirit to standard causal language modeling.
    
*   **X-Denoiser**: An extreme version of denoising where the model must recover a large part of the input, given a small to moderate part of it. This simulates a situation where a model needs to generate long target from a memory with relatively limited information. To do so, we opt to include examples with aggressive denoising where approximately 50% of the input sequence is masked. This is by increasing the span length and/or corruption rate. We consider a pre-training task to be extreme if it has a long span (e.g.,  12 tokens) or have a large corruption rate (e.g.,  30%). X-denoising is motivated by being an interpolation between regular span corruption and language model like objectives.
    

See the following diagram for a more visual explanation:

[![mixture-of-denoisers](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/mod.png)](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/mod.png)

**Important**: For more details, please see sections 3.1.2 of the [paper](https://arxiv.org/pdf/2205.05131v1.pdf).

[](#fine-tuning)Fine-tuning
---------------------------

The model was continously fine-tuned after N pretraining steps where N is typically from 50k to 100k. In other words, after each Nk steps of pretraining, the model is finetuned on each downstream task. See section 5.2.2 of [paper](https://arxiv.org/pdf/2205.05131v1.pdf) to get an overview of all datasets that were used for fine-tuning).

As the model is continuously finetuned, finetuning is stopped on a task once it has reached state-of-the-art to save compute. In total, the model was trained for 2.65 million steps.

**Important**: For more details, please see sections 5.2.1 and 5.2.2 of the [paper](https://arxiv.org/pdf/2205.05131v1.pdf).

[](#contribution)Contribution
=============================

This model was originally contributed by [Yi Tay](https://www.yitay.net/?author=636616684c5e64780328eece), and added to the Hugging Face ecosystem by [Younes Belkada](https://huggingface.co/ybelkada) & [Arthur Zucker](https://huggingface.co/ArthurZ).

[](#citation)Citation
=====================

If you want to cite this work, please consider citing the [blogpost](https://www.yitay.net/blog/flan-ul2-20b) announcing the release of `Flan-UL2`.

## Model overview

`flan-ul2` is an encoder-decoder model based on the T5 architecture, developed by Google. It uses the same configuration as the earlier `UL2` model, but with some key improvements. Unlike the original UL2 model which had a receptive field of only 512, `flan-ul2` has a receptive field of 2048, making it more suitable for few-shot in-context learning tasks. Additionally, the `flan-ul2` checkpoint does not require the use of mode switch tokens, which were previously necessary to achieve good performance.

The `flan-ul2` model was fine-tuned using the "Flan" prompt tuning approach and a curated dataset. This process aimed to improve the model's few-shot abilities compared to the original UL2 model. Similar models include the [flan-t5-xxl](https://aimodels.fyi/models/huggingFace/flan-t5-xxl-google) and [flan-t5-base](https://aimodels.fyi/models/huggingFace/flan-t5-base-google) models, which were also fine-tuned on a broad range of tasks.

## Model inputs and outputs

### Inputs
- **Text**: The model accepts natural language text as input, which can be in the form of a single sentence, a paragraph, or a longer passage.

### Outputs
- **Text**: The model generates natural language text as output, which can be used for tasks such as language translation, summarization, question answering, and more.

## Capabilities

The `flan-ul2` model is capable of a wide range of text-to-text tasks, including translation, summarization, and question answering. Its improved receptive field and removal of mode switch tokens make it better suited for few-shot learning compared to the original UL2 model.

## What can I use it for?

The `flan-ul2` model can be used as a foundation for various natural language processing applications, such as building chatbots, content generation tools, and personalized language assistants. Its few-shot learning capabilities make it a promising candidate for research into in-context learning and zero-shot task generalization.

## Things to try

Experiment with using the `flan-ul2` model for few-shot learning tasks, where you provide the model with a small number of examples to guide its understanding of a new task or problem. Additionally, you could fine-tune the model on a specific domain or dataset to further enhance its performance for your particular use case.

## Model Overview

The `gemma-2b-it` is an instruct-tuned version of the Gemma 2B language model from Google. Gemma is a family of open, state-of-the-art models designed for versatile text generation tasks like question answering, summarization, and reasoning. The 2B instruct model builds on the base Gemma 2B model with additional fine-tuning to improve its ability to follow instructions and generate coherent text in response to prompts.

Similar models in the Gemma family include the [Gemma 2B base model](https://aimodels.fyi/models/huggingFace/gemma-2b-google), the [Gemma 7B base model](https://aimodels.fyi/models/huggingFace/gemma-7b-google), and the [Gemma 7B instruct model](https://aimodels.fyi/models/huggingFace/gemma-7b-it-google). These models share the same underlying architecture and training approach, but differ in scale and the addition of the instruct-tuning step.

## Model Inputs and Outputs

### Inputs
- Text prompts or instructions that the model should generate content in response to, such as questions, writing tasks, or open-ended requests.

### Outputs
- Generated English-language text that responds to the input prompt or instruction, such as an answer to a question, a summary of a document, or creative writing.

## Capabilities

The `gemma-2b-it` model is capable of generating high-quality text output across a variety of tasks. For example, it can answer questions, write creative stories, summarize documents, and explain complex topics. The model's performance has been evaluated on a range of benchmarks, showing strong results compared to other open models of similar size.

## What Can I Use it For?

The `gemma-2b-it` model is well-suited for a wide range of natural language processing applications:

- **Content Creation**: Use the model to generate draft text for marketing copy, scripts, emails, or other creative writing tasks.
- **Conversational AI**: Integrate the model into chatbots or virtual assistants to power more natural and engaging conversations.
- **Research and Education**: Leverage the model as a foundation for further NLP research or to create interactive learning tools.

By providing a high-performance yet accessible open model, Google hopes to democratize access to state-of-the-art language AI and foster innovation across many domains.

## Things to Try

One interesting aspect of the `gemma-2b-it` model is its ability to follow instructions and generate text that aligns with specific prompts or objectives. You could experiment with giving the model detailed instructions or multi-step tasks and observe how it responds. For example, try asking it to write a short story about a specific theme, or have it summarize a research paper in a concise way. The model's flexibility and coherence in these types of guided tasks is a key strength.

Another area to explore is the model's performance on more technical or specialized language, such as code generation, mathematical reasoning, or scientific writing. The diverse training data used for Gemma models is designed to expose them to a wide range of linguistic styles and domains, so they may be able to handle these types of inputs more effectively than some other language models.

[](#model-card-for-flan-t5-large)Model Card for FLAN-T5 large
=============================================================

![drawing](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/flan2_architecture.jpg)

[](#table-of-contents)Table of Contents
=======================================

0.  [TL;DR](#TL;DR)
1.  [Model Details](#model-details)
2.  [Usage](#usage)
3.  [Uses](#uses)
4.  [Bias, Risks, and Limitations](#bias-risks-and-limitations)
5.  [Training Details](#training-details)
6.  [Evaluation](#evaluation)
7.  [Environmental Impact](#environmental-impact)
8.  [Citation](#citation)
9.  [Model Card Authors](#model-card-authors)

[](#tldr)TL;DR
==============

If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks covering also more languages. As mentioned in the first few lines of the abstract :

> Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints,1 which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

**Disclaimer**: Content from **this** model card has been written by the Hugging Face team, and parts of it were copy pasted from the [T5 model card](https://huggingface.co/t5-large).

[](#model-details)Model Details
===============================

[](#model-description)Model Description
---------------------------------------

*   **Model type:** Language model
*   **Language(s) (NLP):** English, Spanish, Japanese, Persian, Hindi, French, Chinese, Bengali, Gujarati, German, Telugu, Italian, Arabic, Polish, Tamil, Marathi, Malayalam, Oriya, Panjabi, Portuguese, Urdu, Galician, Hebrew, Korean, Catalan, Thai, Dutch, Indonesian, Vietnamese, Bulgarian, Filipino, Central Khmer, Lao, Turkish, Russian, Croatian, Swedish, Yoruba, Kurdish, Burmese, Malay, Czech, Finnish, Somali, Tagalog, Swahili, Sinhala, Kannada, Zhuang, Igbo, Xhosa, Romanian, Haitian, Estonian, Slovak, Lithuanian, Greek, Nepali, Assamese, Norwegian
*   **License:** Apache 2.0
*   **Related Models:** [All FLAN-T5 Checkpoints](https://huggingface.co/models?search=flan-t5)
*   **Original Checkpoints:** [All Original FLAN-T5 Checkpoints](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints)
*   **Resources for more information:**
    *   [Research paper](https://arxiv.org/pdf/2210.11416.pdf)
    *   [GitHub Repo](https://github.com/google-research/t5x)
    *   [Hugging Face FLAN-T5 Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/t5)

[](#usage)Usage
===============

Find below some example scripts on how to use the model in `transformers`:

[](#using-the-pytorch-model)Using the Pytorch model
---------------------------------------------------

### [](#running-the-model-on-a-cpu)Running the model on a CPU

Click to expand

    
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

### [](#running-the-model-on-a-gpu)Running the model on a GPU

Click to expand

    # pip install accelerate
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto")
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

### [](#running-the-model-on-a-gpu-using-different-precisions)Running the model on a GPU using different precisions

#### [](#fp16)FP16

Click to expand

    # pip install accelerate
    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

#### [](#int8)INT8

Click to expand

    # pip install bitsandbytes accelerate
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", load_in_8bit=True)
    
    input_text = "translate English to German: How old are you?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))

[](#uses)Uses
=============

[](#direct-use-and-downstream-use)Direct Use and Downstream Use
---------------------------------------------------------------

The authors write in [the original paper's model card](https://arxiv.org/pdf/2210.11416.pdf) that:

> The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models

See the [research paper](https://arxiv.org/pdf/2210.11416.pdf) for further details.

[](#out-of-scope-use)Out-of-Scope Use
-------------------------------------

More information needed.

[](#bias-risks-and-limitations)Bias, Risks, and Limitations
===========================================================

The information below in this section are copied from the model's [official model card](https://arxiv.org/pdf/2210.11416.pdf):

> Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.

[](#ethical-considerations-and-risks)Ethical considerations and risks
---------------------------------------------------------------------

> Flan-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.

[](#known-limitations)Known Limitations
---------------------------------------

> Flan-T5 has not been tested in real world applications.

[](#sensitive-use)Sensitive Use:
--------------------------------

> Flan-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.

[](#training-details)Training Details
=====================================

[](#training-data)Training Data
-------------------------------

The model was trained on a mixture of tasks, that includes the tasks described in the table below (from the original paper, figure 2):

[![table.png](https://s3.amazonaws.com/moonup/production/uploads/1666363265279-62441d1d9fdefb55a0b7d12c.png)](https://s3.amazonaws.com/moonup/production/uploads/1666363265279-62441d1d9fdefb55a0b7d12c.png)

[](#training-procedure)Training Procedure
-----------------------------------------

According to the model card from the [original paper](https://arxiv.org/pdf/2210.11416.pdf):

> These models are based on pretrained T5 (Raffel et al., 2020) and fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned Flan model per T5 model size.

The model has been trained on TPU v3 or TPU v4 pods, using [`t5x`](https://github.com/google-research/t5x) codebase together with [`jax`](https://github.com/google/jax).

[](#evaluation)Evaluation
=========================

[](#testing-data-factors--metrics)Testing Data, Factors & Metrics
-----------------------------------------------------------------

The authors evaluated the model on various tasks covering several languages (1836 in total). See the table below for some quantitative evaluation: [![image.png](https://s3.amazonaws.com/moonup/production/uploads/1668072995230-62441d1d9fdefb55a0b7d12c.png)](https://s3.amazonaws.com/moonup/production/uploads/1668072995230-62441d1d9fdefb55a0b7d12c.png) For full details, please check the [research paper](https://arxiv.org/pdf/2210.11416.pdf).

[](#results)Results
-------------------

For full results for FLAN-T5-Large, see the [research paper](https://arxiv.org/pdf/2210.11416.pdf), Table 3.

[](#environmental-impact)Environmental Impact
=============================================

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

*   **Hardware Type:** Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips  4.
*   **Hours used:** More information needed
*   **Cloud Provider:** GCP
*   **Compute Region:** More information needed
*   **Carbon Emitted:** More information needed

[](#citation)Citation
=====================

**BibTeX:**

    @misc{https://doi.org/10.48550/arxiv.2210.11416,
      doi = {10.48550/ARXIV.2210.11416},
      
      url = {https://arxiv.org/abs/2210.11416},
      
      author = {Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Eric and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and Webson, Albert and Gu, Shixiang Shane and Dai, Zhuyun and Suzgun, Mirac and Chen, Xinyun and Chowdhery, Aakanksha and Narang, Sharan and Mishra, Gaurav and Yu, Adams and Zhao, Vincent and Huang, Yanping and Dai, Andrew and Yu, Hongkun and Petrov, Slav and Chi, Ed H. and Dean, Jeff and Devlin, Jacob and Roberts, Adam and Zhou, Denny and Le, Quoc V. and Wei, Jason},
      
      keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
      
      title = {Scaling Instruction-Finetuned Language Models},
      
      publisher = {arXiv},
      
      year = {2022},
      
      copyright = {Creative Commons Attribution 4.0 International}
    }

## Model overview

The `flan-t5-large` model is a large language model developed by Google and released through Hugging Face. It is an improvement upon the popular T5 model, with enhanced performance on a wide range of tasks and languages. Compared to the base T5 model, `flan-t5-large` has been fine-tuned on over 1,000 additional tasks, covering a broader set of languages including English, Spanish, Japanese, French, and many others. This fine-tuning process, known as "instruction finetuning", helps the model achieve state-of-the-art performance on benchmarks like MMLU.

The `flan-t5-xxl` and `flan-t5-base` models are similar, larger and smaller variants of the `flan-t5-large` model, respectively. These models follow the same architectural improvements and fine-tuning process, but with different parameter sizes. The `flan-ul2` model is another related model, built by TII, that uses a unified training approach to achieve strong performance across a variety of tasks.

## Model inputs and outputs

### Inputs
- **Text**: The `flan-t5-large` model accepts text as input, which can be in the form of a single sequence or paired sequences (e.g., for tasks like translation or question answering).

### Outputs
- **Text**: The model generates text as output, which can be used for a variety of natural language processing tasks such as summarization, translation, and question answering.

## Capabilities

The `flan-t5-large` model excels at a wide range of natural language processing tasks, including text generation, question answering, summarization, and translation. Its performance is significantly improved compared to the base T5 model, thanks to the extensive fine-tuning on a diverse set of tasks and languages. For example, the [research paper](https://arxiv.org/pdf/2210.11416.pdf) reports that the `flan-t5-xxl` model achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU.

## What can I use it for?

The `flan-t5-large` model is well-suited for research on language models, including exploring zero-shot and few-shot learning on various NLP tasks. It can also be used as a foundation for further specialization and fine-tuning on specific use cases, such as chatbots, content generation, and question answering systems. The [paper](https://arxiv.org/pdf/2210.11416.pdf) suggests that the model should not be used directly in any application without a prior assessment of safety and fairness concerns.

## Things to try

One interesting aspect of the `flan-t5-large` model is its ability to handle a diverse set of languages, including English, Spanish, Japanese, and many others. Researchers and developers can explore the model's performance on cross-lingual tasks, such as translating between these languages or building multilingual applications. Additionally, the model's strong few-shot learning capabilities can be leveraged to quickly adapt it to new domains or tasks with limited fine-tuning data.