[](#bart-large-mnli)bart-large-mnli
===================================

This is the checkpoint for [bart-large](https://huggingface.co/facebook/bart-large) after being trained on the [MultiNLI (MNLI)](https://huggingface.co/datasets/multi_nli) dataset.

Additional information about this model:

*   The [bart-large](https://huggingface.co/facebook/bart-large) model page
*   [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
*   [BART fairseq implementation](https://github.com/pytorch/fairseq/tree/master/fairseq/models/bart)

[](#nli-based-zero-shot-text-classification)NLI-based Zero Shot Text Classification
-----------------------------------------------------------------------------------

[Yin et al.](https://arxiv.org/abs/1909.00161) proposed a method for using pre-trained NLI models as a ready-made zero-shot sequence classifiers. The method works by posing the sequence to be classified as the NLI premise and to construct a hypothesis from each candidate label. For example, if we want to evaluate whether a sequence belongs to the class "politics", we could construct a hypothesis of `This text is about politics.`. The probabilities for entailment and contradiction are then converted to label probabilities.

This method is surprisingly effective in many cases, particularly when used with larger pre-trained models like BART and Roberta. See [this blog post](https://joeddav.github.io/blog/2020/05/29/ZSL.html) for a more expansive introduction to this and other zero shot methods, and see the code snippets below for examples of using this model for zero-shot classification both with Hugging Face's built-in pipeline and with native Transformers/PyTorch code.

#### [](#with-the-zero-shot-classification-pipeline)With the zero-shot classification pipeline

The model can be loaded with the `zero-shot-classification` pipeline like so:

    from transformers import pipeline
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")
    

You can then use this pipeline to classify sequences into any of the class names you specify.

    sequence_to_classify = "one day I will see the world"
    candidate_labels = ['travel', 'cooking', 'dancing']
    classifier(sequence_to_classify, candidate_labels)
    #{'labels': ['travel', 'dancing', 'cooking'],
    # 'scores': [0.9938651323318481, 0.0032737774308770895, 0.002861034357920289],
    # 'sequence': 'one day I will see the world'}
    

If more than one candidate label can be correct, pass `multi_label=True` to calculate each class independently:

    candidate_labels = ['travel', 'cooking', 'dancing', 'exploration']
    classifier(sequence_to_classify, candidate_labels, multi_label=True)
    #{'labels': ['travel', 'exploration', 'dancing', 'cooking'],
    # 'scores': [0.9945111274719238,
    #  0.9383890628814697,
    #  0.0057061901316046715,
    #  0.0018193122232332826],
    # 'sequence': 'one day I will see the world'}
    

#### [](#with-manual-pytorch)With manual PyTorch

    # pose sequence as a NLI premise and label as a hypothesis
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
    tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
    
    premise = sequence
    hypothesis = f'This example is {label}.'
    
    # run through model pre-trained on MNLI
    x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                         truncation_strategy='only_first')
    logits = nli_model(x.to(device))[0]
    
    # we throw away "neutral" (dim 1) and take the probability of
    # "entailment" (2) as the probability of the label being true 
    entail_contradiction_logits = logits[:,[0,2]]
    probs = entail_contradiction_logits.softmax(dim=1)
    prob_label_is_true = probs[:,1]

## Model overview

The `bart-large-mnli` model is a checkpoint of the BART-large model that has been fine-tuned on the [MultiNLI (MNLI)](https://huggingface.co/datasets/multi_nli) dataset. BART is a denoising autoencoder for pretraining sequence-to-sequence models, developed by researchers at [Facebook](https://aimodels.fyi/creators/huggingFace/facebook). The MNLI dataset is a large-scale [natural language inference](https://en.wikipedia.org/wiki/Natural_language_inference) dataset, making the `bart-large-mnli` model well-suited for text classification and logical reasoning tasks.

Similar models include the [BERT base model](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert), which was also pretrained on a large corpus of text and is commonly used as a starting point for fine-tuning on downstream tasks. Another related model is [TinyLlama-1.1B](https://aimodels.fyi/models/huggingFace/tinyllama-11b-chat-v10-tinyllama), a 1.1 billion parameter model based on the Llama architecture that has been finetuned for chatbot-style interactions.

## Model inputs and outputs

### Inputs
- **Text sequences**: The `bart-large-mnli` model takes in text sequences as input, which can be used for tasks like text classification, natural language inference, and more.

### Outputs
- **Logits**: The model outputs logits, which can be converted to probabilities and used to predict the most likely label or class for a given input text.
- **Embeddings**: The model can also be used to extract contextual word or sentence embeddings, which can be useful features for downstream machine learning tasks.

## Capabilities

The `bart-large-mnli` model is particularly well-suited for text classification and natural language inference tasks. For example, it can be used to classify whether a piece of text is positive, negative, or neutral in sentiment, or to determine if one sentence logically entails or contradicts another. 

The model has also been shown to be effective for zero-shot text classification, where the model is able to classify text into categories it wasn't explicitly trained on. This is done by framing the classification task as a natural language inference problem, where the input text is the "premise" and the candidate labels are converted into "hypotheses" that the model evaluates.

## What can I use it for?

The `bart-large-mnli` model can be a powerful starting point for a variety of natural language processing applications. Some potential use cases include:

- **Text classification**: Classifying text into predefined categories like sentiment, topic, or intent.
- **Natural language inference**: Determining logical relationships between sentences, such as entailment, contradiction, or neutrality.
- **Zero-shot classification**: Extending the model's classification capabilities to new domains or tasks without additional training.
- **Extracting text embeddings**: Using the model's contextual embeddings as features for downstream machine learning tasks.

## Things to try

One interesting aspect of the `bart-large-mnli` model is its ability to perform zero-shot text classification. To try this, you can experiment with constructing hypotheses for different candidate labels and seeing how the model evaluates the input text against those hypotheses.

Another interesting direction could be to explore using the model's text embeddings for tasks like text similarity, clustering, or retrieval. The contextual nature of the embeddings may capture nuanced semantic relationships that could be valuable for these kinds of applications.

Overall, the `bart-large-mnli` model provides a strong foundation for a variety of natural language processing tasks, and its flexible architecture and pretraining make it a versatile tool for researchers and developers to experiment with.

[](#bart-large-sized-model-fine-tuned-on-cnn-daily-mail)BART (large-sized model), fine-tuned on CNN Daily Mail
==============================================================================================================

BART model pre-trained on English language, and fine-tuned on [CNN Daily Mail](https://huggingface.co/datasets/cnn_dailymail). It was introduced in the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Lewis et al. and first released in \[this repository ([https://github.com/pytorch/fairseq/tree/master/examples/bart](https://github.com/pytorch/fairseq/tree/master/examples/bart)).

Disclaimer: The team releasing BART did not write a model card for this model so this model card has been written by the Hugging Face team.

[](#model-description)Model description
---------------------------------------

BART is a transformer encoder-encoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.

BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering). This particular checkpoint has been fine-tuned on CNN Daily Mail, a large collection of text-summary pairs.

[](#intended-uses--limitations)Intended uses & limitations
----------------------------------------------------------

You can use this model for text summarization.

### [](#how-to-use)How to use

Here is how to use this model with the [pipeline API](https://huggingface.co/transformers/main_classes/pipelines.html):

    from transformers import pipeline
    
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    
    ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
    A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
    Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
    In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
    Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
    2010 marriage license application, according to court documents.
    Prosecutors said the marriages were part of an immigration scam.
    On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
    After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
    Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
    All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
    Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
    Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
    The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
    Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
    Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
    If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
    """
    print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
    >>> [{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]
    

### [](#bibtex-entry-and-citation-info)BibTeX entry and citation info

    @article{DBLP:journals/corr/abs-1910-13461,
      author    = {Mike Lewis and
                   Yinhan Liu and
                   Naman Goyal and
                   Marjan Ghazvininejad and
                   Abdelrahman Mohamed and
                   Omer Levy and
                   Veselin Stoyanov and
                   Luke Zettlemoyer},
      title     = {{BART:} Denoising Sequence-to-Sequence Pre-training for Natural Language
                   Generation, Translation, and Comprehension},
      journal   = {CoRR},
      volume    = {abs/1910.13461},
      year      = {2019},
      url       = {http://arxiv.org/abs/1910.13461},
      eprinttype = {arXiv},
      eprint    = {1910.13461},
      timestamp = {Thu, 31 Oct 2019 14:02:26 +0100},
      biburl    = {https://dblp.org/rec/journals/corr/abs-1910-13461.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

## Model overview

The `bart-large-cnn` model is a large-sized BART model that has been fine-tuned on the CNN Daily Mail dataset. BART is a transformer encoder-decoder model that was introduced in the paper "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" by Lewis et al. The model was initially released in the [fairseq repository](https://github.com/pytorch/fairseq/tree/master/examples/bart). This particular checkpoint has been fine-tuned for text summarization tasks.

The `mbart-large-50` model is a multilingual sequence-to-sequence model that was introduced in the paper "Multilingual Translation with Extensible Multilingual Pretraining and Finetuning". It is a multilingual extension of the original mBART model, covering a total of 50 languages. The model was pre-trained using a "Multilingual Denoising Pretraining" objective, where the model is tasked with reconstructing the original text from a noised version.

The `roberta-large` model is a large-sized RoBERTa model, which is a transformer model pre-trained on a large corpus of English data using a masked language modeling (MLM) objective. RoBERTa was introduced in the paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach" and was first released in the [fairseq repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta).

The `bert-large-uncased` and `bert-base-uncased` models are large and base-sized BERT models, respectively, that were pre-trained on a large corpus of English data using a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective. BERT was introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" and first released in the [Google-research/BERT repository](https://github.com/google-research/bert).

The `bert-base-multilingual-uncased` model is a multilingual base-sized BERT model that was pre-trained on the 102 languages with the largest Wikipedias using the same MLM and NSP objectives as the English BERT models.

## Model inputs and outputs

### Inputs
- **Text**: The `bart-large-cnn` model takes text as input, which can be used for tasks like text summarization.

### Outputs
- **Text**: The `bart-large-cnn` model generates text as output, which can be used for tasks like summarizing long-form text.

## Capabilities

The `bart-large-cnn` model is particularly effective when fine-tuned for text generation tasks, such as summarization. It can take in a long-form text and generate a concise summary. The model's bidirectional encoder and autoregressive decoder allow it to capture both the context of the full text and generate fluent, coherent summaries.

## What can I use it for?

You can use the `bart-large-cnn` model for text summarization tasks, such as summarizing news articles, academic papers, or other long-form text. By fine-tuning the model on your own dataset, you can create a customized summarization system tailored to your domain or use case.

## Things to try

Try fine-tuning the `bart-large-cnn` model on your own text summarization dataset to see how it performs on your specific use case. You can also experiment with different hyperparameters, such as the learning rate or batch size, to optimize the model's performance. Additionally, you could try combining the `bart-large-cnn` model with other NLP techniques, such as extractive summarization or topic modeling, to create a more sophisticated summarization system.

[](#detr-end-to-end-object-detection-model-with-resnet-50-backbone)DETR (End-to-End Object Detection) model with ResNet-50 backbone
===================================================================================================================================

DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Carion et al. and first released in [this repository](https://github.com/facebookresearch/detr).

Disclaimer: The team releasing DETR did not write a model card for this model so this model card has been written by the Hugging Face team.

[](#model-description)Model description
---------------------------------------

The DETR model is an encoder-decoder transformer with a convolutional backbone. Two heads are added on top of the decoder outputs in order to perform object detection: a linear layer for the class labels and a MLP (multi-layer perceptron) for the bounding boxes. The model uses so-called object queries to detect objects in an image. Each object query looks for a particular object in the image. For COCO, the number of object queries is set to 100.

The model is trained using a "bipartite matching loss": one compares the predicted classes + bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N (so if an image only contains 4 objects, 96 annotations will just have a "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between each of the N queries and each of the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU loss (for the bounding boxes) are used to optimize the parameters of the model.

[![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/detr_architecture.png)](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/detr_architecture.png)

[](#intended-uses--limitations)Intended uses & limitations
----------------------------------------------------------

You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=facebook/detr) to look for all available DETR models.

### [](#how-to-use)How to use

Here is how to use this model:

    from transformers import DetrImageProcessor, DetrForObjectDetection
    import torch
    from PIL import Image
    import requests
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    # you can specify the revision tag if you don't want the timm dependency
    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50", revision="no_timm")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50", revision="no_timm")
    
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    
    # convert outputs (bounding boxes and class logits) to COCO API
    # let's only keep detections with score > 0.9
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
    
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        box = [round(i, 2) for i in box.tolist()]
        print(
                f"Detected {model.config.id2label[label.item()]} with confidence "
                f"{round(score.item(), 3)} at location {box}"
        )
    

This should output:

    Detected remote with confidence 0.998 at location [40.16, 70.81, 175.55, 117.98]
    Detected remote with confidence 0.996 at location [333.24, 72.55, 368.33, 187.66]
    Detected couch with confidence 0.995 at location [-0.02, 1.15, 639.73, 473.76]
    Detected cat with confidence 0.999 at location [13.24, 52.05, 314.02, 470.93]
    Detected cat with confidence 0.999 at location [345.4, 23.85, 640.37, 368.72]
    

Currently, both the feature extractor and model support PyTorch.

[](#training-data)Training data
-------------------------------

The DETR model was trained on [COCO 2017 object detection](https://cocodataset.org/#download), a dataset consisting of 118k/5k annotated images for training/validation respectively.

[](#training-procedure)Training procedure
-----------------------------------------

### [](#preprocessing)Preprocessing

The exact details of preprocessing of images during training/validation can be found [here](https://github.com/google-research/vision_transformer/blob/master/vit_jax/input_pipeline.py).

Images are resized/rescaled such that the shortest side is at least 800 pixels and the largest side at most 1333 pixels, and normalized across the RGB channels with the ImageNet mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225).

### [](#training)Training

The model was trained for 300 epochs on 16 V100 GPUs. This takes 3 days, with 4 images per GPU (hence a total batch size of 64).

[](#evaluation-results)Evaluation results
-----------------------------------------

This model achieves an AP (average precision) of **42.0** on COCO 2017 validation. For more details regarding evaluation results, we refer to table 1 of the original paper.

### [](#bibtex-entry-and-citation-info)BibTeX entry and citation info

    @article{DBLP:journals/corr/abs-2005-12872,
      author    = {Nicolas Carion and
                   Francisco Massa and
                   Gabriel Synnaeve and
                   Nicolas Usunier and
                   Alexander Kirillov and
                   Sergey Zagoruyko},
      title     = {End-to-End Object Detection with Transformers},
      journal   = {CoRR},
      volume    = {abs/2005.12872},
      year      = {2020},
      url       = {https://arxiv.org/abs/2005.12872},
      archivePrefix = {arXiv},
      eprint    = {2005.12872},
      timestamp = {Thu, 28 May 2020 17:38:09 +0200},
      biburl    = {https://dblp.org/rec/journals/corr/abs-2005-12872.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

## Model overview

The `detr-resnet-50` model is an End-to-End Object Detection (DETR) model with a ResNet-50 backbone. It was developed by the Facebook research team and introduced in the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872). The model is trained end-to-end on the COCO 2017 object detection dataset, which contains 118k annotated images.

The DETR model uses a transformer encoder-decoder architecture with a convolutional backbone to perform object detection. It takes an image as input and outputs a set of detected objects, including their class labels and bounding box coordinates. The model uses "object queries" to detect objects, where each query looks for a particular object in the image. For COCO, the number of object queries is set to 100.

Similar models include the [detr-resnet-50-panoptic](https://aimodels.fyi/models/huggingFace/detr-resnet-50-panoptic-facebook) model, which is trained for panoptic segmentation, and the [detr-resnet-101](https://aimodels.fyi/models/huggingFace/detr-resnet-101-facebook) model, which uses a larger ResNet-101 backbone.

## Model inputs and outputs

### Inputs
- **Images**: The model takes in an image as input, which is resized and normalized before being processed.

### Outputs
- **Object detections**: The model outputs a set of detected objects, including their class labels and bounding box coordinates.

## Capabilities

The `detr-resnet-50` model can be used for object detection in images. It is able to identify and localize a variety of common objects, such as people, vehicles, animals, and household items. The model achieves strong performance on the COCO 2017 dataset, with an average precision (AP) of 38.8.

## What can I use it for?

You can use the `detr-resnet-50` model for a variety of computer vision applications that involve object detection, such as:

- **Autonomous vehicles**: Detect and track objects like pedestrians, other vehicles, and obstacles to aid in navigation and collision avoidance.
- **Surveillance and security**: Identify and localize people, vehicles, and other objects of interest in security camera footage.
- **Retail and logistics**: Detect and count items in warehouses or on store shelves to improve inventory management.
- **Robotics**: Enable robots to perceive and interact with objects in their environment.

## Things to try

One interesting aspect of the DETR model is its use of "object queries" to detect objects. You could experiment with varying the number of object queries or using different types of object queries to see how it affects the model's performance and capabilities. Additionally, you could try fine-tuning the model on a specific domain or dataset to see if it can achieve even better results for your particular use case.

[](#seamlessm4t-v2)SeamlessM4T v2
=================================

**SeamlessM4T** is our foundational all-in-one **M**assively **M**ultilingual and **M**ultimodal **M**achine **T**ranslation model delivering high-quality translation for speech and text in nearly 100 languages.

SeamlessM4T models support the tasks of:

*   Speech-to-speech translation (S2ST)
*   Speech-to-text translation (S2TT)
*   Text-to-speech translation (T2ST)
*   Text-to-text translation (T2TT)
*   Automatic speech recognition (ASR).

SeamlessM4T models support:

*    101 languages for speech input.
*    96 Languages for text input/output.
*    35 languages for speech output.

 We are releasing SeamlessM4T v2, an updated version with our novel _UnitY2_ architecture. This new model improves over SeamlessM4T v1 in quality as well as inference speed in speech generation tasks.

The v2 version of SeamlessM4T is a multitask adaptation of our novel _UnitY2_ architecture. _Unity2_ with its hierarchical character-to-unit upsampling and non-autoregressive text-to-unit decoding considerably improves over SeamlessM4T v1 in quality and inference speed.

**SeamlessM4T v2 is also supported by  Transformers, more on it [in the dedicated section below](#transformers-usage).**

[![SeamlessM4T architectures](/facebook/seamless-m4t-v2-large/resolve/main/seamlessm4t_arch.svg)](/facebook/seamless-m4t-v2-large/blob/main/seamlessm4t_arch.svg)

[](#seamlessm4t--models)SeamlessM4T models
------------------------------------------

Model Name

#params

checkpoint

metrics

[SeamlessM4T-Large v2](https://huggingface.co/facebook/seamless-m4t-v2-large)

2.3B

[checkpoint](https://huggingface.co/facebook/seamless-m4t-v2-large/blob/main/seamlessM4T_v2_large.pt)

[metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large_v2.zip)

[SeamlessM4T-Large (v1)](https://huggingface.co/facebook/seamless-m4t-large)

2.3B

[checkpoint](https://huggingface.co/facebook/seamless-m4t-large/blob/main/multitask_unity_large.pt)

[metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large.zip)

[SeamlessM4T-Medium (v1)](https://huggingface.co/facebook/seamless-m4t-medium)

1.2B

[checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/blob/main/multitask_unity_medium.pt)

[metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_medium.zip)

We provide the extensive evaluation results of seamlessM4T-Large and SeamlessM4T-Medium reported in the paper (as averages) in the `metrics` files above.

The evaluation data ids for FLEURS, CoVoST2 and CVSS-C can be found [here](https://dl.fbaipublicfiles.com/seamless/metrics/evaluation_data_ids.zip)

[](#evaluating-seamlessm4t-models)Evaluating SeamlessM4T models
---------------------------------------------------------------

To reproduce our results or to evaluate using the same metrics over your own test sets, please check out the [Evaluation README here](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/m4t/evaluate).

[](#finetuning-seamlessm4t-models)Finetuning SeamlessM4T models
---------------------------------------------------------------

Please check out the [Finetuning README here](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/m4t/finetune).

[](#transformers-usage)Transformers usage
-----------------------------------------

SeamlessM4T is available in the  Transformers library, requiring minimal dependencies. Steps to get started:

1.  First install the  [Transformers library](https://github.com/huggingface/transformers) from main and [sentencepiece](https://github.com/google/sentencepiece):

    pip install git+https://github.com/huggingface/transformers.git sentencepiece
    

2.  Run the following Python code to generate speech samples. Here the target language is Russian:

    from transformers import AutoProcessor, SeamlessM4Tv2Model
    import torchaudio
    
    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
    
    # from text
    text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
    audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
    
    # from audio
    audio, orig_freq =  torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
    audio =  torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
    audio_inputs = processor(audios=audio, return_tensors="pt")
    audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
    

3.  Listen to the audio samples either in an ipynb notebook:

    from IPython.display import Audio
    
    sample_rate = model.config.sampling_rate
    Audio(audio_array_from_text, rate=sample_rate)
    # Audio(audio_array_from_audio, rate=sample_rate)
    

Or save them as a `.wav` file using a third-party library, e.g. `scipy`:

    import scipy
    
    sample_rate = model.config.sampling_rate
    scipy.io.wavfile.write("out_from_text.wav", rate=sample_rate, data=audio_array_from_text)
    # scipy.io.wavfile.write("out_from_audio.wav", rate=sample_rate, data=audio_array_from_audio)
    

For more details on using the SeamlessM4T model for inference using the  Transformers library, refer to the **[SeamlessM4T v2 docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2)** or to this **hands-on [Google Colab](https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/v2_seamless_m4t_hugging_face.ipynb).**

[](#supported-languages)Supported Languages:
--------------------------------------------

Listed below, are the languages supported by SeamlessM4T-large (v1/v2). The `source` column specifies whether a language is supported as source speech (`Sp`) and/or source text (`Tx`). The `target` column specifies whether a language is supported as target speech (`Sp`) and/or target text (`Tx`).

code

language

script

Source

Target

afr

Afrikaans

Latn

Sp, Tx

Tx

amh

Amharic

Ethi

Sp, Tx

Tx

arb

Modern Standard Arabic

Arab

Sp, Tx

Sp, Tx

ary

Moroccan Arabic

Arab

Sp, Tx

Tx

arz

Egyptian Arabic

Arab

Sp, Tx

Tx

asm

Assamese

Beng

Sp, Tx

Tx

ast

Asturian

Latn

Sp

\--

azj

North Azerbaijani

Latn

Sp, Tx

Tx

bel

Belarusian

Cyrl

Sp, Tx

Tx

ben

Bengali

Beng

Sp, Tx

Sp, Tx

bos

Bosnian

Latn

Sp, Tx

Tx

bul

Bulgarian

Cyrl

Sp, Tx

Tx

cat

Catalan

Latn

Sp, Tx

Sp, Tx

ceb

Cebuano

Latn

Sp, Tx

Tx

ces

Czech

Latn

Sp, Tx

Sp, Tx

ckb

Central Kurdish

Arab

Sp, Tx

Tx

cmn

Mandarin Chinese

Hans

Sp, Tx

Sp, Tx

cmn\_Hant

Mandarin Chinese

Hant

Sp, Tx

Sp, Tx

cym

Welsh

Latn

Sp, Tx

Sp, Tx

dan

Danish

Latn

Sp, Tx

Sp, Tx

deu

German

Latn

Sp, Tx

Sp, Tx

ell

Greek

Grek

Sp, Tx

Tx

eng

English

Latn

Sp, Tx

Sp, Tx

est

Estonian

Latn

Sp, Tx

Sp, Tx

eus

Basque

Latn

Sp, Tx

Tx

fin

Finnish

Latn

Sp, Tx

Sp, Tx

fra

French

Latn

Sp, Tx

Sp, Tx

fuv

Nigerian Fulfulde

Latn

Sp, Tx

Tx

gaz

West Central Oromo

Latn

Sp, Tx

Tx

gle

Irish

Latn

Sp, Tx

Tx

glg

Galician

Latn

Sp, Tx

Tx

guj

Gujarati

Gujr

Sp, Tx

Tx

heb

Hebrew

Hebr

Sp, Tx

Tx

hin

Hindi

Deva

Sp, Tx

Sp, Tx

hrv

Croatian

Latn

Sp, Tx

Tx

hun

Hungarian

Latn

Sp, Tx

Tx

hye

Armenian

Armn

Sp, Tx

Tx

ibo

Igbo

Latn

Sp, Tx

Tx

ind

Indonesian

Latn

Sp, Tx

Sp, Tx

isl

Icelandic

Latn

Sp, Tx

Tx

ita

Italian

Latn

Sp, Tx

Sp, Tx

jav

Javanese

Latn

Sp, Tx

Tx

jpn

Japanese

Jpan

Sp, Tx

Sp, Tx

kam

Kamba

Latn

Sp

\--

kan

Kannada

Knda

Sp, Tx

Tx

kat

Georgian

Geor

Sp, Tx

Tx

kaz

Kazakh

Cyrl

Sp, Tx

Tx

kea

Kabuverdianu

Latn

Sp

\--

khk

Halh Mongolian

Cyrl

Sp, Tx

Tx

khm

Khmer

Khmr

Sp, Tx

Tx

kir

Kyrgyz

Cyrl

Sp, Tx

Tx

kor

Korean

Kore

Sp, Tx

Sp, Tx

lao

Lao

Laoo

Sp, Tx

Tx

lit

Lithuanian

Latn

Sp, Tx

Tx

ltz

Luxembourgish

Latn

Sp

\--

lug

Ganda

Latn

Sp, Tx

Tx

luo

Luo

Latn

Sp, Tx

Tx

lvs

Standard Latvian

Latn

Sp, Tx

Tx

mai

Maithili

Deva

Sp, Tx

Tx

mal

Malayalam

Mlym

Sp, Tx

Tx

mar

Marathi

Deva

Sp, Tx

Tx

mkd

Macedonian

Cyrl

Sp, Tx

Tx

mlt

Maltese

Latn

Sp, Tx

Sp, Tx

mni

Meitei

Beng

Sp, Tx

Tx

mya

Burmese

Mymr

Sp, Tx

Tx

nld

Dutch

Latn

Sp, Tx

Sp, Tx

nno

Norwegian Nynorsk

Latn

Sp, Tx

Tx

nob

Norwegian Bokml

Latn

Sp, Tx

Tx

npi

Nepali

Deva

Sp, Tx

Tx

nya

Nyanja

Latn

Sp, Tx

Tx

oci

Occitan

Latn

Sp

\--

ory

Odia

Orya

Sp, Tx

Tx

pan

Punjabi

Guru

Sp, Tx

Tx

pbt

Southern Pashto

Arab

Sp, Tx

Tx

pes

Western Persian

Arab

Sp, Tx

Sp, Tx

pol

Polish

Latn

Sp, Tx

Sp, Tx

por

Portuguese

Latn

Sp, Tx

Sp, Tx

ron

Romanian

Latn

Sp, Tx

Sp, Tx

rus

Russian

Cyrl

Sp, Tx

Sp, Tx

slk

Slovak

Latn

Sp, Tx

Sp, Tx

slv

Slovenian

Latn

Sp, Tx

Tx

sna

Shona

Latn

Sp, Tx

Tx

snd

Sindhi

Arab

Sp, Tx

Tx

som

Somali

Latn

Sp, Tx

Tx

spa

Spanish

Latn

Sp, Tx

Sp, Tx

srp

Serbian

Cyrl

Sp, Tx

Tx

swe

Swedish

Latn

Sp, Tx

Sp, Tx

swh

Swahili

Latn

Sp, Tx

Sp, Tx

tam

Tamil

Taml

Sp, Tx

Tx

tel

Telugu

Telu

Sp, Tx

Sp, Tx

tgk

Tajik

Cyrl

Sp, Tx

Tx

tgl

Tagalog

Latn

Sp, Tx

Sp, Tx

tha

Thai

Thai

Sp, Tx

Sp, Tx

tur

Turkish

Latn

Sp, Tx

Sp, Tx

ukr

Ukrainian

Cyrl

Sp, Tx

Sp, Tx

urd

Urdu

Arab

Sp, Tx

Sp, Tx

uzn

Northern Uzbek

Latn

Sp, Tx

Sp, Tx

vie

Vietnamese

Latn

Sp, Tx

Sp, Tx

xho

Xhosa

Latn

Sp

\--

yor

Yoruba

Latn

Sp, Tx

Tx

yue

Cantonese

Hant

Sp, Tx

Tx

zlm

Colloquial Malay

Latn

Sp

\--

zsm

Standard Malay

Latn

Tx

Tx

zul

Zulu

Latn

Sp, Tx

Tx

Note that seamlessM4T-medium supports 200 languages in the text modality, and is based on NLLB-200 (see full list in [asset card](https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/cards/unity_nllb-200.yaml))

[](#citation)Citation
---------------------

For SeamlessM4T v2, please cite :

    @inproceedings{seamless2023,
       title="Seamless: Multilingual Expressive and Streaming Speech Translation",
       author="{Seamless Communication}, Lo{\"i}c Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-juss{\`a}, Maha Elbayad, Hongyu Gong, Francisco Guzm{\'a}n, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson",
      journal={ArXiv},
      year={2023}
    }

## Model overview

`seamless-m4t-v2-large` is a foundational all-in-one Massively Multilingual and Multimodal Machine Translation (M4T) model developed by [Facebook](https://aimodels.fyi/creators/huggingFace/facebook). It delivers high-quality translation for speech and text in nearly 100 languages, supporting tasks such as speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition. 

The v2 version of SeamlessM4T uses a novel "UnitY2" architecture, which improves over the previous v1 model in both quality and inference speed for speech generation tasks. SeamlessM4T v2 is also supported by Transformers, allowing for easy integration into various natural language processing pipelines.

## Model inputs and outputs

### Inputs
- **Speech input**: The model supports 101 languages for speech input.
- **Text input**: The model supports 96 languages for text input.

### Outputs
- **Speech output**: The model supports 35 languages for speech output.
- **Text output**: The model supports 96 languages for text output.

## Capabilities

The SeamlessM4T v2-large model demonstrates strong performance across a range of multilingual and multimodal translation tasks, including speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation. It can also handle automatic speech recognition in multiple languages.

## What can I use it for?

The SeamlessM4T v2-large model is well-suited for building multilingual and multimodal translation applications, such as real-time translation for video conferencing, language learning tools, and international customer support services. Its broad language support and strong performance make it a valuable resource for researchers and developers working on cross-language communication.

## Things to try

One interesting aspect of the SeamlessM4T v2 model is its support for both speech and text input/output. This allows for building applications that can seamlessly switch between speech and text, enabling a more natural and fluid user experience. Developers could experiment with building prototypes that allow users to initiate a conversation in one modality and receive a response in another, or that automatically detect the user's preferred input method and adapt accordingly.

Another area to explore is the model's ability to translate between a wide range of languages. Developers could test the model's performance on less commonly translated language pairs, or investigate how it handles regional dialects and accents. This could lead to insights on the model's strengths and limitations, and inform the development of more robust multilingual systems.

[](#seamlessm4t-large-v1)SeamlessM4T Large (v1)
===============================================

SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T covers:

*    101 languages for speech input
*    96 Languages for text input/output
*    35 languages for speech output.

* * *

** SeamlessM4T v2, an improved version of this version with a novel architecture, has been released [here](https://huggingface.co/facebook/seamless-m4t-v2-large).**

**This new model improves over SeamlessM4T v1 in quality as well as inference speed in speech generation tasks.**

**SeamlessM4T v2 is also supported by  Transformers, more on it [in the model card of this new version](https://huggingface.co/facebook/seamless-m4t-v2-large#transformers-usage) or directly in [ Transformers docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2).**

* * *

This is the "large-v1" variant of SeamlessM4T, which enables multiple tasks without relying on multiple separate models:

*   Speech-to-speech translation (S2ST)
*   Speech-to-text translation (S2TT)
*   Text-to-speech translation (T2ST)
*   Text-to-text translation (T2TT)
*   Automatic speech recognition (ASR)

[](#seamlessm4t-models)SeamlessM4T models
-----------------------------------------

Model Name

#params

checkpoint

metrics

[SeamlessM4T-Large v2](https://huggingface.co/facebook/seamless-m4t-v2-large)

2.3B

[checkpoint](https://huggingface.co/facebook/seamless-m4t-v2-large/blob/main/seamlessM4T_v2_large.pt)

[metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large_v2.zip)

[SeamlessM4T-Large (v1)](https://huggingface.co/facebook/seamless-m4t-large)

2.3B

[checkpoint](https://huggingface.co/facebook/seamless-m4t-large/blob/main/multitask_unity_large.pt)

[metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large.zip)

[SeamlessM4T-Medium (v1)](https://huggingface.co/facebook/seamless-m4t-medium)

1.2B

[checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/blob/main/multitask_unity_medium.pt)

[metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_medium.zip)

We provide extensive evaluation results of SeamlessM4T models in the [SeamlessM4T](https://arxiv.org/abs/2308.11596) and [Seamless](https://arxiv.org/abs/2312.05187) papers (as averages) in the `metrics` files above.

[](#-transformers-usage) Transformers Usage
-----------------------------------------------

First, load the processor and a checkpoint of the model:

    import torchaudio
    from transformers import AutoProcessor, SeamlessM4TModel
    processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-large")
    model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-large")
    

You can seamlessly use this model on text or on audio, to generated either translated text or translated audio.

Here is how to use the processor to process text and audio:

    # Read an audio file and resample to 16kHz:
    audio, orig_freq =  torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
    audio =  torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
    audio_inputs = processor(audios=audio, return_tensors="pt")
    # Process some input text as well:
    text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
    

### [](#speech)Speech

Generate speech in Russian from either text (T2ST) or speech input (S2ST):

    audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
    audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
    

### [](#text)Text

Similarly, you can generate translated text from audio files (S2TT) or from text (T2TT, conventionally MT) with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate).

    # from audio
    output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
    translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
    # from text
    output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
    translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
    

[](#seamless_communication)Seamless\_communication
--------------------------------------------------

You can also use the seamlessM4T models using the [`seamless_communication` library](https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/README.md) with either CLI:

    m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_large
    

or a `Translator` API:

    import torch
    from seamless_communication.inference import Translator
    # Initialize a Translator object with a multitask model, vocoder on the GPU.
    translator = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
    text_output, speech_output = translator.predict(
        input=<path_to_input_audio>,
        task_str="S2ST",
        tgt_lang=<tgt_lang>,
        text_generation_opts=text_generation_opts,
        unit_generation_opts=unit_generation_opts
    )
    

[](#citation)Citation
---------------------

If you plan to use SeamlessM4T in your work or any models/datasets/artifacts published in SeamlessM4T, please cite:

    @article{seamlessm4t2023,
      title={"SeamlessM4TMassively Multilingual \& Multimodal Machine Translation"},
      author={{Seamless Communication}, Lo\"{i}c Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye,  Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-juss\`{a} \footnotemark[3], Onur \,{C}elebi,Maha Elbayad,Cynthia Gao, Francisco Guzm\'an, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang},
      journal={ArXiv},
      year={2023}
    }
    

[](#license)License
-------------------

The Seamless Communication code and weights are CC-BY-NC 4.0 licensed.

## Model overview

The `seamless-m4t-large` model is a large version of the SeamlessM4T series of models designed by Facebook to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. The model is a multitask adaptation that supports multiple translation tasks including speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation, as well as automatic speech recognition. Compared to the [SeamlessM4T-Large v2](https://aimodels.fyi/models/huggingFace/seamless-m4t-v2-large-facebook) model, the `seamless-m4t-large` model has the same architecture but was trained on a smaller dataset.

## Model inputs and outputs

The `seamless-m4t-large` model takes either speech or text as input and can produce either speech or text as output. It supports 101 languages for speech input, 96 languages for text input/output, and 35 languages for speech output.

### Inputs
- **Speech audio**: The model can take speech audio as input, which it can then translate to text in the target language.
- **Text**: The model can take text as input, which it can then translate to speech or text in the target language.

### Outputs
- **Translated speech**: The model can output translated speech in the target language.
- **Translated text**: The model can output translated text in the target language.

## Capabilities

The `seamless-m4t-large` model is capable of performing high-quality translation between a wide range of languages, both for speech and text. It can handle multiple translation tasks, including speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation. The model also supports automatic speech recognition, allowing it to transcribe speech to text.

## What can I use it for?

The `seamless-m4t-large` model could be used to build applications that enable effortless communication between people from different linguistic backgrounds. For example, it could be used to develop multilingual chatbots, video conferencing tools, or language learning apps. The model's support for both speech and text translation makes it suitable for a wide range of use cases.

## Things to try

One interesting thing to try with the `seamless-m4t-large` model would be to experiment with its ability to handle different translation tasks. For example, you could try using the model to translate a piece of text from one language to another, and then use the translated text as input to generate speech in the target language. This could be useful for building applications that need to seamlessly transition between text and speech translation.

Another interesting experiment would be to fine-tune the model on a specific domain or task, such as medical or legal translation, to see if it can improve its performance in those areas. The [provided resources on finetuning](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/m4t/finetune) could be a good starting point for exploring this.

[](#nllb-200)NLLB-200
=====================

This is the model card of NLLB-200's distilled 600M variant.

Here are the [metrics](https://tinyurl.com/nllb200densedst600mmetrics) for that particular checkpoint.

*   Information about training algorithms, parameters, fairness constraints or other applied approaches, and features. The exact training algorithm, data and the strategies to handle data imbalances for high and low resource languages that were used to train NLLB-200 is described in the paper.
*   Paper or other resource for more information NLLB Team et al, No Language Left Behind: Scaling Human-Centered Machine Translation, Arxiv, 2022
*   License: CC-BY-NC
*   Where to send questions or comments about the model: [https://github.com/facebookresearch/fairseq/issues](https://github.com/facebookresearch/fairseq/issues)

[](#intended-use)Intended Use
-----------------------------

*   Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, - especially for low-resource languages. It allows for single sentence translation among 200 languages. Information on how to - use the model can be found in Fairseq code repository along with the training code and references to evaluation and training data.
*   Primary intended users: Primary users are researchers and machine translation research community.
*   Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general domain text data and is not intended to be used with domain specific texts, such as medical domain or legal domain. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, therefore translating longer sequences might result in quality degradation. NLLB-200 translations can not be used as certified translations.

[](#metrics)Metrics
-------------------

 Model performance measures: NLLB-200 model was evaluated using BLEU, spBLEU, and chrF++ metrics widely adopted by machine translation community. Additionally, we performed human evaluation with the XSTS protocol and measured the toxicity of the generated translations.

[](#evaluation-data)Evaluation Data
-----------------------------------

*   Datasets: Flores-200 dataset is described in Section 4
*   Motivation: We used Flores-200 as it provides full evaluation coverage of the languages in NLLB-200
*   Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece. The SentencePiece model is released along with NLLB-200.

[](#training-data)Training Data
-------------------------------

 We used parallel multilingual data from a variety of sources to train the model. We provide detailed report on data selection and construction process in Section 5 in the paper. We also used monolingual data constructed from Common Crawl. We provide more details in Section 5.2.

[](#ethical-considerations)Ethical Considerations
-------------------------------------------------

 In this work, we took a reflexive approach in technological development to ensure that we prioritize human users and minimize risks that could be transferred to them. While we reflect on our ethical considerations throughout the article, here are some additional points to highlight. For one, many languages chosen for this study are low-resource languages, with a heavy emphasis on African languages. While quality translation could improve education and information access in many in these communities, such an access could also make groups with lower levels of digital literacy more vulnerable to misinformation or online scams. The latter scenarios could arise if bad actors misappropriate our work for nefarious activities, which we conceive as an example of unintended use. Regarding data acquisition, the training data used for model development were mined from various publicly available sources on the web. Although we invested heavily in data cleaning, personally identifiable information may not be entirely eliminated. Finally, although we did our best to optimize for translation quality, mistranslations produced by the model could remain. Although the odds are low, this could have adverse impact on those who rely on these translations to make important decisions (particularly when related to health and safety).

[](#caveats-and-recommendations)Caveats and Recommendations
-----------------------------------------------------------

 Our model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. In addition, the supported languages may have variations that our model is not capturing. Users should make appropriate assessments.

[](#carbon-footprint-details)Carbon Footprint Details
-----------------------------------------------------

 The carbon dioxide (CO2e) estimate is reported in Section 8.8.

## Model overview
`nllb-200-distilled-600M` is a machine translation model developed by Facebook that can translate between 200 languages. It is a distilled version of the larger `nllb-200` model, with 600 million parameters. Like its larger counterpart, `nllb-200-distilled-600M` was trained on a diverse dataset spanning many low-resource languages, with the goal of providing high-quality translation capabilities across a broad range of languages. This model outperforms previous open-source translation models, especially for low-resource language pairs.

The `nllb-200-distilled-600M` model is part of the NLLB family of models, which also includes the larger `nllb-200-3.3B` variant. Both models were developed by the Facebook AI Research team and aim to push the boundaries of machine translation, particularly for underserved languages. The distilled 600M version offers a more compact and efficient model for applications where smaller size is important.

## Model inputs and outputs

### Inputs
- **Text**: The `nllb-200-distilled-600M` model takes single sentences as input and translates them between 200 supported languages.

### Outputs
- **Translated text**: The output of the model is the translated text in the target language. The model supports translation in both directions between any of the 200 languages.

## Capabilities
`nllb-200-distilled-600M` is a powerful multilingual translation model that can handle a wide variety of languages, including low-resource ones. It has been shown to outperform previous open-source models, especially on language pairs involving African and other underrepresented languages. The model can be used to enable communication and information access for communities that have historically had limited options for high-quality machine translation.

## What can I use it for?
The primary intended use of `nllb-200-distilled-600M` is for research in machine translation, with a focus on low-resource languages. Researchers can use the model to explore techniques for improving translation quality, especially for language pairs that have been underserved by previous translation systems.

While the model is not intended for production deployment, it could potentially be fine-tuned or adapted for certain real-world applications that require multilingual translation, such as supporting communication in international organizations, facilitating access to information for speakers of minority languages, or aiding in the localization of content and software. However, users should carefully evaluate the model's performance and limitations before deploying it in any mission-critical or high-stakes scenarios.

## Things to try
One interesting aspect of `nllb-200-distilled-600M` is its ability to translate between a wide range of language pairs, including many low-resource languages. Researchers could experiment with using the model as a starting point for fine-tuning on specific domains or tasks, to see if the model's broad capabilities can be leveraged to improve translation quality in targeted applications.

Additionally, the model's performance could be analyzed in depth to better understand its strengths and weaknesses across different language pairs and domains. This could inform future research directions and model development efforts to further advance the state of the art in multilingual machine translation.

[](#model-description)Model description
---------------------------------------

*   Paper: [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637)
*   [Original PARLAI Code](https://parl.ai/projects/recipes/)

### [](#abstract)Abstract

Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, both asking and answering questions, and displaying knowledge, empathy and personality appropriately, depending on the situation. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter neural models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.

## Model overview

The `blenderbot-400M-distill` is an open-domain chatbot model developed by Facebook. It is a variant of the Blenderbot series, which aims to build engaging and knowledgeable conversational AI. The model is built on a 400M parameter neural network and trained using the "Recipes for building an open-domain chatbot" approach. This method focuses on developing models with a range of conversational skills, such as providing engaging talking points, asking and answering questions, and displaying empathy and personality. The model is smaller than some other Blenderbot variants, such as [blenderbot-3B](https://aimodels.fyi/models/huggingFace/blenderbot-3b-facebook), but it maintains strong performance in multi-turn dialogue according to human evaluations.

## Model inputs and outputs

The `blenderbot-400M-distill` model is a text-to-text transformer that takes conversational messages as input and generates relevant responses. It can engage in open-ended dialogue, answering questions, and providing information on a wide range of topics.

### Inputs
- Text-based conversational messages from a user

### Outputs
- Relevant and engaging text-based responses to continue the conversation

## Capabilities

The `blenderbot-400M-distill` model demonstrates strong capabilities in open-domain conversation. It can fluently discuss a variety of topics, ask and answer questions, and display personality and empathy. The model is able to maintain coherence and flow in multi-turn dialogues, making it suitable for use in chatbot applications.

## What can I use it for?

The `blenderbot-400M-distill` model can be used to build conversational AI assistants for a variety of applications, such as customer service, personal assistance, and educational purposes. Its ability to engage in natural dialogue while displaying knowledge and personality makes it well-suited for creating engaging user experiences. Additionally, the model's smaller size compared to larger Blenderbot variants may make it more accessible for deployment on resource-constrained systems.

## Things to try

One interesting aspect of the `blenderbot-400M-distill` model is its potential to be combined with other AI technologies to create more advanced conversational systems. For example, integrating the model with knowledge bases or task-specific modules could enhance its capabilities in areas like information retrieval, task completion, and contextual understanding. Experimenting with different prompting techniques and fine-tuning approaches may also uncover novel use cases for the model.

[](#musicgen---large---33b)MusicGen - Large - 3.3B
==================================================

MusicGen is a text-to-music model capable of genreating high-quality music samples conditioned on text descriptions or audio prompts. It is a single stage auto-regressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods, like MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, we show we can predict them in parallel, thus having only 50 auto-regressive steps per second of audio.

MusicGen was published in [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by _Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Dfossez_.

Four checkpoints are released:

*   [small](https://huggingface.co/facebook/musicgen-small)
*   [medium](https://huggingface.co/facebook/musicgen-medium)
*   [**large** (this checkpoint)](https://huggingface.co/facebook/musicgen-large)
*   [melody](https://huggingface.co/facebook/musicgen-melody)

[](#example)Example
-------------------

Try out MusicGen yourself!

*   Audiocraft Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fxGqfg96RBUvGxZ1XXN07s3DthrKUl4-?usp=sharing)

*   Hugging Face Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/MusicGen.ipynb)

*   Hugging Face Demo:

[![Open in HuggingFace](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/facebook/MusicGen)

[](#-transformers-usage) Transformers Usage
-----------------------------------------------

You can run MusicGen locally with the  Transformers library from version 4.31.0 onwards.

1.  First install the  [Transformers library](https://github.com/huggingface/transformers) and scipy:

    pip install --upgrade pip
    pip install --upgrade transformers scipy
    

2.  Run inference via the `Text-to-Audio` (TTA) pipeline. You can infer the MusicGen model via the TTA pipeline in just a few lines of code!

    from transformers import pipeline
    import scipy
    
    synthesiser = pipeline("text-to-audio", "facebook/musicgen-large")
    
    music = synthesiser("lo-fi music with a soothing melody", forward_params={"do_sample": True})
    
    scipy.io.wavfile.write("musicgen_out.wav", rate=music["sampling_rate"], data=music["audio"])
    

3.  Run inference via the Transformers modelling code. You can use the processor + generate code to convert text into a mono 32 kHz audio waveform for more fine-grained control.

    from transformers import AutoProcessor, MusicgenForConditionalGeneration
    
    processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large")
    
    inputs = processor(
        text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
        padding=True,
        return_tensors="pt",
    )
    
    audio_values = model.generate(**inputs, max_new_tokens=256)
    

4.  Listen to the audio samples either in an ipynb notebook:

    from IPython.display import Audio
    
    sampling_rate = model.config.audio_encoder.sampling_rate
    Audio(audio_values[0].numpy(), rate=sampling_rate)
    

Or save them as a `.wav` file using a third-party library, e.g. `scipy`:

    import scipy
    
    sampling_rate = model.config.audio_encoder.sampling_rate
    scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
    

For more details on using the MusicGen model for inference using the  Transformers library, refer to the [MusicGen docs](https://huggingface.co/docs/transformers/model_doc/musicgen).

[](#audiocraft-usage)Audiocraft Usage
-------------------------------------

You can also run MusicGen locally through the original [Audiocraft library](/facebook/musicgen-large/blob/main/(https://github.com/facebookresearch/audiocraft):

1.  First install the [`audiocraft` library](https://github.com/facebookresearch/audiocraft)

    pip install git+https://github.com/facebookresearch/audiocraft.git
    

2.  Make sure to have [`ffmpeg`](https://ffmpeg.org/download.html) installed:

    apt get install ffmpeg
    

3.  Run the following Python code:

    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write
    
    model = MusicGen.get_pretrained("large")
    model.set_generation_params(duration=8)  # generate 8 seconds.
    
    descriptions = ["happy rock", "energetic EDM"]
    
    wav = model.generate(descriptions)  # generates 2 samples.
    
    for idx, one_wav in enumerate(wav):
        # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
        audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
    

[](#model-details)Model details
-------------------------------

**Organization developing the model:** The FAIR team of Meta AI.

**Model date:** MusicGen was trained between April 2023 and May 2023.

**Model version:** This is the version 1 of the model.

**Model type:** MusicGen consists of an EnCodec model for audio tokenization, an auto-regressive language model based on the transformer architecture for music modeling. The model comes in different sizes: 300M, 1.5B and 3.3B parameters ; and two variants: a model trained for text-to-music generation task and a model trained for melody-guided music generation.

**Paper or resources for more information:** More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284).

**Citation details:**

    @misc{copet2023simple,
          title={Simple and Controllable Music Generation}, 
          author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Dfossez},
          year={2023},
          eprint={2306.05284},
          archivePrefix={arXiv},
          primaryClass={cs.SD}
    }
    

**License:** Code is released under MIT, model weights are released under CC-BY-NC 4.0.

**Where to send questions or comments about the model:** Questions and comments about MusicGen can be sent via the [Github repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue.

[](#intended-use)Intended use
-----------------------------

**Primary intended use:** The primary use of MusicGen is research on AI-based music generation, including:

*   Research efforts, such as probing and better understanding the limitations of generative models to further improve the state of science
*   Generation of music guided by text or melody to understand current abilities of generative AI models by machine learning amateurs

**Primary intended users:** The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateur seeking to better understand those models.

**Out-of-scope use cases:** The model should not be used on downstream applications without further risk evaluation and mitigation. The model should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people. This includes generating music that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.

[](#metrics)Metrics
-------------------

**Models performance measures:** We used the following objective measure to evaluate the model on a standard music benchmark:

*   Frechet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish)
*   Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST)
*   CLAP Score between audio embedding and text embedding extracted from a pre-trained CLAP model

Additionally, we run qualitative studies with human participants, evaluating the performance of the model with the following axes:

*   Overall quality of the music samples;
*   Text relevance to the provided text input;
*   Adherence to the melody for melody-guided music generation.

More details on performance measures and human studies can be found in the paper.

**Decision thresholds:** Not applicable.

[](#evaluation-datasets)Evaluation datasets
-------------------------------------------

The model was evaluated on the [MusicCaps benchmark](https://www.kaggle.com/datasets/googleai/musiccaps) and on an in-domain held-out evaluation set, with no artist overlap with the training set.

[](#training-datasets)Training datasets
---------------------------------------

The model was trained on licensed data using the following sources: the [Meta Music Initiative Sound Collection](https://www.fb.com/sound), [Shutterstock music collection](https://www.shutterstock.com/music) and the [Pond5 music collection](https://www.pond5.com/). See the paper for more details about the training set and corresponding preprocessing.

[](#evaluation-results)Evaluation results
-----------------------------------------

Below are the objective metrics obtained on MusicCaps with the released model. Note that for the publicly released models, we had all the datasets go through a state-of-the-art music source separation method, namely using the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs), in order to keep only the instrumental part. This explains the difference in objective metrics with the models used in the paper.

Model

Frechet Audio Distance

KLD

Text Consistency

Chroma Cosine Similarity

facebook/musicgen-small

4.88

1.42

0.27

\-

facebook/musicgen-medium

5.14

1.38

0.28

\-

**facebook/musicgen-large**

5.48

1.37

0.28

\-

facebook/musicgen-melody

4.93

1.41

0.27

0.44

More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284), in the Results section.

[](#limitations-and-biases)Limitations and biases
-------------------------------------------------

**Data:** The data sources used to train the model are created by music professionals and covered by legal agreements with the right holders. The model is trained on 20K hours of data, we believe that scaling the model on larger datasets can further improve the performance of the model.

**Mitigations:** Vocals have been removed from the data source using corresponding tags, and then using a state-of-the-art music source separation method, namely using the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs).

**Limitations:**

*   The model is not able to generate realistic vocals.
*   The model has been trained with English descriptions and will not perform as well in other languages.
*   The model does not perform equally well for all music styles and cultures.
*   The model sometimes generates end of songs, collapsing to silence.
*   It is sometimes difficult to assess what types of text descriptions provide the best generations. Prompt engineering may be required to obtain satisfying results.

**Biases:** The source of data is potentially lacking diversity and all music cultures are not equally represented in the dataset. The model may not perform equally well on the wide variety of music genres that exists. The generated samples from the model will reflect the biases from the training data. Further work on this model should include methods for balanced and just representations of cultures, for example, by scaling the training data to be both diverse and inclusive.

**Risks and harms:** Biases and limitations of the model may lead to generation of samples that may be considered as biased, inappropriate or offensive. We believe that providing the code to reproduce the research and train new models will allow to broaden the application to new and more representative data.

**Use cases:** Users must be aware of the biases, limitations and risks of the model. MusicGen is a model developed for artificial intelligence research on controllable music generation. As such, it should not be used for downstream applications without further investigation and mitigation of risks.

## Model overview

`MusicGen-large` is a text-to-music model developed by [Facebook](https://aimodels.fyi/creators/huggingFace/facebook) that can generate high-quality music samples conditioned on text descriptions or audio prompts. Unlike existing methods like [MusicLM](https://aimodels.fyi/models/huggingFace/musicgen-meta), MusicGen-large does not require a self-supervised semantic representation and generates all 4 codebooks in one pass, predicting them in parallel. This allows for faster generation at 50 auto-regressive steps per second of audio. MusicGen-large is part of a family of MusicGen models released by Facebook, including smaller and melody-focused checkpoints.

## Model inputs and outputs

MusicGen-large is a text-to-music model, taking text descriptions or audio prompts as input and generating corresponding music samples as output. The model uses a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz, allowing it to generate all the audio information in parallel.

### Inputs
- **Text descriptions**: Natural language prompts that describe the desired music
- **Audio prompts**: Existing audio samples that the generated music should be conditioned on

### Outputs
- **Music samples**: High-quality 32kHz audio waveforms representing the generated music

## Capabilities

MusicGen-large can generate a wide variety of musical styles and genres based on text or audio prompts, demonstrating impressive quality and control. The model is able to capture complex musical structures and properties like melody, harmony, and rhythm in its outputs. By generating the audio in parallel, MusicGen-large can produce 50 seconds of music per second, making it efficient for applications.

## What can I use it for?

The primary use cases for MusicGen-large are in music production and creative applications. Developers and artists could leverage the model to rapidly generate music for things like video game soundtracks, podcast jingles, or backing tracks for songs. The ability to control the music through text prompts also enables novel music composition workflows.

## Things to try

One interesting thing to try with MusicGen-large is experimenting with the level of detail and specificity in the text prompts. See how changing the prompt from a broad genre descriptor to more detailed musical attributes affects the generated output. You could also try providing audio prompts and observe how the model blends the existing music with the text description.

[](#musicgen---small---300m)MusicGen - Small - 300M
===================================================

MusicGen is a text-to-music model capable of genreating high-quality music samples conditioned on text descriptions or audio prompts. It is a single stage auto-regressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods, like MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, we show we can predict them in parallel, thus having only 50 auto-regressive steps per second of audio.

MusicGen was published in [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by _Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Dfossez_.

Four checkpoints are released:

*   [**small** (this checkpoint)](https://huggingface.co/facebook/musicgen-small)
*   [medium](https://huggingface.co/facebook/musicgen-medium)
*   [large](https://huggingface.co/facebook/musicgen-large)
*   [melody](https://huggingface.co/facebook/musicgen-melody)

[](#example)Example
-------------------

Try out MusicGen yourself!

*   Audiocraft Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fxGqfg96RBUvGxZ1XXN07s3DthrKUl4-?usp=sharing)

*   Hugging Face Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/MusicGen.ipynb)

*   Hugging Face Demo:

[![Open in HuggingFace](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/facebook/MusicGen)

[](#-transformers-usage) Transformers Usage
-----------------------------------------------

You can run MusicGen locally with the  Transformers library from version 4.31.0 onwards.

1.  First install the  [Transformers library](https://github.com/huggingface/transformers) and scipy:

    pip install --upgrade pip
    pip install --upgrade transformers scipy
    

2.  Run inference via the `Text-to-Audio` (TTA) pipeline. You can infer the MusicGen model via the TTA pipeline in just a few lines of code!

    from transformers import pipeline
    import scipy
    
    synthesiser = pipeline("text-to-audio", "facebook/musicgen-small")
    
    music = synthesiser("lo-fi music with a soothing melody", forward_params={"do_sample": True})
    
    scipy.io.wavfile.write("musicgen_out.wav", rate=music["sampling_rate"], data=music["audio"])
    

3.  Run inference via the Transformers modelling code. You can use the processor + generate code to convert text into a mono 32 kHz audio waveform for more fine-grained control.

    from transformers import AutoProcessor, MusicgenForConditionalGeneration
    
    processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
    
    inputs = processor(
        text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
        padding=True,
        return_tensors="pt",
    )
    
    audio_values = model.generate(**inputs, max_new_tokens=256)
    

3.  Listen to the audio samples either in an ipynb notebook:

    from IPython.display import Audio
    
    sampling_rate = model.config.audio_encoder.sampling_rate
    Audio(audio_values[0].numpy(), rate=sampling_rate)
    

Or save them as a `.wav` file using a third-party library, e.g. `scipy`:

    import scipy
    
    sampling_rate = model.config.audio_encoder.sampling_rate
    scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
    

For more details on using the MusicGen model for inference using the  Transformers library, refer to the [MusicGen docs](https://huggingface.co/docs/transformers/model_doc/musicgen).

[](#audiocraft-usage)Audiocraft Usage
-------------------------------------

You can also run MusicGen locally through the original [Audiocraft library](/facebook/musicgen-small/blob/main/(https://github.com/facebookresearch/audiocraft):

1.  First install the [`audiocraft` library](https://github.com/facebookresearch/audiocraft)

    pip install git+https://github.com/facebookresearch/audiocraft.git
    

2.  Make sure to have [`ffmpeg`](https://ffmpeg.org/download.html) installed:

    apt-get install ffmpeg
    

3.  Run the following Python code:

    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write
    
    model = MusicGen.get_pretrained("small")
    model.set_generation_params(duration=8)  # generate 8 seconds.
    
    descriptions = ["happy rock", "energetic EDM"]
    
    wav = model.generate(descriptions)  # generates 2 samples.
    
    for idx, one_wav in enumerate(wav):
        # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
        audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
    

[](#model-details)Model details
-------------------------------

**Organization developing the model:** The FAIR team of Meta AI.

**Model date:** MusicGen was trained between April 2023 and May 2023.

**Model version:** This is the version 1 of the model.

**Model type:** MusicGen consists of an EnCodec model for audio tokenization, an auto-regressive language model based on the transformer architecture for music modeling. The model comes in different sizes: 300M, 1.5B and 3.3B parameters ; and two variants: a model trained for text-to-music generation task and a model trained for melody-guided music generation.

**Paper or resources for more information:** More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284).

**Citation details:**

    @misc{copet2023simple,
          title={Simple and Controllable Music Generation}, 
          author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Dfossez},
          year={2023},
          eprint={2306.05284},
          archivePrefix={arXiv},
          primaryClass={cs.SD}
    }
    

**License:** Code is released under MIT, model weights are released under CC-BY-NC 4.0.

**Where to send questions or comments about the model:** Questions and comments about MusicGen can be sent via the [Github repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue.

[](#intended-use)Intended use
-----------------------------

**Primary intended use:** The primary use of MusicGen is research on AI-based music generation, including:

*   Research efforts, such as probing and better understanding the limitations of generative models to further improve the state of science
*   Generation of music guided by text or melody to understand current abilities of generative AI models by machine learning amateurs

**Primary intended users:** The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateur seeking to better understand those models.

**Out-of-scope use cases:** The model should not be used on downstream applications without further risk evaluation and mitigation. The model should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people. This includes generating music that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.

[](#metrics)Metrics
-------------------

**Models performance measures:** We used the following objective measure to evaluate the model on a standard music benchmark:

*   Frechet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish)
*   Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST)
*   CLAP Score between audio embedding and text embedding extracted from a pre-trained CLAP model

Additionally, we run qualitative studies with human participants, evaluating the performance of the model with the following axes:

*   Overall quality of the music samples;
*   Text relevance to the provided text input;
*   Adherence to the melody for melody-guided music generation.

More details on performance measures and human studies can be found in the paper.

**Decision thresholds:** Not applicable.

[](#evaluation-datasets)Evaluation datasets
-------------------------------------------

The model was evaluated on the [MusicCaps benchmark](https://www.kaggle.com/datasets/googleai/musiccaps) and on an in-domain held-out evaluation set, with no artist overlap with the training set.

[](#training-datasets)Training datasets
---------------------------------------

The model was trained on licensed data using the following sources: the [Meta Music Initiative Sound Collection](https://www.fb.com/sound), [Shutterstock music collection](https://www.shutterstock.com/music) and the [Pond5 music collection](https://www.pond5.com/). See the paper for more details about the training set and corresponding preprocessing.

[](#evaluation-results)Evaluation results
-----------------------------------------

Below are the objective metrics obtained on MusicCaps with the released model. Note that for the publicly released models, we had all the datasets go through a state-of-the-art music source separation method, namely using the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs), in order to keep only the instrumental part. This explains the difference in objective metrics with the models used in the paper.

Model

Frechet Audio Distance

KLD

Text Consistency

Chroma Cosine Similarity

**facebook/musicgen-small**

4.88

1.42

0.27

\-

facebook/musicgen-medium

5.14

1.38

0.28

\-

facebook/musicgen-large

5.48

1.37

0.28

\-

facebook/musicgen-melody

4.93

1.41

0.27

0.44

More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284), in the Results section.

[](#limitations-and-biases)Limitations and biases
-------------------------------------------------

**Data:** The data sources used to train the model are created by music professionals and covered by legal agreements with the right holders. The model is trained on 20K hours of data, we believe that scaling the model on larger datasets can further improve the performance of the model.

**Mitigations:** Vocals have been removed from the data source using corresponding tags, and then using a state-of-the-art music source separation method, namely using the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs).

**Limitations:**

*   The model is not able to generate realistic vocals.
*   The model has been trained with English descriptions and will not perform as well in other languages.
*   The model does not perform equally well for all music styles and cultures.
*   The model sometimes generates end of songs, collapsing to silence.
*   It is sometimes difficult to assess what types of text descriptions provide the best generations. Prompt engineering may be required to obtain satisfying results.

**Biases:** The source of data is potentially lacking diversity and all music cultures are not equally represented in the dataset. The model may not perform equally well on the wide variety of music genres that exists. The generated samples from the model will reflect the biases from the training data. Further work on this model should include methods for balanced and just representations of cultures, for example, by scaling the training data to be both diverse and inclusive.

**Risks and harms:** Biases and limitations of the model may lead to generation of samples that may be considered as biased, inappropriate or offensive. We believe that providing the code to reproduce the research and train new models will allow to broaden the application to new and more representative data.

**Use cases:** Users must be aware of the biases, limitations and risks of the model. MusicGen is a model developed for artificial intelligence research on controllable music generation. As such, it should not be used for downstream applications without further investigation and mitigation of risks.

## Model overview

The `musicgen-small` is a text-to-music model developed by Facebook that can generate high-quality music samples conditioned on text descriptions or audio prompts. Unlike existing methods like MusicLM, MusicGen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, the model can predict them in parallel, requiring only 50 auto-regressive steps per second of audio. 

MusicGen is available in different checkpoint sizes, including [medium](https://aimodels.fyi/models/huggingFace/musicgen-medium-facebook) and [large](https://aimodels.fyi/models/huggingFace/musicgen-large-facebook), as well as a [melody](https://aimodels.fyi/models/huggingFace/musicgen-melody-facebook) variant trained for melody-guided music generation. These models were published in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by researchers from Facebook.

## Model inputs and outputs

### Inputs
- **Text descriptions**: MusicGen can generate music conditioned on text prompts describing the desired style, mood, or genre.
- **Audio prompts**: The model can also be conditioned on audio inputs to guide the generation.

### Outputs
- **32kHz audio waveform**: MusicGen outputs a mono 32kHz audio waveform representing the generated music sample.

## Capabilities

MusicGen demonstrates strong capabilities in generating high-quality, controllable music from text or audio inputs. The model can create diverse musical samples across genres like rock, pop, EDM, and more, while adhering to the provided prompts.

## What can I use it for?

MusicGen is primarily intended for research on AI-based music generation, such as probing the model's limitations and exploring its potential applications. Hobbyists and amateur musicians may also find it useful for generating music guided by text or melody to better understand the current state of generative AI models.

## Things to try

You can easily run MusicGen locally using the Transformers library, which provides a simple interface for generating audio from text prompts. Try experimenting with different genres, moods, and levels of detail in your prompts to see the range of musical outputs the model can produce.

[](#fastspeech2-en-ljspeech)fastspeech2-en-ljspeech
===================================================

[FastSpeech 2](https://arxiv.org/abs/2006.04558) text-to-speech model from fairseq S^2 ([paper](https://arxiv.org/abs/2109.06912)/[code](https://github.com/pytorch/fairseq/tree/main/examples/speech_synthesis)):

*   English
*   Single-speaker female voice
*   Trained on [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)

[](#usage)Usage
---------------

    from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
    from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
    import IPython.display as ipd
    
    
    models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
        "facebook/fastspeech2-en-ljspeech",
        arg_overrides={"vocoder": "hifigan", "fp16": False}
    )
    model = models[0]
    TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
    generator = task.build_generator(model, cfg)
    
    text = "Hello, this is a test run."
    
    sample = TTSHubInterface.get_model_input(task, text)
    wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)
    
    ipd.Audio(wav, rate=rate)
    

See also [fairseq S^2 example](https://github.com/pytorch/fairseq/blob/main/examples/speech_synthesis/docs/ljspeech_example.md).

[](#citation)Citation
---------------------

    @inproceedings{wang-etal-2021-fairseq,
        title = "fairseq S{\^{}}2: A Scalable and Integrable Speech Synthesis Toolkit",
        author = "Wang, Changhan  and
          Hsu, Wei-Ning  and
          Adi, Yossi  and
          Polyak, Adam  and
          Lee, Ann  and
          Chen, Peng-Jen  and
          Gu, Jiatao  and
          Pino, Juan",
        booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
        month = nov,
        year = "2021",
        address = "Online and Punta Cana, Dominican Republic",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2021.emnlp-demo.17",
        doi = "10.18653/v1/2021.emnlp-demo.17",
        pages = "143--152",
    }

## Model overview

The `fastspeech2-en-ljspeech` model is a text-to-speech (TTS) model from Facebook's [fairseq S^2](https://arxiv.org/abs/2109.06912) project. It is a [FastSpeech 2](https://arxiv.org/abs/2006.04558) model trained on the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset, which contains a single-speaker female voice in English.

## Model inputs and outputs

### Inputs
- **Text**: The model takes in text as input, which is then converted to speech.

### Outputs
- **Audio**: The model outputs a waveform representing the synthesized speech.

## Capabilities

The `fastspeech2-en-ljspeech` model can be used to convert text to high-quality, natural-sounding speech in English. It is a non-autoregressive model, which means it can generate the entire audio output in a single pass, resulting in faster inference compared to autoregressive TTS models.

## What can I use it for?

The `fastspeech2-en-ljspeech` model can be used in a variety of applications that require text-to-speech functionality, such as audiobook generation, voice assistants, and text-based games or applications. The fast inference speed of the model makes it well-suited for real-time or streaming applications.

## Things to try

Developers can experiment with the `fastspeech2-en-ljspeech` model by integrating it into their own applications or projects. For example, they could use the model to generate audio versions of written content, or to add speech capabilities to conversational interfaces. The model's single-speaker female voice could also be used to create personalized TTS experiences.