audiogen-medium

Maintainer: facebook

Total Score

100

Last updated 5/17/2024

Model Link: View on HuggingFace
API Spec: View on HuggingFace
Github Link: No Github link provided
Paper Link: No paper link provided

Model overview

audiogen-medium is a text-to-audio model developed by the Facebook AI Research (FAIR) team. It is an autoregressive transformer language model that generates general audio content conditioned on text prompts. Instead of modeling waveforms directly, audiogen-medium operates over discrete audio representations learned from raw audio by an EnCodec tokenizer, which lets it produce more diverse and flexible audio than traditional text-to-speech systems. audiogen-medium is a variant of the original AudioGen model that follows the MusicGen architecture, using a 16kHz EnCodec tokenizer with a delay pattern between the codebooks to enable faster generation.
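
To make the text-in, audio-out interface concrete, here is a minimal usage sketch assuming Meta's AudioCraft library, which distributes the AudioGen checkpoints. The class and function names (AudioGen.get_pretrained, set_generation_params, audio_write) follow AudioCraft's documented interface but may differ between library versions, and the prompts are only examples.

```python
# Minimal sketch: text-to-audio generation with audiogen-medium via AudioCraft.
# Assumes the audiocraft package is installed; API names reflect its documented
# interface and may vary across versions.
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

# Load the pretrained 16kHz AudioGen checkpoint from the Hugging Face Hub.
model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)  # generate 5 seconds of audio

# Condition generation on free-form text descriptions of the desired sound.
descriptions = ["dog barking in the distance", "sirens of an emergency vehicle"]
wav = model.generate(descriptions)  # tensor of shape [batch, channels, samples]

# Write each sample to disk as a loudness-normalized WAV file.
for idx, one_wav in enumerate(wav):
    audio_write(f"sample_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```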

Model inputs and outputs

Inputs

  • Text prompts: The model takes in text descriptions as input, which are used to condition the audio generation.

Outputs

  • Audio waveforms: The model outputs general audio content in the form of raw waveforms, which can represent a variety of sounds beyond just speech.

Capabilities

audiogen-medium can generate diverse audio samples ranging from environmental sounds like dog barks or emergency vehicle sirens, to more abstract soundscapes. The model is able to capture timbres, textures and dynamics in the generated audio that go beyond what traditional text-to-speech models can achieve. However, the model is not capable of generating realistic vocals or speech.

What can I use it for?

The primary intended use of audiogen-medium is for research on AI-based audio generation, including efforts to better understand the capabilities and limitations of generative models. The model could also be used by machine learning enthusiasts and hobbyists to explore the potential of text-to-audio synthesis. However, the model should not be used for downstream applications without further evaluation and mitigation of potential biases and risks.

Things to try

One interesting aspect of audiogen-medium is its ability to generate audio content that is responsive to the provided text prompts, while still exhibiting a degree of creative diversity. Experimenting with different types of text descriptions, from concrete to abstract, can yield a wide range of audio outputs that capture different moods, textures and soundscapes. Additionally, comparing the performance of audiogen-medium to that of similar models like MusicGen could provide insights into the strengths and limitations of each approach to text-conditional audio generation.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

🤔

musicgen-medium

facebook

Total Score

80

musicgen-medium is a 1.5B parameter text-to-music model developed by Facebook. It is capable of generating high-quality music samples conditioned on text descriptions or audio prompts. Unlike existing approaches like MusicLM, musicgen-medium does not require a self-supervised semantic representation and generates all 4 audio codebooks in a single pass. By introducing a small delay between the codebooks, it can predict them in parallel, reducing the number of autoregressive steps. The model is part of a family of MusicGen checkpoints, including smaller musicgen-small and larger musicgen-large variants, as well as a musicgen-melody model focused on melody-guided generation.

Model inputs and outputs

musicgen-medium is a text-to-music model that takes in text descriptions as input and generates corresponding audio samples as output. The model is built on an autoregressive Transformer architecture and a 32kHz EnCodec tokenizer with 4 codebooks.

Inputs

  • Text prompt: A text description that conditions the generated music, such as "lo-fi music with a soothing melody".

Outputs

  • Audio sample: A generated 32kHz mono audio waveform representing the music based on the text prompt.

Capabilities

musicgen-medium is capable of generating high-quality music across a variety of styles and genres based on text prompts. The model can produce samples with coherent melodies, harmonies, and rhythmic structures that match the provided descriptions. For example, it can generate "lo-fi music with a soothing melody", "happy rock", or "energetic EDM" when given the corresponding text inputs.

What can I use it for?

musicgen-medium is primarily intended for research on AI-based music generation, such as probing the model's limitations and understanding how to further improve the state of the art. It can also be used by machine learning enthusiasts to generate music guided by text or melody and gain insights into the current capabilities of generative AI models.

Things to try

One interesting aspect of musicgen-medium is its ability to generate music in parallel by predicting the 4 audio codebooks with a small delay. This allows for faster sample generation compared to autoregressive approaches that predict each audio sample sequentially. You can experiment with the generation process and observe how this parallel prediction affects the quality and coherence of the output music. Another interesting direction is to explore prompt engineering: trying different types of text descriptions to see which ones yield the most musically satisfying results. The model's performance may vary across genres and styles, so it could be worth investigating its strengths and weaknesses in different musical domains.
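
To see why the codebook delay reduces the number of autoregressive steps, here is a toy sketch of the interleaving idea. It is purely illustrative and is not AudioCraft's actual implementation; the function name and frame counts are invented for the example.

```python
# Toy illustration of the codebook "delay pattern" (not AudioCraft's real code):
# codebook k is shifted k steps to the right, so at decoding step t the model can
# predict frame t of codebook 0, frame t-1 of codebook 1, and so on, in parallel.
def delay_pattern(num_codebooks: int, num_frames: int):
    """Return, for each decoding step, the (codebook, frame) pairs predicted at that step."""
    steps = []
    for t in range(num_frames + num_codebooks - 1):
        steps.append([(k, t - k) for k in range(num_codebooks) if 0 <= t - k < num_frames])
    return steps

# With 4 codebooks at 50 Hz, one second of audio (50 frames) needs about 53 decoding
# steps, versus the 200 steps a fully flattened interleaving would require.
for t, preds in enumerate(delay_pattern(num_codebooks=4, num_frames=50)[:3]):
    print(t, preds)  # step 0 -> [(0, 0)], step 1 -> [(0, 1), (1, 0)], ...
```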


🤖

musicgen-melody

facebook

Total Score

155

musicgen-melody is a 1.5B parameter version of the MusicGen model developed by the FAIR team at Meta AI. MusicGen is a text-to-music generation model that can produce high-quality music samples conditioned on text descriptions or audio prompts. Unlike existing methods like MusicLM, MusicGen doesn't require a self-supervised semantic representation and generates all audio codebooks in one pass. The small and large MusicGen models are also publicly available.

Model inputs and outputs

Inputs

  • Text descriptions: MusicGen can generate music based on text prompts describing the desired style, mood, or genre.
  • Audio prompts: The model can also use a provided melody or audio clip as a starting point for generating new music.

Outputs

  • 32kHz audio waveforms: MusicGen outputs 32kHz, mono audio samples that can be saved as WAV files.

Capabilities

MusicGen has shown promising results in generating high-quality, controllable music. It can produce diverse genres like rock, EDM, and jazz by simply providing a text prompt. The model can also incorporate a reference melody, allowing for melody-guided music generation. MusicGen's ability to generate coherent, parallel audio codebooks efficiently makes it an interesting advancement in text-to-audio modeling.

What can I use it for?

The primary intended use of musicgen-melody is for AI research on music generation. Researchers can use the model to explore the current state and limitations of generative music models. Hobbyists may also find it interesting to experiment with generating music from text or audio prompts to better understand these emerging AI capabilities.

Things to try

You can easily try out MusicGen yourself using the provided Colab notebook or Hugging Face demo. Try generating music with different text prompts, or provide a melody and see how the model incorporates it. Pay attention to the coherence, diversity, and relevance of the generated samples. Exploring the model's strengths and weaknesses can yield valuable insights.
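
As a concrete illustration of melody-guided generation, here is a minimal sketch assuming AudioCraft's MusicGen interface. The generate_with_chroma call mirrors the library's documented usage but may change between versions, and melody.wav is a placeholder path.

```python
# Minimal sketch (assumed AudioCraft API): melody-guided generation with musicgen-melody.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=8)  # 8 seconds of output

descriptions = ["happy rock", "energetic EDM"]
melody, sr = torchaudio.load("melody.wav")  # reference melody to condition on (placeholder file)

# Generate one sample per description, each following the reference melody's chroma.
wav = model.generate_with_chroma(descriptions, melody[None].expand(len(descriptions), -1, -1), sr)

for idx, one_wav in enumerate(wav):
    audio_write(f"melody_guided_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```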


📉

musicgen-large

facebook

Total Score

350

MusicGen-large is a text-to-music model developed by Facebook that can generate high-quality music samples conditioned on text descriptions or audio prompts. Unlike existing methods like MusicLM, MusicGen-large does not require a self-supervised semantic representation and generates all 4 codebooks in one pass, predicting them in parallel. This allows for faster generation, requiring only 50 auto-regressive steps per second of audio. MusicGen-large is part of a family of MusicGen models released by Facebook, including smaller and melody-focused checkpoints.

Model inputs and outputs

MusicGen-large is a text-to-music model, taking text descriptions or audio prompts as input and generating corresponding music samples as output. The model uses a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz, allowing it to generate all the audio information in parallel.

Inputs

  • Text descriptions: Natural language prompts that describe the desired music.
  • Audio prompts: Existing audio samples that the generated music should be conditioned on.

Outputs

  • Music samples: High-quality 32kHz audio waveforms representing the generated music.

Capabilities

MusicGen-large can generate a wide variety of musical styles and genres based on text or audio prompts, demonstrating impressive quality and control. The model is able to capture complex musical structures and properties like melody, harmony, and rhythm in its outputs. Because the four codebooks are predicted in parallel, only 50 auto-regressive steps are needed per second of generated audio, which keeps generation efficient for practical applications.

What can I use it for?

The primary use cases for MusicGen-large are in music production and creative applications. Developers and artists could leverage the model to rapidly generate music for things like video game soundtracks, podcast jingles, or backing tracks for songs. The ability to control the music through text prompts also enables novel music composition workflows.

Things to try

One interesting thing to try with MusicGen-large is experimenting with the level of detail and specificity in the text prompts. See how changing the prompt from a broad genre descriptor to more detailed musical attributes affects the generated output. You could also try providing audio prompts and observe how the model blends the existing music with the text description.
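
For the audio-prompt experiment suggested above, a hedged sketch assuming AudioCraft's continuation API might look like the following; generate_continuation is taken from the library's documented MusicGen interface, while riff.wav and the chosen durations are purely illustrative.

```python
# Minimal sketch (assumed AudioCraft API): blending an existing audio prompt with a
# text description using MusicGen's continuation mode.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-large")
model.set_generation_params(duration=12)  # total length, including the prompt

# Load a short excerpt of existing music to use as the audio prompt (placeholder path).
prompt_wav, prompt_sr = torchaudio.load("riff.wav")
prompt_wav = prompt_wav[..., : prompt_sr * 4]  # keep the first 4 seconds

# Continue the prompt while steering the result with a text description.
wav = model.generate_continuation(
    prompt_wav, prompt_sr, descriptions=["uplifting orchestral arrangement"]
)
audio_write("continuation_0", wav[0].cpu(), model.sample_rate, strategy="loudness")
```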


🏷️

musicgen-small

facebook

Total Score

247

musicgen-small is a text-to-music model developed by Facebook that can generate high-quality music samples conditioned on text descriptions or audio prompts. Unlike existing methods like MusicLM, MusicGen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, the model can predict them in parallel, requiring only 50 auto-regressive steps per second of audio. MusicGen is available in different checkpoint sizes, including medium and large, as well as a melody variant trained for melody-guided music generation. These models were published in the paper Simple and Controllable Music Generation by researchers from Facebook.

Model inputs and outputs

Inputs

  • Text descriptions: MusicGen can generate music conditioned on text prompts describing the desired style, mood, or genre.
  • Audio prompts: The model can also be conditioned on audio inputs to guide the generation.

Outputs

  • 32kHz audio waveform: MusicGen outputs a mono 32kHz audio waveform representing the generated music sample.

Capabilities

MusicGen demonstrates strong capabilities in generating high-quality, controllable music from text or audio inputs. The model can create diverse musical samples across genres like rock, pop, EDM, and more, while adhering to the provided prompts.

What can I use it for?

MusicGen is primarily intended for research on AI-based music generation, such as probing the model's limitations and exploring its potential applications. Hobbyists and amateur musicians may also find it useful for generating music guided by text or melody to better understand the current state of generative AI models.

Things to try

You can easily run MusicGen locally using the Transformers library, which provides a simple interface for generating audio from text prompts, as shown in the sketch below. Try experimenting with different genres, moods, and levels of detail in your prompts to see the range of musical outputs the model can produce.
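
A minimal sketch of that Transformers-based workflow is shown below. The class names and generation arguments follow the library's documented MusicGen integration, though exact defaults can differ across versions, and the prompts are only examples.

```python
# Minimal sketch: running musicgen-small locally with Hugging Face Transformers.
from scipy.io import wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Tokenize text prompts describing the desired music.
inputs = processor(
    text=["lo-fi music with a soothing melody", "energetic EDM with a heavy drop"],
    padding=True,
    return_tensors="pt",
)

# ~256 new tokens corresponds to roughly 5 seconds of 32kHz audio.
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=256)

# Save the first generated sample as a mono WAV file.
sampling_rate = model.config.audio_encoder.sampling_rate
wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```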
