tango
Maintainer: declare-lab
Tango is a latent diffusion model (LDM) for text-to-audio (TTA) generation, capable of generating realistic audio from textual prompts, including human sounds, animal sounds, natural and artificial sounds, and sound effects. It uses the frozen instruction-tuned language model Flan-T5 as the text encoder and trains a UNet-based diffusion model for audio generation. Tango performs comparably to current state-of-the-art TTA models across both objective and subjective metrics, despite being trained on a dataset 63 times smaller. The maintainer has released the model, training, and inference code for the research community.
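As a rough illustration of that architecture (a frozen Flan-T5 text encoder conditioning a UNet-based latent diffusion model), the sketch below wires together off-the-shelf Hugging Face components. The checkpoint choice, latent shape, and channel counts are illustrative assumptions, not Tango's actual configuration.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import UNet2DConditionModel

# Frozen text encoder; the specific Flan-T5 size is an assumption.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-large")
text_encoder.requires_grad_(False)  # Flan-T5 stays frozen; only the UNet is trained

# UNet operating on latent mel-spectrograms; shapes and channels are placeholders.
unet = UNet2DConditionModel(
    sample_size=(256, 16),
    in_channels=8,
    out_channels=8,
    block_out_channels=(64, 128, 256, 256),  # kept small for the sketch
    cross_attention_dim=1024,                # matches Flan-T5-large hidden size
)

prompt = ["rolling thunder with lightning strikes"]
tokens = tokenizer(prompt, return_tensors="pt", padding=True)
text_emb = text_encoder(**tokens).last_hidden_state  # conditioning sequence

# One denoising step: predict the noise in a latent given the text condition.
noisy_latents = torch.randn(1, 8, 256, 16)
noise_pred = unet(noisy_latents, torch.tensor([10]), encoder_hidden_states=text_emb).sample
```

In the full model, this denoising step is repeated over a noise schedule, and a decoder plus vocoder turn the final latent into a waveform.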
Tango 2 is a follow-up to Tango, built on the same foundation but with an additional alignment stage: Direct Preference Optimization (DPO) training on Audio-alpaca, a pairwise text-to-audio preference dataset. This helps Tango 2 generate higher-quality, better-aligned audio outputs.
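For intuition on the alignment step, the snippet below shows the generic DPO preference loss over a (preferred, rejected) pair. Tango 2 adapts this idea to the diffusion objective on Audio-alpaca, so treat this as an illustration of DPO itself rather than Tango 2's exact training loss; the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for a batch of preference pairs.

    logp_win / logp_lose: (log-)likelihood terms of the preferred and rejected
    audio under the model being fine-tuned; ref_logp_* are the same terms under
    the frozen reference model (the original Tango). beta controls how far the
    tuned model may drift from the reference."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

# Toy call with random values, just to show the expected tensor shapes.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```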
Model inputs and outputs
Inputs
- **Prompt**: A textual description of the desired audio to be generated.
- **Steps**: The number of steps to use for the diffusion-based audio generation process, with more steps typically producing higher-quality results at the cost of longer inference time.
- **Guidance**: The guidance scale, which controls the trade-off between sample quality and sample diversity during the audio generation process.
Outputs
- **Audio**: The generated audio clip corresponding to the input prompt, in WAV format; a complete prompt-to-WAV usage sketch follows below.
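Putting those pieces together, a typical call might look like the sketch below. It assumes the Python wrapper from the declare-lab tango repository (a `Tango` class with a `generate` method) and Hugging Face checkpoint names such as `declare-lab/tango2`; treat the exact keyword arguments as assumptions mirroring the inputs listed above rather than a confirmed API.

```python
import soundfile as sf
from tango import Tango  # wrapper from the declare-lab repository (assumed API)

tango = Tango("declare-lab/tango2")  # or "declare-lab/tango" for the original model

prompt = "An audience cheering and clapping"
# steps and guidance correspond to the inputs described above.
audio = tango.generate(prompt, steps=100, guidance=3)

# The result is a waveform array; save it as a WAV file.
sf.write("cheering.wav", audio, samplerate=16000)
```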
Capabilities
Tango and Tango 2 can generate a wide variety of realistic audio clips, including human sounds, animal sounds, natural and artificial sounds, and sound effects. For example, they can generate sounds of an audience cheering and clapping, rolling thunder with lightning strikes, or a car engine revving.
What can I use it for?
The Tango and Tango 2 models can be used for a variety of applications, such as:
- **Audio content creation**: Generating audio clips for videos, games, podcasts, and other multimedia projects.
- **Sound design**: Creating custom sound effects for various applications.
- **Music composition**: Generating musical elements or accompaniment for songwriting and composition.
- **Accessibility**: Generating audio descriptions for visually impaired users.
Things to try
You can try generating various types of audio clips by providing different prompts to the Tango and Tango 2 models, such as:
- Everyday sounds (e.g., a dog barking, water flowing, a car engine revving)
- Natural phenomena (e.g., thunderstorms, wind, rain)
- Musical instruments and soundscapes (e.g., a piano playing, a symphony orchestra)
- Human vocalizations (e.g., laughter, cheering, singing)
- Ambient and abstract sounds (e.g., a futuristic machine, alien landscapes)
Experiment with the number of steps and guidance scale to find the right balance between sample quality and generation time for your specific use case.
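One way to explore that trade-off is a small parameter sweep, reusing the assumed `Tango` wrapper and `generate(prompt, steps=..., guidance=...)` call from the earlier sketch; the specific values below are only starting points.

```python
import soundfile as sf
from tango import Tango  # assumed wrapper, as in the earlier sketch

tango = Tango("declare-lab/tango2")
prompt = "rolling thunder with lightning strikes"

# Fewer steps generate faster; higher guidance follows the prompt more closely
# at the cost of sample diversity.
for steps in (50, 100, 200):
    for guidance in (3, 7):
        audio = tango.generate(prompt, steps=steps, guidance=guidance)
        sf.write(f"thunder_s{steps}_g{guidance}.wav", audio, samplerate=16000)
```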