Text-to-audio generation with latent diffusion models

## Model overview

`audio-ldm` is a text-to-audio generation model created by Haohe Liu, a researcher at CVSSP. It uses a latent diffusion model to generate audio from text prompts. The approach mirrors [stable-diffusion](https://aimodels.fyi/models/replicate/stable-diffusion-stability-ai), a widely used latent text-to-image diffusion model, but is applied to the audio domain. Related audio models include [riffusion](https://aimodels.fyi/models/replicate/riffusion-riffusion), which generates music from text, and [whisperx](https://aimodels.fyi/models/replicate/whisperx-daanelson), which transcribes audio. Unlike those, `audio-ldm` targets general-purpose audio generation from text rather than music alone or transcription.

## Model inputs and outputs

The `audio-ldm` model takes a text prompt as input and returns a generated audio clip. The prompt describes the desired sound, such as "a hammer hitting a wooden surface" or "children singing", and the model produces a clip intended to match that description.

### Inputs
- **Text**: A text prompt describing the desired audio to generate.
- **Duration**: The length of the generated audio clip in seconds. Longer durations may lead to out-of-memory errors.
- **Random Seed**: An optional seed that controls the randomness of the generation. Changing the seed produces a different clip for the same prompt.
- **N Candidates**: The number of candidate audio clips to generate. The model returns only the candidate it ranks best, so larger values cost more compute.
- **Guidance Scale**: A parameter that controls the balance between audio quality and diversity. Higher values generally lead to better quality but less diversity.
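To make the Random Seed and N Candidates inputs more concrete, here is a minimal sketch of best-of-N selection. It is an illustration only: `fake_generate`, the cosine-similarity score, and the made-up embeddings are all assumptions, since this page does not say how `audio-ldm` ranks its candidates.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fake_generate(seed, dim=8):
    """Hypothetical stand-in for one generation run.

    A real run would return a waveform. Here a seeded RNG returns a small
    'audio embedding' so that the same seed always gives the same candidate.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def best_of_n(text_embedding, base_seed, n_candidates):
    """Generate candidates from consecutive seeds and return the top-scoring index."""
    candidates = [
        fake_generate(base_seed + i, len(text_embedding))
        for i in range(n_candidates)
    ]
    # Assumed scoring rule: similarity between the prompt and each candidate.
    scores = [cosine_similarity(text_embedding, c) for c in candidates]
    best_index = max(range(n_candidates), key=lambda i: scores[i])
    return best_index, scores


# Made-up prompt embedding, for illustration only.
text_embedding = [1.0, 0.5, -0.2, 0.0, 0.3, -1.0, 0.8, 0.1]
best_index, scores = best_of_n(text_embedding, base_seed=42, n_candidates=3)
print(f"picked candidate {best_index} with score {scores[best_index]:.3f}")
```

Under these assumptions, raising `n_candidates` can only keep or improve the best score for a given seed, but each extra candidate costs one more generation run.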

### Outputs
- **Audio Clip**: The generated audio clip that matches the input text prompt.
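The inputs listed above can be gathered into a single request payload before calling the model. The sketch below is one possible way to do that. The field names (`text`, `duration`, `random_seed`, `n_candidates`, `guidance_scale`), the default values, and the validation rules are assumptions for illustration, not a documented API.

```python
def build_audio_ldm_input(text, duration=5.0, random_seed=None,
                          n_candidates=3, guidance_scale=2.5):
    """Build a dictionary of inputs for a text-to-audio request.

    All field names and defaults here are hypothetical; check the hosting
    platform's schema for the real ones.
    """
    if not text or not text.strip():
        raise ValueError("text prompt must not be empty")
    if duration <= 0:
        raise ValueError("duration must be positive (seconds)")
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    if guidance_scale <= 0:
        raise ValueError("guidance_scale must be positive")

    payload = {
        "text": text.strip(),
        "duration": float(duration),
        "n_candidates": int(n_candidates),
        "guidance_scale": float(guidance_scale),
    }
    # The seed is optional, so it is only included when the caller sets it.
    if random_seed is not None:
        payload["random_seed"] = int(random_seed)
    return payload


payload = build_audio_ldm_input("a hammer hitting a wooden surface", random_seed=42)
print(payload)
```

Validating locally like this catches an empty prompt or a non-positive duration before any compute is spent on a request.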

## Capabilities

`audio-ldm` can generate a wide variety of audio from text prompts, including speech, sound effects, and music. It also supports audio-to-audio generation, producing a new clip whose sound events resemble those in a provided input clip. In addition, the model supports text-guided audio-to-audio style transfer, which reshapes the sound of an input clip to match a text description.

## What can I use it for?

`audio-ldm` could be useful for various applications, such as:

- **Creative content generation**: Generating audio content for use in videos, games, or other multimedia projects.
- **Audio post-production**: Automating the creation of sound effects or music to complement visual content.
- **Accessibility**: Generating audio descriptions for visually impaired users.
- **Education and research**: Exploring the capabilities of text-to-audio generation models.

## Things to try

When using `audio-ldm`, try more detailed and descriptive text prompts to get higher-quality results. Experiment with different random seeds to see how they change the output for the same prompt. You can also combine `audio-ldm` with other audio tools and techniques, such as audio editing or signal processing, to refine the generated clips.
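As one example of pairing generated audio with simple signal processing, the sketch below peak-normalizes a clip and applies short linear fades to avoid clicks at the edges. It works on a plain list of float samples. The 16 kHz sample rate, the test tone standing in for model output, and the function names are all assumptions for illustration.

```python
import math


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def apply_fades(samples, fade_len):
    """Apply a linear fade-in and fade-out of fade_len samples each."""
    out = list(samples)
    n = len(out)
    fade_len = min(fade_len, n // 2)  # fades must not overlap
    for i in range(fade_len):
        ramp = i / fade_len  # rises from 0.0 toward 1.0
        out[i] *= ramp
        out[n - 1 - i] *= ramp
    return out


# Stand-in for a generated clip: one second of a quiet 440 Hz tone.
sample_rate = 16000  # assumed sample rate
tone = [0.3 * math.sin(2 * math.pi * 440 * t / sample_rate)
        for t in range(sample_rate)]

processed = apply_fades(peak_normalize(tone), fade_len=800)
print(f"peak after processing: {max(abs(s) for s in processed):.3f}")
```

Normalizing before fading keeps the loudest part of the clip at the target level, while the fades only affect the first and last 50 ms under the assumed sample rate.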