Stable diffusion for real-time music generation

## Model overview

`riffusion` is a library for real-time music and audio generation using the [Stable Diffusion](https://aimodels.fyi/models/replicate/stable-diffusion-stability-ai) text-to-image diffusion model. It was developed by [Seth Forsgren](https://sethforsgren.com/) and [Hayk Martiros](https://haykmartiros.com/) as a hobby project. `riffusion` fine-tunes Stable Diffusion to generate spectrogram images that can be converted into audio clips, so music can be created from text prompts. This contrasts with similar models such as [inkpunk-diffusion](https://aimodels.fyi/models/replicate/inkpunk-diffusion-adithram) and [multidiffusion](https://aimodels.fyi/models/replicate/multidiffusion-omerbt), which focus on visual art generation.

## Model inputs and outputs

`riffusion` takes in a text prompt, an optional second prompt for interpolation, a seed image ID, and parameters controlling the diffusion process. It outputs a spectrogram image and the corresponding audio clip. 

### Inputs
- **Prompt A**: The primary text prompt describing the desired audio
- **Prompt B**: An optional second prompt to interpolate with the first
- **Alpha**: The interpolation value between the two prompts, from 0 to 1
- **Denoising**: How much to transform the input spectrogram, from 0 to 1
- **Seed Image ID**: The ID of a seed spectrogram image to use
- **Num Inference Steps**: The number of steps to run the diffusion model

### Outputs
- **Spectrogram Image**: A spectrogram visualization of the generated audio
- **Audio Clip**: The generated audio clip in MP3 format
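
The inputs listed above can be collected into a small validated structure before they are sent to the model. The following is a minimal sketch, not the library's actual API: the class name `RiffusionRequest`, the field names, the default values, and the placeholder seed image ID are all hypothetical. The direction of `alpha` (0 meaning prompt A only) is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiffusionRequest:
    """Hypothetical container mirroring the documented inputs."""

    prompt_a: str
    prompt_b: Optional[str] = None          # optional second prompt
    alpha: float = 0.5                      # assumed: 0 = prompt A only, 1 = prompt B only
    denoising: float = 0.75                 # 0 = keep the seed spectrogram, 1 = fully transform it
    seed_image_id: str = "example_seed"     # placeholder ID, not a confirmed value
    num_inference_steps: int = 50           # assumed default

    def __post_init__(self) -> None:
        if not self.prompt_a.strip():
            raise ValueError("prompt_a must not be empty")
        # Both interpolation and denoising are documented as ranging from 0 to 1.
        for name in ("alpha", "denoising"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        if self.num_inference_steps < 1:
            raise ValueError("num_inference_steps must be at least 1")


request = RiffusionRequest(
    prompt_a="jazz with piano",
    prompt_b="funky synth solo",
    alpha=0.25,
)
```

Validating ranges up front gives a clear error message instead of a failure deep inside the diffusion loop.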

## Capabilities

`riffusion` can generate a wide variety of musical styles and genres based on the provided text prompts. For example, it can create "funky synth solos", "jazz with piano", or "church bells on Sunday". The model is able to capture complex musical concepts and translate them into coherent audio clips. 

## What can I use it for?

The `riffusion` model is intended for research and creative applications. It could be used to generate audio for educational or creative tools, or as part of artistic projects exploring the intersection of language and music. Additionally, researchers studying generative models and the connection between text and audio may find `riffusion` useful for their work.

## Things to try

One interesting aspect of `riffusion` is its ability to interpolate between two text prompts. By adjusting the `alpha` parameter, you can create a smooth transition from one style of music to another, allowing for the generation of unique and unexpected audio clips. Another interesting area to explore is the model's handling of seed images - by providing different starting spectrograms, you can influence the character and direction of the generated music.
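
To make the role of `alpha` concrete, here is a rough sketch of two common ways to blend vectors in diffusion pipelines: linear interpolation (often applied to text embeddings) and spherical linear interpolation (often applied to noise latents). This illustrates the general technique under the assumption that each prompt has already been encoded as an array. It is not confirmed by this page to be what `riffusion` does internally, and the function names are illustrative.

```python
import numpy as np


def lerp(a: np.ndarray, b: np.ndarray, alpha: float) -> np.ndarray:
    """Linear blend: alpha=0 returns a, alpha=1 returns b."""
    return (1.0 - alpha) * a + alpha * b


def slerp(a: np.ndarray, b: np.ndarray, alpha: float, eps: float = 1e-7) -> np.ndarray:
    """Spherical blend along the arc between a and b."""
    a_flat, b_flat = a.ravel(), b.ravel()
    cos_theta = np.dot(a_flat, b_flat) / (
        np.linalg.norm(a_flat) * np.linalg.norm(b_flat) + eps
    )
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.sin(theta) < eps:
        # Vectors are (anti)parallel, so the arc is undefined: fall back to lerp.
        return lerp(a, b, alpha)
    return (
        np.sin((1.0 - alpha) * theta) * a + np.sin(alpha * theta) * b
    ) / np.sin(theta)


# Stand-ins for two encoded prompts (random data, for illustration only).
rng = np.random.default_rng(0)
embedding_a = rng.standard_normal((77, 768))
embedding_b = rng.standard_normal((77, 768))

# Sweep alpha from 0 to 1 to move gradually from prompt A to prompt B.
blended = [lerp(embedding_a, embedding_b, alpha) for alpha in np.linspace(0.0, 1.0, 5)]
```

Spherical interpolation keeps the blended vector's norm close to that of the endpoints, which is why it is often preferred for Gaussian noise latents.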

Riffusion
=========

Riffusion is an app for real-time music generation with stable diffusion.

Read about it at [https://www.riffusion.com/about](https://www.riffusion.com/about) and try it at [https://www.riffusion.com/](https://www.riffusion.com/).

*   Code: [https://github.com/riffusion/riffusion](https://github.com/riffusion/riffusion)
*   Web app: [https://github.com/hmartiro/riffusion-app](https://github.com/hmartiro/riffusion-app)
*   Model checkpoint: [https://huggingface.co/riffusion/riffusion-model-v1](https://huggingface.co/riffusion/riffusion-model-v1)
*   Discord: [https://discord.gg/yu6SRwvX4v](https://discord.gg/yu6SRwvX4v)

This repository contains the model files, including:

*   a diffusers-formatted library
*   a compiled checkpoint file
*   a traced unet for improved inference speed
*   a seed image library for use with riffusion-app

Riffusion v1 Model
------------------

Riffusion is a latent text-to-image diffusion model capable of generating spectrogram images given any text input. These spectrograms can be converted into audio clips.
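
The first step of turning a spectrogram image back into audio is decoding pixel intensities into magnitudes. The sketch below shows one plausible decoding. It assumes the image is 8-bit grayscale, that darker pixels are louder, that the top row is the highest frequency, and that amplitudes were compressed with a power law. The function name and the `max_volume` and `power` values are illustrative assumptions, not parameters confirmed by this page. A phase-reconstruction step (for example Griffin-Lim) would still be needed afterward to obtain a waveform.

```python
import numpy as np


def magnitudes_from_image(
    pixels: np.ndarray, max_volume: float = 50.0, power: float = 0.25
) -> np.ndarray:
    """Decode 8-bit grayscale pixels (rows = frequency bins) into magnitudes.

    The encoding assumed here is illustrative; see the surrounding text.
    """
    data = pixels.astype(np.float32)
    data = data[::-1, :]                  # flip so row 0 is the lowest frequency
    data = 255.0 - data                   # invert: dark pixels become large values
    data = data * max_volume / 255.0      # rescale to the compressed amplitude range
    return np.power(data, 1.0 / power)    # undo the power-law compression


# Tiny example: top row is white (silent), bottom row is black (loud).
example = np.array([[255], [0]], dtype=np.uint8)
decoded = magnitudes_from_image(example)
```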

The model was created by [Seth Forsgren](https://sethforsgren.com/) and [Hayk Martiros](https://haykmartiros.com/) as a hobby project.

You can use the Riffusion model directly, or try the [Riffusion web app](https://www.riffusion.com/).
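
Because the checkpoint repository ships in diffusers format, using the model directly might look like the sketch below. It assumes the `diffusers` and `torch` packages are installed and that the repository loads with the standard `StableDiffusionPipeline`. The helper name `generate_spectrogram` and its defaults are illustrative. The result is a spectrogram image, which still has to be converted to audio separately.

```python
MODEL_ID = "riffusion/riffusion-model-v1"  # checkpoint linked on this page


def generate_spectrogram(prompt: str, num_inference_steps: int = 50, seed: int = 0):
    """Return a PIL image containing a spectrogram generated for `prompt`."""
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import torch
    from diffusers import StableDiffusionPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID).to(device)

    # A seeded generator makes the output reproducible for a given prompt.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(
        prompt,
        num_inference_steps=num_inference_steps,
        generator=generator,
    )
    return result.images[0]
```

Calling `generate_spectrogram("jazz with piano").save("spectrogram.png")` would download the weights on first use and write the image to disk.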

The Riffusion model was created by fine-tuning the **Stable-Diffusion-v1-5** checkpoint. Read about Stable Diffusion in [Hugging Face's Stable Diffusion blog](https://huggingface.co/blog/stable_diffusion).

### Model Details

*   **Developed by:** Seth Forsgren, Hayk Martiros
*   **Model type:** Diffusion-based text-to-image generation model
*   **Language(s):** English
*   **License:** [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that [BigScience](https://bigscience.huggingface.co/) and [the RAIL Initiative](https://www.licenses.ai/) are jointly carrying out in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based.
*   **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses a fixed, pretrained text encoder ([CLIP ViT-L/14](https://arxiv.org/abs/2103.00020)) as suggested in the [Imagen paper](https://arxiv.org/abs/2205.11487).

### Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:

*   Generation of artworks, audio, and use in creative processes.
*   Applications in educational or creative tools.
*   Research on generative models.

### Datasets

The original Stable Diffusion v1.5 was trained on the [LAION-5B](https://arxiv.org/abs/2210.08402) dataset using the [CLIP text encoder](https://openai.com/blog/clip/), which provided an amazing starting point with an in-depth understanding of language, including musical concepts. The team at LAION also compiled a fantastic audio dataset from many general, speech, and music sources that we recommend at [LAION-AI/audio-dataset](https://github.com/LAION-AI/audio-dataset/blob/main/data_collection/README.md).

### Fine Tuning

Check out the [diffusers training examples](https://huggingface.co/docs/diffusers/training/overview) from Hugging Face. Fine tuning requires a dataset of spectrogram images of short audio clips, with associated text describing them. Note that the CLIP encoder is able to understand and connect many words even if they never appear in the dataset. It is also possible to use a [dreambooth](https://huggingface.co/blog/dreambooth) method to get custom styles.
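
As a rough illustration of the dataset requirement above, the sketch below writes a `metadata.jsonl` file that pairs each spectrogram image with its caption. This is the layout the Hugging Face `datasets` image-folder loader reads, with a `file_name` column plus extra columns such as `text`. The helper name, file names, and captions are made up for the example, and the spectrogram images themselves are assumed to already exist in the same directory.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(dataset_dir, captions):
    """Write metadata.jsonl mapping each spectrogram image file to its caption."""
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    metadata_path = dataset_dir / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        # One JSON object per line, sorted for a stable, diff-friendly file.
        for file_name, text in sorted(captions.items()):
            f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")
    return metadata_path


# Example with hypothetical clips; a temporary directory stands in for the dataset.
with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata(
        tmp,
        {
            "clip_0002.png": "jazz with piano",
            "clip_0001.png": "funky synth solo",
        },
    )
    lines = path.read_text(encoding="utf-8").splitlines()
```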

Citation
--------

If you build on this work, please cite it as follows:

    @article{Forsgren_Martiros_2022,
      author = {Forsgren, Seth* and Martiros, Hayk*},
      title = {{Riffusion - Stable diffusion for real-time music generation}},
      url = {https://riffusion.com/about},
      year = {2022}
    }

## Model overview

`riffusion-model-v1` is a latent text-to-image diffusion model capable of generating spectrogram images given any text input. These spectrograms can be converted into audio clips. The model was created by fine-tuning the [Stable Diffusion](https://aimodels.fyi/models/huggingFace/stable-diffusion-compvis) checkpoint. The Riffusion model was developed by [Seth Forsgren](https://sethforsgren.com/) and [Hayk Martiros](https://haykmartiros.com/) as a hobby project.

## Model inputs and outputs

`riffusion-model-v1` takes text prompts as input and generates spectrogram images as output. These spectrograms can then be converted into audio clips.

### Inputs
- **Text prompt**: Any text input that describes the desired audio clip.

### Outputs
- **Spectrogram image**: An image containing a visual representation of the audio signal's frequency content over time.
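
To show what "frequency content over time" means in practice, here is a small sketch that computes a magnitude spectrogram with a short-time Fourier transform using only NumPy. The window size, hop length, and test tone are arbitrary choices for illustration and are not the settings used by the model.

```python
import numpy as np


def spectrogram(signal: np.ndarray, n_fft: int = 512, hop: int = 128) -> np.ndarray:
    """Return a magnitude spectrogram of shape (n_fft // 2 + 1, num_frames)."""
    window = np.hanning(n_fft)
    num_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(num_frames)]
    )
    # FFT each frame, keep magnitudes, and put frequency on the vertical axis.
    return np.abs(np.fft.rfft(frames, axis=1)).T


sample_rate = 8000
t = np.arange(sample_rate) / sample_rate      # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)           # 1 kHz sine wave

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())    # strongest frequency bin on average
peak_hz = peak_bin * sample_rate / 512        # convert bin index to Hz
```

Each column of `spec` is one moment in time and each row is one frequency band, so a steady tone shows up as a horizontal line at its frequency.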

## Capabilities

`riffusion-model-v1` can generate a wide variety of audio content from text prompts, from musical melodies to sound effects. Because it builds on Stable Diffusion, it can produce varied and creative outputs that align with the provided text input.

## What can I use it for?

The `riffusion-model-v1` model is intended for research purposes only. Possible use cases include the generation of artistic audio content, exploration of the limitations and biases of generative audio models, and the development of educational or creative tools. The model should not be used to intentionally create or disseminate harmful or offensive content.

## Things to try

Experiment with different text prompts to see the variety of audio outputs the `riffusion-model-v1` can generate. Try prompts that describe specific genres, instruments, or sound effects to see how the model responds. Additionally, you can explore the model's capabilities by combining text prompts with the [Riffusion web app](https://www.riffusion.com/) to create interactive audio experiences.