The model, called "Riffusion", is designed for real-time music generation using Stable Diffusion, a latent diffusion model for images. It takes a text prompt as input and generates a spectrogram image, which is then converted into an audio clip. Smooth transitions between musical segments come from interpolating between prompts and seeds in the model's latent space, so that consecutive clips can be played back as coherent, continuous output. This makes it suited to applications such as live music performances or interactive music generation systems.

The Riffusion model is a latent text-to-image diffusion model that generates spectrogram images from text inputs. These spectrograms can then be converted into audio clips. The model was created by Seth Forsgren and Hayk Martiros as a hobby project and is fine-tuned from the Stable-Diffusion-v1-5 checkpoint. It uses a fixed, pretrained text encoder (CLIP ViT-L/14) to condition image generation on the text prompt. The model is licensed under the CreativeML OpenRAIL-M license and is intended for research purposes, particularly in the areas of artwork generation, creative tools, and generative models. The base Stable Diffusion checkpoint was originally trained on data from the LAION-5B dataset, using the same CLIP text encoder. Fine-tuning the model requires a dataset of spectrogram images paired with corresponding text descriptions.
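To make the spectrogram-to-audio step more concrete, here is a minimal sketch in plain Python. It is not Riffusion's actual conversion code: it assumes a tiny grayscale spectrogram given as a list of rows (top row is the highest frequency, pixel values 0 to 255) and resynthesizes it with a simple bank of sine oscillators. The function name `spectrogram_to_audio` and the parameters `sample_rate`, `frame_seconds`, and `max_freq_hz` are hypothetical and chosen for illustration only. A spectrogram image stores magnitudes but no phase, so a production pipeline would typically estimate the phase (for example with the Griffin-Lim algorithm) rather than use fixed oscillators as done here.

```python
import math


def spectrogram_to_audio(spectrogram, sample_rate=8000,
                         frame_seconds=0.05, max_freq_hz=2000.0):
    """Resynthesize audio from a small grayscale spectrogram.

    Illustrative assumptions (not Riffusion's real parameters):
    - spectrogram[row][col] is a pixel value in 0..255
    - row 0 is the highest frequency, the last row the lowest
    - each column covers `frame_seconds` of audio
    """
    n_bins = len(spectrogram)
    n_frames = len(spectrogram[0])
    samples_per_frame = int(sample_rate * frame_seconds)

    # Linearly spaced frequency for each image row (top row = max_freq_hz).
    freqs = [max_freq_hz * (n_bins - row) / n_bins for row in range(n_bins)]

    audio = []
    for frame in range(n_frames):
        for i in range(samples_per_frame):
            t = (frame * samples_per_frame + i) / sample_rate
            # Sum one sine wave per frequency bin, weighted by pixel brightness.
            value = sum(
                (spectrogram[row][frame] / 255.0)
                * math.sin(2 * math.pi * freqs[row] * t)
                for row in range(n_bins)
            )
            # Divide by the bin count so samples stay within [-1, 1].
            audio.append(value / n_bins)
    return audio


# Two frequency bins, two time frames: a low tone followed by a high tone.
toy_spectrogram = [
    [0, 255],   # high-frequency row: silent, then loud
    [255, 0],   # low-frequency row: loud, then silent
]
samples = spectrogram_to_audio(toy_spectrogram)
```

The returned list holds floating-point samples in the range -1 to 1, which could be scaled to 16-bit integers and written out with Python's standard `wave` module. The oscillator bank is used here only because it fits in a few lines; it illustrates the time, frequency, and amplitude layout of a spectrogram rather than a high-quality inversion.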

