[](#sd-xl-10-base-model-card)SD-XL 1.0-base Model Card
======================================================

[![row01](/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/01.png)](/stabilityai/stable-diffusion-xl-base-1.0/blob/main/01.png)

[](#model)Model
---------------

[![pipeline](/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/pipeline.png)](/stabilityai/stable-diffusion-xl-base-1.0/blob/main/pipeline.png)

[SDXL](https://arxiv.org/abs/2307.01952) consists of an [ensemble of experts](https://arxiv.org/abs/2211.01324) pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: [https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/)) specialized for the final denoising steps. Note that the base model can be used as a standalone module.

Alternatively, we can use a two-stage pipeline as follows: First, the base model is used to generate latents of the desired output size. In the second step, we use a specialized high-resolution model and apply a technique called SDEdit ([https://arxiv.org/abs/2108.01073](https://arxiv.org/abs/2108.01073), also known as "img2img") to the latents generated in the first step, using the same prompt. This technique is slightly slower than the first one, as it requires more function evaluations.
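
To make the cost difference between the two pipelines concrete, the following minimal sketch counts denoiser evaluations for each. The function names, the 80/20 split, and the `strength` value of 0.3 are illustrative assumptions rather than values prescribed by this card, and the sketch assumes one denoiser evaluation per sampling step.

```python
def ensemble_evaluations(n_steps: int, high_noise_frac: float) -> tuple:
    """Ensemble of experts: one schedule of n_steps is split between the two models."""
    base_steps = round(n_steps * high_noise_frac)   # high-noise steps run by the base model
    refiner_steps = n_steps - base_steps            # remaining low-noise steps run by the refiner
    return base_steps, refiner_steps


def sdedit_evaluations(n_steps: int, strength: float) -> tuple:
    """Two-stage SDEdit: the base model runs the full schedule, then the second
    model re-noises the result and denoises the last `strength` fraction of a
    fresh schedule (hypothetical parameter name, mirroring img2img conventions)."""
    refine_steps = int(round(n_steps * strength, 6))  # round first to avoid float truncation
    return n_steps, refine_steps


ensemble = ensemble_evaluations(40, 0.8)
two_stage = sdedit_evaluations(40, 0.3)
print("ensemble of experts:", ensemble, "total", sum(ensemble))
print("two-stage SDEdit:   ", two_stage, "total", sum(two_stage))
```

Under these assumptions the ensemble never exceeds the step budget of a single schedule, whereas the two-stage variant always adds the refinement steps on top of a full base run.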

Source code is available at [https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models) .

### [](#model-description)Model Description

*   **Developed by:** Stability AI
*   **Model type:** Diffusion-based text-to-image generative model
*   **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/LICENSE.md)
*   **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses two fixed, pretrained text encoders ([OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main)).
*   **Resources for more information:** Check out our [GitHub Repository](https://github.com/Stability-AI/generative-models) and the [SDXL report on arXiv](https://arxiv.org/abs/2307.01952).

### [](#model-sources)Model Sources

For research purposes, we recommend our `generative-models` GitHub repository ([https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)), which implements the most popular diffusion frameworks (both training and inference) and to which new functionalities like distillation will be added over time. [Clipdrop](https://clipdrop.co/stable-diffusion) provides free SDXL inference.

*   **Repository:** [https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)
*   **Demo:** [https://clipdrop.co/stable-diffusion](https://clipdrop.co/stable-diffusion)

[](#evaluation)Evaluation
-------------------------

[![comparison](/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/comparison.png)](/stabilityai/stable-diffusion-xl-base-1.0/blob/main/comparison.png) The chart above evaluates user preference for SDXL (with and without refinement) over SDXL 0.9 and Stable Diffusion 1.5 and 2.1. The SDXL base model performs significantly better than the previous variants, and the model combined with the refinement module achieves the best overall performance.

### [](#-diffusers) Diffusers

Make sure to upgrade diffusers to >= 0.19.0:

    pip install diffusers --upgrade
    

In addition, make sure to install `transformers`, `safetensors`, and `accelerate`, as well as the invisible watermark package:

    pip install invisible_watermark transformers accelerate safetensors
    

To just use the base model, you can run:

    from diffusers import DiffusionPipeline
    import torch
    
    pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
    pipe.to("cuda")
    
    # if using torch < 2.0
    # pipe.enable_xformers_memory_efficient_attention()
    
    prompt = "An astronaut riding a green horse"
    
    image = pipe(prompt=prompt).images[0]
    

To use the whole base + refiner pipeline as an ensemble of experts you can run:

    from diffusers import DiffusionPipeline
    import torch
    
    # load both base & refiner
    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
    )
    base.to("cuda")
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2,
        vae=base.vae,
        torch_dtype=torch.float16,
        use_safetensors=True,
        variant="fp16",
    )
    refiner.to("cuda")
    
    # Define the total number of steps and the fraction run by each expert (80/20 here)
    n_steps = 40
    high_noise_frac = 0.8
    
    prompt = "A majestic lion jumping from a big stone at night"
    
    # run both experts
    image = base(
        prompt=prompt,
        num_inference_steps=n_steps,
        denoising_end=high_noise_frac,
        output_type="latent",
    ).images
    image = refiner(
        prompt=prompt,
        num_inference_steps=n_steps,
        denoising_start=high_noise_frac,
        image=image,
    ).images[0]
    

When using `torch >= 2.0`, you can improve the inference speed by 20-30% with `torch.compile`. Simply wrap the UNet with `torch.compile` before running the pipeline:

    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    

If you are limited by GPU VRAM, you can enable _cpu offloading_ by calling `pipe.enable_model_cpu_offload` instead of `.to("cuda")`:

    - pipe.to("cuda")
    + pipe.enable_model_cpu_offload()
    

For more information on how to use Stable Diffusion XL with `diffusers`, please have a look at [the Stable Diffusion XL Docs](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl).

### [](#optimum)Optimum

[Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with both [OpenVINO](https://docs.openvino.ai/latest/index.html) and [ONNX Runtime](https://onnxruntime.ai/).

#### [](#openvino)OpenVINO

To install Optimum with the dependencies required for OpenVINO:

    pip install optimum[openvino]
    

To load an OpenVINO model and run inference with OpenVINO Runtime, you need to replace `StableDiffusionXLPipeline` with Optimum `OVStableDiffusionXLPipeline`. In case you want to load a PyTorch model and convert it to the OpenVINO format on-the-fly, you can set `export=True`.

    - from diffusers import StableDiffusionXLPipeline
    + from optimum.intel import OVStableDiffusionXLPipeline
    
    model_id = "stabilityai/stable-diffusion-xl-base-1.0"
    - pipeline = StableDiffusionXLPipeline.from_pretrained(model_id)
    + pipeline = OVStableDiffusionXLPipeline.from_pretrained(model_id)
    prompt = "A majestic lion jumping from a big stone at night"
    image = pipeline(prompt).images[0]
    

You can find more examples (such as static reshaping and model compilation) in optimum [documentation](https://huggingface.co/docs/optimum/main/en/intel/inference#stable-diffusion-xl).
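
The on-the-fly conversion mentioned above can be sketched as follows. This is a minimal sketch, assuming `optimum[openvino]` is installed; the helper name `export_sdxl_to_openvino` and the output directory are hypothetical, while `export=True` is the option described in this section.

```python
def export_sdxl_to_openvino(model_id: str, save_dir: str):
    """Convert the PyTorch checkpoint to the OpenVINO format and cache it on disk."""
    # Imported lazily so that defining this helper does not require Optimum.
    from optimum.intel import OVStableDiffusionXLPipeline

    # export=True converts the PyTorch weights to the OpenVINO format on the fly.
    pipeline = OVStableDiffusionXLPipeline.from_pretrained(model_id, export=True)
    # Saving the converted pipeline avoids repeating the conversion on later runs.
    pipeline.save_pretrained(save_dir)
    return pipeline


# Example call (downloads the model; "sdxl-openvino" is a hypothetical local directory):
# pipeline = export_sdxl_to_openvino("stabilityai/stable-diffusion-xl-base-1.0", "sdxl-openvino")
# image = pipeline("A majestic lion jumping from a big stone at night").images[0]
```

On later runs, the saved directory can be passed to `from_pretrained` without `export=True`.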

#### [](#onnx)ONNX

To install Optimum with the dependencies required for ONNX Runtime inference:

    pip install optimum[onnxruntime]
    

To load an ONNX model and run inference with ONNX Runtime, you need to replace `StableDiffusionXLPipeline` with Optimum `ORTStableDiffusionXLPipeline`. In case you want to load a PyTorch model and convert it to the ONNX format on-the-fly, you can set `export=True`.

    - from diffusers import StableDiffusionXLPipeline
    + from optimum.onnxruntime import ORTStableDiffusionXLPipeline
    
    model_id = "stabilityai/stable-diffusion-xl-base-1.0"
    - pipeline = StableDiffusionXLPipeline.from_pretrained(model_id)
    + pipeline = ORTStableDiffusionXLPipeline.from_pretrained(model_id)
    prompt = "A majestic lion jumping from a big stone at night"
    image = pipeline(prompt).images[0]
    

You can find more examples in optimum [documentation](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models#stable-diffusion-xl).

[](#uses)Uses
-------------

### [](#direct-use)Direct Use

The model is intended for research purposes only. Possible research areas and tasks include

*   Generation of artworks and use in design and other artistic processes.
*   Applications in educational or creative tools.
*   Research on generative models.
*   Safe deployment of models which have the potential to generate harmful content.
*   Probing and understanding the limitations and biases of generative models.

Excluded uses are described below.

### [](#out-of-scope-use)Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.

[](#limitations-and-bias)Limitations and Bias
---------------------------------------------

### [](#limitations)Limitations

*   The model does not achieve perfect photorealism.
*   The model cannot render legible text.
*   The model struggles with more difficult tasks which involve compositionality, such as rendering an image corresponding to "A red cube on top of a blue sphere".
*   Faces and people in general may not be generated properly.
*   The autoencoding part of the model is lossy.

### [](#bias)Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.

## Model overview

The `stable-diffusion-xl-base-1.0` model is a text-to-image generative AI model developed by [Stability AI](https://aimodels.fyi/creators/huggingFace/stabilityai). It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses two fixed, pretrained text encoders ([OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main)). The model is an ensemble of experts pipeline, where the base model generates latents that are then further processed by a specialized refinement model. Alternatively, the base model can be used on its own to generate latents, which can then be processed using a high-resolution model and the SDEdit technique for image-to-image generation.

Similar models include the [stable-diffusion-xl-refiner-1.0](https://aimodels.fyi/models/huggingFace/stable-diffusion-xl-refiner-10-stabilityai) and [stable-diffusion-xl-refiner-0.9](https://aimodels.fyi/models/huggingFace/stable-diffusion-xl-refiner-09-stabilityai) models, which serve as the refinement modules for the base `stable-diffusion-xl-base-1.0` model.

## Model inputs and outputs

### Inputs
- **Text prompt**: A natural language description of the desired image to generate.

### Outputs
- **Generated image**: An image generated from the input text prompt.

## Capabilities

The `stable-diffusion-xl-base-1.0` model can generate a wide variety of images based on text prompts, ranging from photorealistic scenes to more abstract and stylized imagery. The model performs particularly well on tasks like generating artworks, fantasy scenes, and conceptual designs. However, it struggles with more complex tasks involving compositionality, such as rendering an image of a red cube on top of a blue sphere.

## What can I use it for?

The `stable-diffusion-xl-base-1.0` model is intended for research purposes, such as:

- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models and their limitations and biases.
- Safe deployment of models with the potential to generate harmful content.

For commercial use, Stability AI provides a membership program, as detailed on their [website](https://stability.ai/membership).

## Things to try

One interesting aspect of the `stable-diffusion-xl-base-1.0` model is that it can be used either on its own or together with the refinement model. Try comparing the ensemble-of-experts pipeline with the two-stage SDEdit approach, keeping in mind that the latter is slightly slower because it requires more function evaluations. Inference can also be tuned with `torch.compile` (a 20-30% speedup with `torch >= 2.0`) or with CPU offloading when GPU VRAM is limited, as described in the Diffusers section above.

[](#stable-diffusion-v2-1-model-card)Stable Diffusion v2-1 Model Card
=====================================================================

This model card focuses on the model associated with the Stable Diffusion v2-1 model, codebase available [here](https://github.com/Stability-AI/stablediffusion).

This `stable-diffusion-2-1` model is fine-tuned from [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) (`768-v-ema.ckpt`) with an additional 55k steps on the same dataset (with `punsafe=0.1`), and then fine-tuned for another 155k extra steps with `punsafe=0.98`.
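
To illustrate what the two `punsafe` settings mean in practice, here is a minimal sketch of threshold-based filtering. The sample captions and scores are hypothetical, and the strict `<` comparison is an assumption; the actual data pipeline is not described in this card.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    caption: str
    p_unsafe: float  # classifier's predicted probability that the image is unsafe


def filter_by_punsafe(samples, threshold):
    """Keep only samples whose predicted unsafe probability is below the threshold."""
    return [s for s in samples if s.p_unsafe < threshold]


# Hypothetical scores, for illustration only.
samples = [
    Sample("a mountain lake at sunrise", 0.01),
    Sample("a renaissance figure painting", 0.35),
    Sample("a crowded beach in summer", 0.60),
    Sample("explicit content", 0.995),
]

conservative = filter_by_punsafe(samples, 0.1)   # setting used for the first 55k steps
permissive = filter_by_punsafe(samples, 0.98)    # setting used for the extra 155k steps
print(len(conservative), len(permissive))
```

A low threshold such as 0.1 discards anything the classifier is even mildly unsure about, while 0.98 removes only samples it flags with high confidence.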

*   Use it with the [`stablediffusion`](https://github.com/Stability-AI/stablediffusion) repository: download the `v2-1_768-ema-pruned.ckpt` [here](https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/v2-1_768-ema-pruned.ckpt).
*   Use it with  [`diffusers`](#examples)

[](#model-details)Model Details
-------------------------------

*   **Developed by:** Robin Rombach, Patrick Esser
    
*   **Model type:** Diffusion-based text-to-image generation model
    
*   **Language(s):** English
    
*   **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-2/blob/main/LICENSE-MODEL)
    
*   **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses a fixed, pretrained text encoder ([OpenCLIP-ViT/H](https://github.com/mlfoundations/open_clip)).
    
*   **Resources for more information:** [GitHub Repository](https://github.com/Stability-AI/).
    
*   **Cite as:**
    
        @InProceedings{Rombach_2022_CVPR,
            author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
            title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
            booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
            month     = {June},
            year      = {2022},
            pages     = {10684-10695}
        }
        
    

[](#examples)Examples
---------------------

Use Hugging Face's [Diffusers library](https://github.com/huggingface/diffusers) to run Stable Diffusion 2 in a simple and efficient manner.

    pip install diffusers transformers accelerate scipy safetensors
    

Running the pipeline (if you don't swap the scheduler, it runs with the default DDIM; this example swaps it for `DPMSolverMultistepScheduler`):

    import torch
    from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
    
    model_id = "stabilityai/stable-diffusion-2-1"
    
    # Use the DPMSolverMultistepScheduler (DPM-Solver++) scheduler here instead
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe = pipe.to("cuda")
    
    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
        
    image.save("astronaut_rides_horse.png")
    

**Notes**:

*   Although it is not a dependency, we highly recommend installing [xformers](https://github.com/facebookresearch/xformers) for memory-efficient attention (better performance).
*   If you have little GPU RAM available, add `pipe.enable_attention_slicing()` after sending the pipeline to `cuda` to reduce VRAM usage (at the cost of speed).
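
The attention-slicing note above can be sketched as a small loader. This is a minimal sketch, assuming `diffusers`, `torch`, and a CUDA device are available; the helper name `load_sd21_low_vram` is hypothetical.

```python
def load_sd21_low_vram(model_id: str = "stabilityai/stable-diffusion-2-1"):
    """Load the pipeline in half precision and enable attention slicing."""
    # Imported lazily so that defining this helper does not require a GPU stack.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    # Compute attention in slices to lower peak VRAM usage, at the cost of speed.
    pipe.enable_attention_slicing()
    return pipe


# Example call (downloads the model and requires a CUDA GPU):
# pipe = load_sd21_low_vram()
# image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```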

[](#uses)Uses
=============

[](#direct-use)Direct Use
-------------------------

The model is intended for research purposes only. Possible research areas and tasks include

*   Safe deployment of models which have the potential to generate harmful content.
*   Probing and understanding the limitations and biases of generative models.
*   Generation of artworks and use in design and other artistic processes.
*   Applications in educational or creative tools.
*   Research on generative models.

Excluded uses are described below.

### [](#misuse-malicious-use-and-out-of-scope-use)Misuse, Malicious Use, and Out-of-Scope Use

_Note: This section is originally taken from the [DALLE-MINI model card](https://huggingface.co/dalle-mini/dalle-mini), was used for Stable Diffusion v1, but applies in the same way to Stable Diffusion v2_.

The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.

#### [](#out-of-scope-use)Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.

#### [](#misuse-and-malicious-use)Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:

*   Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
*   Intentionally promoting or propagating discriminatory content or harmful stereotypes.
*   Impersonating individuals without their consent.
*   Sexual content without consent of the people who might see it.
*   Mis- and disinformation
*   Representations of egregious violence and gore
*   Sharing of copyrighted or licensed material in violation of its terms of use.
*   Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.

[](#limitations-and-bias)Limitations and Bias
---------------------------------------------

### [](#limitations)Limitations

*   The model does not achieve perfect photorealism.
*   The model cannot render legible text.
*   The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to "A red cube on top of a blue sphere".
*   Faces and people in general may not be generated properly.
*   The model was trained mainly with English captions and will not work as well in other languages.
*   The autoencoding part of the model is lossy.
*   The model was trained on a subset of the large-scale dataset [LAION-5B](https://laion.ai/blog/laion-5b/), which contains adult, violent and sexual content. To partially mitigate this, we have filtered the dataset using LAION's NSFW detector (see Training section).

### [](#bias)Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion was primarily trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/), which consists of images that are limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. Further, the ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts. Stable Diffusion v2 mirrors and exacerbates biases to such a degree that viewer discretion must be advised irrespective of the input or its intent.

[](#training)Training
---------------------

**Training Data** The model developers used the following dataset for training the model:

*   LAION-5B and subsets (details below). The training data is further filtered using LAION's NSFW detector, with a "p\_unsafe" score of 0.1 (conservative). For more details, please refer to LAION-5B's [NeurIPS 2022](https://openreview.net/forum?id=M3Y74vmsMcY) paper and reviewer discussions on the topic.

**Training Procedure** Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

*   Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
*   Text prompts are encoded through the OpenCLIP-ViT/H text-encoder.
*   The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
*   The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called _v-objective_, see [https://arxiv.org/abs/2202.00512](https://arxiv.org/abs/2202.00512).
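
Two of the steps above can be sketched in a few lines of plain Python: the latent shape produced by the factor-8 autoencoder, and the v-objective target. The target definition `v = alpha * eps - sigma * x` follows the paper linked above and assumes a variance-preserving schedule (`alpha**2 + sigma**2 == 1`). The function names and toy values are hypothetical, and this is not the training code.

```python
import math


def latent_shape(height, width, f=8, channels=4):
    """Map an H x W x 3 image shape to the H/f x W/f x 4 latent shape."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return height // f, width // f, channels


def noisy_latent(x, eps, alpha, sigma):
    """Forward diffusion at one noise level: z = alpha * x + sigma * eps."""
    return [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]


def v_target(x, eps, alpha, sigma):
    """Regression target under the v-objective: v = alpha * eps - sigma * x."""
    return [alpha * ei - sigma * xi for xi, ei in zip(x, eps)]


def recover_x(z, v, alpha, sigma):
    """Invert the parameterization: x = alpha * z - sigma * v."""
    return [alpha * zi - sigma * vi for zi, vi in zip(z, v)]


print(latent_shape(768, 768))  # (96, 96, 4)

# Toy latent and noise values; cos/sin guarantee alpha**2 + sigma**2 == 1.
alpha, sigma = math.cos(0.3), math.sin(0.3)
x = [0.5, -1.0, 2.0]
eps = [0.1, 0.7, -0.4]
z = noisy_latent(x, eps, alpha, sigma)
v = v_target(x, eps, alpha, sigma)
x_rec = recover_x(z, v, alpha, sigma)
```

Recovering `x` from `z` and `v` shows that predicting `v` carries the same information as predicting the noise or the clean latent directly.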

We currently provide the following checkpoints:

*   `512-base-ema.ckpt`: 550k steps at resolution `256x256` on a subset of [LAION-5B](https://laion.ai/blog/laion-5b/) filtered for explicit pornographic material, using the [LAION-NSFW classifier](https://github.com/LAION-AI/CLIP-based-NSFW-Detector) with `punsafe=0.1` and an [aesthetic score](https://github.com/christophschuhmann/improved-aesthetic-predictor) >= `4.5`. 850k steps at resolution `512x512` on the same dataset with resolution `>= 512x512`.
    
*   `768-v-ema.ckpt`: Resumed from `512-base-ema.ckpt` and trained for 150k steps using a [v-objective](https://arxiv.org/abs/2202.00512) on the same dataset. Resumed for another 140k steps on a `768x768` subset of our dataset.
    
*   `512-depth-ema.ckpt`: Resumed from `512-base-ema.ckpt` and finetuned for 200k steps. Added an extra input channel to process the (relative) depth prediction produced by [MiDaS](https://github.com/isl-org/MiDaS) (`dpt_hybrid`) which is used as an additional conditioning. The additional input channels of the U-Net which process this extra information were zero-initialized.
    
*   `512-inpainting-ema.ckpt`: Resumed from `512-base-ema.ckpt` and trained for another 200k steps. Follows the mask-generation strategy presented in [LAMA](https://github.com/saic-mdal/lama) which, in combination with the latent VAE representations of the masked image, are used as an additional conditioning. The additional input channels of the U-Net which process this extra information were zero-initialized. The same strategy was used to train the [1.5-inpainting checkpoint](https://huggingface.co/runwayml/stable-diffusion-inpainting).
    
*   `x4-upscaling-ema.ckpt`: Trained for 1.25M steps on a 10M subset of LAION containing images `>2048x2048`. The model was trained on crops of size `512x512` and is a text-guided [latent upscaling diffusion model](https://arxiv.org/abs/2112.10752). In addition to the textual input, it receives a `noise_level` as an input parameter, which can be used to add noise to the low-resolution input according to a [predefined diffusion schedule](/stabilityai/stable-diffusion-2-1/blob/main/configs/stable-diffusion/x4-upscaling.yaml).
    
*   **Hardware:** 32 x 8 x A100 GPUs
    
*   **Optimizer:** AdamW
    
*   **Gradient Accumulations**: 1
    
*   **Batch:** 32 x 8 x 2 x 4 = 2048
    
*   **Learning rate:** warmup to 0.0001 for 10,000 steps and then kept constant
    

[](#evaluation-results)Evaluation Results
-----------------------------------------

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:

[![pareto](/stabilityai/stable-diffusion-2-1/resolve/main/model-variants.jpg)](/stabilityai/stable-diffusion-2-1/blob/main/model-variants.jpg)

Evaluated using 50 DDIM steps and 10000 random prompts from the COCO2017 validation set, evaluated at 512x512 resolution. Not optimized for FID scores.
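
For readers unfamiliar with the guidance scale swept in this evaluation, the following minimal sketch shows the commonly used classifier-free guidance combination of conditional and unconditional model outputs. This card does not spell out the formula, so treat it as an assumption about how the scales are applied; the function name and toy values are hypothetical.

```python
def apply_guidance(uncond, cond, scale):
    """Extrapolate from the unconditional output toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy model outputs for a two-element latent.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

for scale in (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0):
    print(scale, apply_guidance(uncond, cond, scale))
```

A scale of 1.0 reproduces the conditional output unchanged; larger scales push the result further in the direction the prompt suggests.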

[](#environmental-impact)Environmental Impact
---------------------------------------------

**Stable Diffusion v1** **Estimated Emissions** Based on the hardware, runtime, cloud provider, and compute region listed below, we estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

*   **Hardware Type:** A100 PCIe 40GB
*   **Hours used:** 200000
*   **Cloud Provider:** AWS
*   **Compute Region:** US-east
*   **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 15000 kg CO2 eq.
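
The formula in the last bullet can be written out as a short calculation. Only the hours and the 15000 kg result come from this card. The per-GPU power draw and the grid carbon intensity below are assumptions chosen so that the product reproduces the reported figure; the calculator's actual inputs are not stated here.

```python
def co2_kg(power_kw, hours, kg_co2_per_kwh):
    """Power consumption x time x carbon produced per unit of energy."""
    return power_kw * hours * kg_co2_per_kwh


HOURS_USED = 200_000              # from the card
ASSUMED_POWER_KW = 0.25           # assumption: 250 W board power for an A100 PCIe 40GB
ASSUMED_KG_CO2_PER_KWH = 0.3      # assumption: grid carbon intensity for the region

energy_kwh = ASSUMED_POWER_KW * HOURS_USED
emissions = co2_kg(ASSUMED_POWER_KW, HOURS_USED, ASSUMED_KG_CO2_PER_KWH)
print(f"energy: {energy_kwh:.0f} kWh, emissions: {emissions:.0f} kg CO2 eq.")
```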

[](#citation)Citation
---------------------

    @InProceedings{Rombach_2022_CVPR,
        author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
        title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2022},
        pages     = {10684-10695}
    }
    

_This model card was written by: Robin Rombach, Patrick Esser and David Ha and is based on the [Stable Diffusion v1](https://github.com/CompVis/stable-diffusion/blob/main/Stable_Diffusion_v1_Model_Card.md) and [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini)._

## Model overview

The `stable-diffusion-2-1` model is a text-to-image generation model developed by Stability AI. It is a fine-tuned version of the [stable-diffusion-2](https://aimodels.fyi/models/huggingFace/stable-diffusion-2-stabilityai) model, trained for an additional 55k steps on the same dataset with `punsafe=0.1` and then for a further 155k steps with `punsafe=0.98`. Similar models include the [stable-diffusion-2-1-base](https://aimodels.fyi/models/huggingFace/stable-diffusion-2-1-base-stabilityai), which fine-tunes the [stable-diffusion-2-base](https://aimodels.fyi/models/huggingFace/stable-diffusion-2-base-stabilityai) model.

## Model inputs and outputs

The `stable-diffusion-2-1` model is a diffusion-based text-to-image generation model that takes text prompts as input and generates corresponding images as output. The text prompts are encoded using a fixed, pre-trained text encoder, and the generated images are 768x768 pixels in size.

### Inputs
- **Text prompt**: A natural language description of the desired image.

### Outputs
- **Image**: A 768x768 pixel image generated based on the input text prompt.

## Capabilities

The `stable-diffusion-2-1` model can generate a wide variety of images based on text prompts, from realistic scenes to fantastical creations. It demonstrates impressive capabilities in areas like generating detailed and complex images, rendering different styles and artistic mediums, and combining diverse visual elements. However, the model still has limitations in terms of generating fully photorealistic images, rendering legible text, and handling more complex compositional tasks.

## What can I use it for?

The `stable-diffusion-2-1` model is intended for research purposes only. Possible use cases include generating artworks and designs, creating educational or creative tools, and probing the limitations and biases of generative models. The model should not be used to intentionally create or disseminate images that could be harmful, offensive, or propagate stereotypes.

## Things to try

One interesting aspect of the `stable-diffusion-2-1` model is its ability to generate images with different styles and artistic mediums based on the text prompt. For example, you could try prompts that combine realistic elements with more fantastical or stylized components, or experiment with prompts that evoke specific artistic movements or genres. The model's performance may also vary depending on the language and cultural context of the prompt, so exploring prompts in different languages could yield interesting results.

[](#stable-video-diffusion-image-to-video-model-card)Stable Video Diffusion Image-to-Video Model Card
=====================================================================================================

[![row01](/stabilityai/stable-video-diffusion-img2vid-xt/resolve/main/output_tile.gif)](/stabilityai/stable-video-diffusion-img2vid-xt/blob/main/output_tile.gif) Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it.

Please note: For commercial use, please refer to [https://stability.ai/membership](https://stability.ai/membership).

[](#model-details)Model Details
-------------------------------

### [](#model-description)Model Description

Stable Video Diffusion (SVD) Image-to-Video is a latent diffusion model trained to generate short video clips from an image conditioning. This model was trained to generate 25 frames at resolution 576x1024 given a context frame of the same size, finetuned from [SVD Image-to-Video \[14 frames\]](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid). We also finetune the widely used [f8-decoder](https://huggingface.co/docs/diffusers/api/models/autoencoderkl#loading-from-the-original-format) for temporal consistency. For convenience, we additionally provide the model with the standard frame-wise decoder [here](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/blob/main/svd_xt_image_decoder.safetensors).

*   **Developed by:** Stability AI
*   **Funded by:** Stability AI
*   **Model type:** Generative image-to-video model
*   **Finetuned from model:** SVD Image-to-Video \[14 frames\]

### [](#model-sources)Model Sources

For research purposes, we recommend our `generative-models` GitHub repository ([https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)), which implements the most popular diffusion frameworks (both training and inference).

*   **Repository:** [https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)
*   **Paper:** [https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets](https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets)

[](#evaluation)Evaluation
-------------------------

[![comparison](/stabilityai/stable-video-diffusion-img2vid-xt/resolve/main/comparison.png)](/stabilityai/stable-video-diffusion-img2vid-xt/blob/main/comparison.png) The chart above evaluates user preference for SVD-Image-to-Video over [GEN-2](https://research.runwayml.com/gen2) and [PikaLabs](https://www.pika.art/). SVD-Image-to-Video is preferred by human voters in terms of video quality. For details on the user study, we refer to the [research paper](https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets).

[](#uses)Uses
-------------

### [](#direct-use)Direct Use

The model is intended for both non-commercial and commercial usage. You can use this model for non-commercial or research purposes under this [license](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/blob/main/LICENSE). Possible research areas and tasks include

*   Research on generative models.
*   Safe deployment of models which have the potential to generate harmful content.
*   Probing and understanding the limitations and biases of generative models.
*   Generation of artworks and use in design and other artistic processes.
*   Applications in educational or creative tools.

For commercial use, please refer to [https://stability.ai/membership](https://stability.ai/membership).

Excluded uses are described below.

### [](#out-of-scope-use)Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities. The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).

[](#limitations-and-bias)Limitations and Bias
---------------------------------------------

### [](#limitations)Limitations

*   The generated videos are rather short (<= 4sec), and the model does not achieve perfect photorealism.
*   The model may generate videos without motion, or very slow camera pans.
*   The model cannot be controlled through text.
*   The model cannot render legible text.
*   Faces and people in general may not be generated properly.
*   The autoencoding part of the model is lossy.

### [](#recommendations)Recommendations

The model is intended for both non-commercial and commercial usage.

[](#how-to-get-started-with-the-model)How to Get Started with the Model
-----------------------------------------------------------------------

Check out [https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)

[](#appendix)Appendix:
======================

All considered potential data sources were included for final training, with none held out, as the proposed data filtering methods described in the SVD paper handle the quality control/filtering of the dataset. With regards to safety/NSFW filtering, sources considered were either deemed safe or filtered with the in-house NSFW filters. No explicit human labor is involved in training data preparation. However, human evaluation for model outputs and quality was extensively used to evaluate model quality and performance. The evaluations were performed with third-party contractor platforms (Amazon Sagemaker, Amazon Mechanical Turk, Prolific) with fluent English-speaking contractors from various countries, primarily from the USA, UK, and Canada. Each worker was paid $12/hr for the time invested in the evaluation. No other third party was involved in the development of this model; the model was fully developed in-house at Stability AI.

Training the SVD checkpoints required a total of approximately 200,000 A100 80GB hours. The majority of the training occurred on 48 \* 8 A100s, while some stages took more/less than that. The resulting CO2 emission is ~19,000 kg CO2 eq., and energy consumed is ~64,000 kWh.

The released checkpoints (SVD/SVD-XT) are image-to-video models that generate short videos/animations closely following the given input image. Since the model relies on an existing supplied image, the risk of disclosing specific material or novel unsafe content is minimal. This was also evaluated by third-party independent red-teaming services, which agree with our conclusion to a high degree of confidence (>90% in various areas of safety red-teaming). The external evaluations were also performed for trustworthiness, leading to >95% confidence in real, trustworthy videos.

With the default settings at the time of release, SVD takes ~100s for generation, and SVD-XT takes ~180s on an A100 80GB card.
Several optimizations to trade off quality / memory / speed can be done to perform faster inference or inference on lower VRAM cards. The information related to the model and its development process and usage protocols can be found in the GitHub repo, associated research paper, and HuggingFace model page/cards. The released model inference & demo code has image-level watermarking enabled by default, which can be used to detect the outputs. This is done via the imWatermark Python library.  
The model can be used to generate videos from static initial images. However, we prohibit unlawful, obscene, or misleading uses of the model consistent with the terms of our license. For the open-weights release, our training data filtering mitigations alleviate this to some extent. These restrictions are explicitly enforced on user-facing interfaces at stablevideo.com, where a warning is issued. We do not take any responsibility for third-party interfaces. Submitting initial images that bypass input filters to tease out offensive or inappropriate content listed above is also prohibited. Safety filtering checks at stablevideo.com run on model inputs and outputs independently. More details on our user-facing interfaces can be found here: [https://www.stablevideo.com/faq](https://www.stablevideo.com/faq)  
For stablevideo.com, we store preference data in the form of upvotes/downvotes on user-generated videos, and we have a pairwise ranker that runs while a user generates videos. This usage data is solely used for improving Stability AI's future image/video models and services. No other third-party entities are given access to the usage data beyond Stability AI and maintainers of stablevideo.com. For usage statistics of SVD, we refer interested users to HuggingFace model download/usage statistics as a primary indicator. Third-party applications have also reported model usage statistics. We might also consider releasing aggregate usage statistics of stablevideo.com on reaching some milestones.
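
As a rough sanity check on the compute figures quoted in this appendix, the sketch below derives a few averages from them. It uses only the stated numbers (approximately 200,000 GPU hours, 48 \* 8 GPUs, ~64,000 kWh, ~19,000 kg CO2 eq.); the function and key names are illustrative, and the wall-clock estimate assumes the whole run used the same cluster size, which the text says was only mostly the case.

```python
# Figures quoted in the appendix (all approximate).
GPU_HOURS = 200_000        # total A100 80GB hours
NUM_GPUS = 48 * 8          # cluster size used for most of the training
ENERGY_KWH = 64_000        # reported energy consumption
CO2_KG = 19_000            # reported emissions in kg CO2 eq.


def training_summary(gpu_hours=GPU_HOURS, num_gpus=NUM_GPUS,
                     energy_kwh=ENERGY_KWH, co2_kg=CO2_KG):
    """Derive average quantities implied by the reported totals."""
    # Assumption: the cluster size stayed constant for the whole run.
    wall_clock_hours = gpu_hours / num_gpus
    return {
        "num_gpus": num_gpus,
        "wall_clock_days": wall_clock_hours / 24,
        "kwh_per_gpu_hour": energy_kwh / gpu_hours,
        "kg_co2_per_kwh": co2_kg / energy_kwh,
    }


summary = training_summary()
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

These values are averages implied by the reported totals, not additional measurements.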

## Model overview

The `stable-video-diffusion-img2vid-xt` model is a diffusion-based generative model developed by [Stability AI](https://aimodels.fyi/creators/huggingFace/stabilityai) that takes in a still image and generates a short video clip from it. It is an extension of the [SVD Image-to-Video](https://aimodels.fyi/models/huggingFace/stable-video-diffusion-img2vid-stabilityai) model, generating 25 frames at a resolution of 576x1024 compared to the 14 frames of the earlier model. This model was trained on a large dataset and finetuned to improve temporal consistency and video quality.

## Model inputs and outputs

The `stable-video-diffusion-img2vid-xt` model takes in a single image as input and generates a short video clip as output. The input image must be 576x1024 pixels in size.

### Inputs
- **Image**: A 576x1024 pixel image that serves as the conditioning frame for the video generation.

### Outputs
- **Video**: A 25 frame video clip at 576x1024 resolution, generated from the input image.
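
Because the input is expected at a fixed size, an arbitrary image usually has to be resized and cropped first. The following is a minimal sketch of that geometry, assuming "576x1024" means height x width and assuming a scale-to-cover plus center-crop policy; the helper name `cover_and_center_crop` is hypothetical and not part of any Stability AI code.

```python
TARGET_WIDTH, TARGET_HEIGHT = 1024, 576  # "576x1024" read as height x width


def _ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, avoiding floating-point error."""
    return -(-a // b)


def cover_and_center_crop(width: int, height: int,
                          target_width: int = TARGET_WIDTH,
                          target_height: int = TARGET_HEIGHT):
    """Return (resized_size, crop_box) that fit an image to the target size.

    The image is scaled uniformly until it covers the target, then the
    overflow is cropped equally from both sides. crop_box is
    (left, top, right, bottom) in the resized image's pixel coordinates.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if width * target_height >= height * target_width:
        # Input is at least as wide as the target aspect ratio: match heights.
        new_height = target_height
        new_width = _ceil_div(width * target_height, height)
    else:
        # Input is taller than the target aspect ratio: match widths.
        new_width = target_width
        new_height = _ceil_div(height * target_width, width)
    left = (new_width - target_width) // 2
    top = (new_height - target_height) // 2
    crop_box = (left, top, left + target_width, top + target_height)
    return (new_width, new_height), crop_box


print(cover_and_center_crop(1920, 1080))
print(cover_and_center_crop(1000, 1000))
```

With an imaging library such as Pillow, the two return values could be passed to `Image.resize` and `Image.crop` respectively.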

## Capabilities

The `stable-video-diffusion-img2vid-xt` model is capable of generating short, high-quality video clips from a single input image. It is able to capture movement, action, and dynamic scenes based on the content of the conditioning image. While it does not achieve perfect photorealism, the generated videos demonstrate impressive temporal consistency and visual fidelity.

## What can I use it for?

The `stable-video-diffusion-img2vid-xt` model is intended for research purposes, such as exploring generative models, probing the limitations of video generation, and developing artistic or creative applications. It could be used to generate dynamic visual content for design, educational, or entertainment purposes. However, the model should not be used to generate content that is harmful, misleading, or in violation of Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).

## Things to try

One interesting aspect of the `stable-video-diffusion-img2vid-xt` model is its ability to generate video from a single image, capturing a sense of motion and dynamism that goes beyond the static source. Experimenting with different types of input images, such as landscapes, portraits, or abstract compositions, could lead to a diverse range of video outputs that showcase the model's flexibility and creativity. Additionally, since the model cannot be controlled through text, you could try varying the available conditioning parameters to see how the model responds and explore the limits of its capabilities.

[](#sdxl-turbo-model-card)SDXL-Turbo Model Card
===============================================

[![row01](/stabilityai/sdxl-turbo/resolve/main/output_tile.jpg)](/stabilityai/sdxl-turbo/blob/main/output_tile.jpg) SDXL-Turbo is a fast generative text-to-image model that can synthesize photorealistic images from a text prompt in a single network evaluation. A real-time demo is available here: [http://clipdrop.co/stable-diffusion-turbo](http://clipdrop.co/stable-diffusion-turbo)

Please note: For commercial use, please refer to [https://stability.ai/membership](https://stability.ai/membership).

[](#model-details)Model Details
-------------------------------

### [](#model-description)Model Description

SDXL-Turbo is a distilled version of [SDXL 1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), trained for real-time synthesis. SDXL-Turbo is based on a novel training method called Adversarial Diffusion Distillation (ADD) (see the [technical report](https://stability.ai/research/adversarial-diffusion-distillation)), which allows sampling large-scale foundational image diffusion models in 1 to 4 steps at high image quality. This approach uses score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal and combines this with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps.

*   **Developed by:** Stability AI
*   **Funded by:** Stability AI
*   **Model type:** Generative text-to-image model
*   **Finetuned from model:** [SDXL 1.0 Base](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)

### [](#model-sources)Model Sources

For research purposes, we recommend our `generative-models` Github repository ([https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)), which implements the most popular diffusion frameworks (both training and inference).

*   **Repository:** [https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)
*   **Paper:** [https://stability.ai/research/adversarial-diffusion-distillation](https://stability.ai/research/adversarial-diffusion-distillation)
*   **Demo:** [http://clipdrop.co/stable-diffusion-turbo](http://clipdrop.co/stable-diffusion-turbo)

[](#evaluation)Evaluation
-------------------------

[![comparison1](/stabilityai/sdxl-turbo/resolve/main/image_quality_one_step.png)](/stabilityai/sdxl-turbo/blob/main/image_quality_one_step.png) [![comparison2](/stabilityai/sdxl-turbo/resolve/main/prompt_alignment_one_step.png)](/stabilityai/sdxl-turbo/blob/main/prompt_alignment_one_step.png) The charts above evaluate user preference for SDXL-Turbo over other single- and multi-step models. SDXL-Turbo evaluated at a single step is preferred by human voters in terms of image quality and prompt following over LCM-XL evaluated at four (or fewer) steps. In addition, we see that using four steps for SDXL-Turbo further improves performance. For details on the user study, we refer to the [research paper](https://stability.ai/research/adversarial-diffusion-distillation).

[](#uses)Uses
-------------

### [](#direct-use)Direct Use

The model is intended for both non-commercial and commercial usage. You can use this model for non-commercial or research purposes under this [license](https://huggingface.co/stabilityai/sdxl-turbo/blob/main/LICENSE.TXT). Possible research areas and tasks include

*   Research on generative models.
*   Research on real-time applications of generative models.
*   Research on the impact of real-time generative models.
*   Safe deployment of models which have the potential to generate harmful content.
*   Probing and understanding the limitations and biases of generative models.
*   Generation of artworks and use in design and other artistic processes.
*   Applications in educational or creative tools.

For commercial use, please refer to [https://stability.ai/membership](https://stability.ai/membership).

Excluded uses are described below.

### [](#diffusers)Diffusers

    pip install diffusers transformers accelerate --upgrade
    

*   **Text-to-image**:

SDXL-Turbo does not make use of `guidance_scale` or `negative_prompt`; we disable guidance with `guidance_scale=0.0`. The model preferably generates images of size 512x512, but larger image sizes work as well. A **single step** is enough to generate high-quality images.

    from diffusers import AutoPipelineForText2Image
    import torch
    
    pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
    pipe.to("cuda")
    
    prompt = "A cinematic shot of a baby raccoon wearing an intricate Italian priest robe."
    
    image = pipe(prompt=prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
    

*   **Image-to-image**:

When using SDXL-Turbo for image-to-image generation, make sure that `num_inference_steps` \* `strength` is greater than or equal to 1. The image-to-image pipeline will run for `int(num_inference_steps * strength)` steps, _e.g._ 2 \* 0.5 = 1 step in our example below.

    from diffusers import AutoPipelineForImage2Image
    from diffusers.utils import load_image
    import torch
    
    pipe = AutoPipelineForImage2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
    pipe.to("cuda")
    
    init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png").resize((512, 512))
    
    prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
    
    image = pipe(prompt, image=init_image, num_inference_steps=2, strength=0.5, guidance_scale=0.0).images[0]
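
The step rule stated above can be checked before calling the pipeline. The sketch below is a standalone illustration of `int(num_inference_steps * strength)`; the helper names are hypothetical, and the assumption that `strength` lies in [0, 1] is ours rather than something stated in this card.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps the image-to-image pipeline runs,
    following the int(num_inference_steps * strength) rule."""
    # Assumption: strength is a fraction between 0 and 1.
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength is expected to lie in [0, 1]")
    return int(num_inference_steps * strength)


def check_settings(num_inference_steps: int, strength: float) -> int:
    """Raise early if the settings would leave the pipeline with zero steps."""
    steps = effective_steps(num_inference_steps, strength)
    if steps < 1:
        raise ValueError(
            f"num_inference_steps * strength = {num_inference_steps * strength} "
            "must be at least 1"
        )
    return steps


# The settings used in the image-to-image example.
print(check_settings(2, 0.5))
```

For instance, `num_inference_steps=1` with `strength=0.5` would be rejected, because the product truncates to zero steps.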
    

### [](#out-of-scope-use)Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities. The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).

[](#limitations-and-bias)Limitations and Bias
---------------------------------------------

### [](#limitations)Limitations

*   The generated images are of a fixed resolution (512x512 pixels), and the model does not achieve perfect photorealism.
*   The model cannot render legible text.
*   Faces and people in general may not be generated properly.
*   The autoencoding part of the model is lossy.

### [](#recommendations)Recommendations

The model is intended for both non-commercial and commercial usage.

[](#how-to-get-started-with-the-model)How to Get Started with the Model
-----------------------------------------------------------------------

Check out [https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)

## Model Overview

`sdxl-turbo` is a fast generative text-to-image model developed by [Stability AI](https://aimodels.fyi/creators/huggingFace/stabilityai). It is a distilled version of the [SDXL 1.0 Base](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model, trained using a novel technique called Adversarial Diffusion Distillation (ADD) to enable high-quality image synthesis in just 1-4 steps. This approach leverages a large-scale off-the-shelf image diffusion model as a teacher signal and combines it with an adversarial loss to ensure high fidelity even with fewer sampling steps.

## Model Inputs and Outputs

`sdxl-turbo` is a text-to-image generative model. It takes a text prompt as input and generates a corresponding photorealistic image as output. The model is optimized for real-time synthesis, allowing for fast image generation from a text description.

### Inputs
- Text prompt describing the desired image

### Outputs
- Photorealistic image generated based on the input text prompt

## Capabilities

`sdxl-turbo` is capable of generating high-quality, photorealistic images from text prompts in a single network evaluation. This makes it suitable for real-time, interactive applications where fast image synthesis is required.

## What Can I Use It For?

With `sdxl-turbo`'s fast and high-quality image generation capabilities, you can explore a variety of applications, such as interactive art tools, visual storytelling platforms, or even prototyping and visualization for product design. The model's real-time performance also makes it well-suited for use in live demos or AI-powered creative assistants. For commercial use, please refer to [Stability AI's membership options](https://stability.ai/membership).

## Things to Try

One interesting aspect of `sdxl-turbo` is its ability to generate images with a high degree of fidelity using just 1-4 sampling steps. This makes it possible to experiment with rapid image synthesis, where the user can quickly generate and iterate on visual ideas. Try exploring different text prompts and observe how the model's output changes with the number of sampling steps.

[](#stable-diffusion-v2-model-card)Stable Diffusion v2 Model Card
=================================================================

This model card focuses on the Stable Diffusion v2 model, available [here](https://github.com/Stability-AI/stablediffusion).

This `stable-diffusion-2` model is resumed from [stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) (`512-base-ema.ckpt`) and trained for 150k steps using a [v-objective](https://arxiv.org/abs/2202.00512) on the same dataset. It was then resumed for another 140k steps on `768x768` images.

[![image](https://github.com/Stability-AI/stablediffusion/blob/main/assets/stable-samples/txt2img/768/merged-0005.png?raw=true)](https://github.com/Stability-AI/stablediffusion/blob/main/assets/stable-samples/txt2img/768/merged-0005.png?raw=true)

*   Use it with the [`stablediffusion`](https://github.com/Stability-AI/stablediffusion) repository: download the `768-v-ema.ckpt` [here](https://huggingface.co/stabilityai/stable-diffusion-2/blob/main/768-v-ema.ckpt).
*   Use it with [`diffusers`](https://huggingface.co/stabilityai/stable-diffusion-2#examples)

[](#model-details)Model Details
-------------------------------

*   **Developed by:** Robin Rombach, Patrick Esser
    
*   **Model type:** Diffusion-based text-to-image generation model
    
*   **Language(s):** English
    
*   **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-2/blob/main/LICENSE-MODEL)
    
*   **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses a fixed, pretrained text encoder ([OpenCLIP-ViT/H](https://github.com/mlfoundations/open_clip)).
    
*   **Resources for more information:** [GitHub Repository](https://github.com/Stability-AI/).
    
*   **Cite as:**
    
        @InProceedings{Rombach_2022_CVPR,
            author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
            title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
            booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
            month     = {June},
            year      = {2022},
            pages     = {10684-10695}
        }
        
    

[](#examples)Examples
---------------------

Use Hugging Face's [Diffusers library](https://github.com/huggingface/diffusers) to run Stable Diffusion 2 in a simple and efficient manner.

    pip install diffusers transformers accelerate scipy safetensors
    

Running the pipeline (if you don't swap the scheduler, it will run with the default DDIM; in this example we swap it to EulerDiscreteScheduler):

    import torch
    from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
    
    model_id = "stabilityai/stable-diffusion-2"
    
    # Use the Euler scheduler here instead of the default
    scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
    # Load the weights in half precision and move the pipeline to the GPU
    pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    
    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
    
    image.save("astronaut_rides_horse.png")
    

**Notes**:

*   Although it is not a dependency, we highly recommend installing [xformers](https://github.com/facebookresearch/xformers) for memory-efficient attention (better performance).
*   If you have low GPU RAM available, add `pipe.enable_attention_slicing()` after sending the pipeline to `cuda` to reduce VRAM usage (at the cost of speed).

[](#uses)Uses
=============

[](#direct-use)Direct Use
-------------------------

The model is intended for research purposes only. Possible research areas and tasks include

*   Safe deployment of models which have the potential to generate harmful content.
*   Probing and understanding the limitations and biases of generative models.
*   Generation of artworks and use in design and other artistic processes.
*   Applications in educational or creative tools.
*   Research on generative models.

Excluded uses are described below.

### [](#misuse-malicious-use-and-out-of-scope-use)Misuse, Malicious Use, and Out-of-Scope Use

_Note: This section was originally taken from the [DALLE-MINI model card](https://huggingface.co/dalle-mini/dalle-mini) and was used for Stable Diffusion v1, but it applies in the same way to Stable Diffusion v2_.

The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.

#### [](#out-of-scope-use)Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.

#### [](#misuse-and-malicious-use)Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:

*   Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
*   Intentionally promoting or propagating discriminatory content or harmful stereotypes.
*   Impersonating individuals without their consent.
*   Sexual content without consent of the people who might see it.
*   Mis- and disinformation
*   Representations of egregious violence and gore
*   Sharing of copyrighted or licensed material in violation of its terms of use.
*   Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.

[](#limitations-and-bias)Limitations and Bias
---------------------------------------------

### [](#limitations)Limitations

*   The model does not achieve perfect photorealism.
*   The model cannot render legible text.
*   The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to "A red cube on top of a blue sphere".
*   Faces and people in general may not be generated properly.
*   The model was trained mainly with English captions and will not work as well in other languages.
*   The autoencoding part of the model is lossy.
*   The model was trained on a subset of the large-scale dataset [LAION-5B](https://laion.ai/blog/laion-5b/), which contains adult, violent and sexual content. To partially mitigate this, we have filtered the dataset using LAION's NSFW detector (see Training section).

### [](#bias)Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion was primarily trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/), which consists of images that are limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. Further, the ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts. Stable Diffusion v2 mirrors and exacerbates biases to such a degree that viewer discretion must be advised irrespective of the input or its intent.

[](#training)Training
---------------------

**Training Data** The model developers used the following dataset for training the model:

*   LAION-5B and subsets (details below). The training data is further filtered using LAION's NSFW detector, with a "p\_unsafe" score of 0.1 (conservative). For more details, please refer to LAION-5B's [NeurIPS 2022](https://openreview.net/forum?id=M3Y74vmsMcY) paper and reviewer discussions on the topic.

**Training Procedure** Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

*   Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
*   Text prompts are encoded through the OpenCLIP-ViT/H text-encoder.
*   The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
*   The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called _v-objective_, see [https://arxiv.org/abs/2202.00512](https://arxiv.org/abs/2202.00512).
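
To make the latent shapes in the first step concrete, here is a small sketch that applies the H/f x W/f x 4 mapping with the downsampling factor of 8 given above. The function name is illustrative, and requiring the image sides to be multiples of the factor is our simplifying assumption.

```python
def latent_shape(height: int, width: int, factor: int = 8, channels: int = 4):
    """Shape (H/f, W/f, C) of the latent for an H x W x 3 input image."""
    # Assumption: inputs are sized so that the division is exact.
    if height % factor or width % factor:
        raise ValueError("height and width should be multiples of the downsampling factor")
    return (height // factor, width // factor, channels)


# The two training resolutions mentioned in this card.
for side in (512, 768):
    print(side, latent_shape(side, side))
```

A 768x768 image therefore corresponds to a 96x96x4 latent, which is 48 times fewer values than the 768x768x3 pixel array.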

We currently provide the following checkpoints:

*   `512-base-ema.ckpt`: 550k steps at resolution `256x256` on a subset of [LAION-5B](https://laion.ai/blog/laion-5b/) filtered for explicit pornographic material, using the [LAION-NSFW classifier](https://github.com/LAION-AI/CLIP-based-NSFW-Detector) with `punsafe=0.1` and an [aesthetic score](https://github.com/christophschuhmann/improved-aesthetic-predictor) >= `4.5`. 850k steps at resolution `512x512` on the same dataset with resolution `>= 512x512`.
    
*   `768-v-ema.ckpt`: Resumed from `512-base-ema.ckpt` and trained for 150k steps using a [v-objective](https://arxiv.org/abs/2202.00512) on the same dataset. Resumed for another 140k steps on a `768x768` subset of our dataset.
    
*   `512-depth-ema.ckpt`: Resumed from `512-base-ema.ckpt` and finetuned for 200k steps. Added an extra input channel to process the (relative) depth prediction produced by [MiDaS](https://github.com/isl-org/MiDaS) (`dpt_hybrid`) which is used as an additional conditioning. The additional input channels of the U-Net which process this extra information were zero-initialized.
    
*   `512-inpainting-ema.ckpt`: Resumed from `512-base-ema.ckpt` and trained for another 200k steps. Follows the mask-generation strategy presented in [LAMA](https://github.com/saic-mdal/lama) which, in combination with the latent VAE representations of the masked image, are used as an additional conditioning. The additional input channels of the U-Net which process this extra information were zero-initialized. The same strategy was used to train the [1.5-inpainting checkpoint](https://github.com/saic-mdal/lama).
    
*   `x4-upscaling-ema.ckpt`: Trained for 1.25M steps on a 10M subset of LAION containing images `>2048x2048`. The model was trained on crops of size `512x512` and is a text-guided [latent upscaling diffusion model](https://arxiv.org/abs/2112.10752). In addition to the textual input, it receives a `noise_level` as an input parameter, which can be used to add noise to the low-resolution input according to a [predefined diffusion schedule](/stabilityai/stable-diffusion-2/blob/main/configs/stable-diffusion/x4-upscaling.yaml).
    
*   **Hardware:** 32 x 8 x A100 GPUs
    
*   **Optimizer:** AdamW
    
*   **Gradient Accumulations**: 1
    
*   **Batch:** 32 x 8 x 2 x 4 = 2048
    
*   **Learning rate:** warmup to 0.0001 for 10,000 steps and then kept constant
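
The learning-rate entry above can be read as a simple schedule. The sketch below is one possible interpretation: the card only gives the peak value (0.0001) and the warmup length (10,000 steps), so the linear shape of the warmup and its starting value of 0 are assumptions, and the function name is illustrative.

```python
def learning_rate(step: int, peak_lr: float = 1e-4, warmup_steps: int = 10_000) -> float:
    """Warm up to peak_lr over warmup_steps, then hold it constant.

    Assumption: the warmup is linear and starts from 0.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


for step in (0, 5_000, 10_000, 150_000):
    print(step, learning_rate(step))
```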
    

[](#evaluation-results)Evaluation Results
-----------------------------------------

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:

[![pareto](/stabilityai/stable-diffusion-2/resolve/main/model-variants.jpg)](/stabilityai/stable-diffusion-2/blob/main/model-variants.jpg)

Evaluated using 50 DDIM steps and 10000 random prompts from the COCO2017 validation set, evaluated at 512x512 resolution. Not optimized for FID scores.
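
The guidance scales above refer to classifier-free guidance, in which the sampler mixes a conditional and an unconditional noise prediction at every step. The sketch below shows a commonly used form of that combination on plain Python lists. The function name is illustrative, and the convention that a scale of 1.0 reproduces the conditional prediction is an assumption, since some papers parameterize the scale differently.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions.

    Computes uncond + scale * (cond - uncond) element-wise.
    """
    if len(noise_uncond) != len(noise_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


# Toy two-element "predictions" to show how the scale extrapolates.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]
for scale in (1.5, 3.0, 8.0):
    print(scale, apply_guidance(uncond, cond, scale))
```

Larger scales push the result further from the unconditional prediction, which is the trade-off the evaluation sweeps over.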

[](#environmental-impact)Environmental Impact
---------------------------------------------

**Stable Diffusion v1** **Estimated Emissions** Based on that information, we estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.

*   **Hardware Type:** A100 PCIe 40GB
*   **Hours used:** 200000
*   **Cloud Provider:** AWS
*   **Compute Region:** US-east
*   **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 15000 kg CO2 eq.
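
The formula named in the last entry (power consumption x time x carbon intensity of the grid) can be written out as follows. The hours and the emissions total come from the list above; the average power draw of 0.25 kW per GPU is a placeholder assumption of ours, so the carbon intensity derived from it is illustrative only and not a figure from the model card.

```python
def carbon_emitted_kg(power_kw: float, hours: float, kg_co2_per_kwh: float) -> float:
    """Power consumption x time x carbon intensity of the power grid."""
    return power_kw * hours * kg_co2_per_kwh


HOURS_USED = 200_000       # from the list above
REPORTED_KG_CO2 = 15_000   # from the list above

# Emissions per GPU hour implied by the two reported figures.
kg_per_gpu_hour = REPORTED_KG_CO2 / HOURS_USED

# Assumption (not from the model card): an average draw of 0.25 kW per GPU.
ASSUMED_POWER_KW = 0.25
implied_intensity = kg_per_gpu_hour / ASSUMED_POWER_KW  # kg CO2 eq. per kWh

# Plugging the values back into the formula recovers the reported total.
estimate = carbon_emitted_kg(ASSUMED_POWER_KW, HOURS_USED, implied_intensity)
print(f"{kg_per_gpu_hour:.3f} kg/GPU-hour, {implied_intensity:.2f} kg/kWh, {estimate:.0f} kg total")
```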

[](#citation)Citation
---------------------

    @InProceedings{Rombach_2022_CVPR,
        author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
        title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2022},
        pages     = {10684-10695}
    }
    

_This model card was written by: Robin Rombach, Patrick Esser and David Ha and is based on the [Stable Diffusion v1](https://github.com/CompVis/stable-diffusion/blob/main/Stable_Diffusion_v1_Model_Card.md) and [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini)._

## Model overview

The `stable-diffusion-2` model is a diffusion-based text-to-image generation model developed by [Stability AI](https://aimodels.fyi/creators/huggingFace/stabilityai). It is resumed from the `stable-diffusion-2-base` checkpoint and trained for 150k steps using a v-objective on the same dataset, then for another 140k steps on 768x768 images. The model is capable of generating high-resolution images (768x768) from text prompts, and can be used with the `stablediffusion` repository or the `diffusers` library.

Similar models include the [SDXL-Turbo](https://aimodels.fyi/models/huggingFace/sdxl-turbo-stabilityai) and [Stable Cascade](https://aimodels.fyi/models/huggingFace/stable-cascade-stabilityai) models, which are also developed by Stability AI. The SDXL-Turbo model is a distilled version of the SDXL 1.0 model, optimized for real-time synthesis, while the Stable Cascade model uses a novel multi-stage architecture to achieve high-quality image generation with a smaller latent space.

## Model inputs and outputs

### Inputs
- **Text prompt**: A text description of the desired image, which the model uses to generate the corresponding image.

### Outputs
- **Image**: The generated image based on the input text prompt, with a resolution of 768x768 pixels.

## Capabilities

The `stable-diffusion-2` model can be used to generate a wide variety of images from text prompts, including photorealistic scenes, imaginative concepts, and abstract compositions. The model has been trained on a large and diverse dataset, allowing it to handle a broad range of subject matter and styles. 

Some example use cases for the model include:
- Creating original artwork and illustrations
- Generating concept art for games, films, or other media
- Experimenting with different visual styles and aesthetics
- Assisting with visual brainstorming and ideation

## What can I use it for?

The `stable-diffusion-2` model is intended for both non-commercial and commercial usage. For non-commercial or research purposes, you can use the model under the [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-2/blob/main/LICENSE-MODEL). Possible research areas and tasks include:

- Research on generative models
- Research on the impact of real-time generative models
- Probing and understanding the limitations and biases of generative models
- Generation of artworks and use in design and other artistic processes
- Applications in educational or creative tools

For commercial use, please refer to [https://stability.ai/membership](https://stability.ai/membership).

## Things to try

One interesting aspect of the `stable-diffusion-2` model is its ability to generate highly detailed and photorealistic images, even for complex scenes and concepts. Try experimenting with detailed prompts that describe intricate settings, characters, or objects, and see the model's ability to bring those visions to life.

Additionally, you can explore the model's versatility by generating images in a variety of styles, from realism to surrealism, impressionism to expressionism. Experiment with different artistic styles and see how the model interprets and renders them.

[](#sd-xl-10-refiner-model-card)SD-XL 1.0-refiner Model Card
============================================================

[![row01](/stabilityai/stable-diffusion-xl-refiner-1.0/resolve/main/01.png)](/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/01.png)

[](#model)Model
---------------

[![pipeline](/stabilityai/stable-diffusion-xl-refiner-1.0/resolve/main/pipeline.png)](/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/pipeline.png)

[SDXL](https://arxiv.org/abs/2307.01952) consists of an [ensemble of experts](https://arxiv.org/abs/2211.01324) pipeline for latent diffusion: In a first step, the base model (available here: [https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)) is used to generate (noisy) latents, which are then further processed with a refinement model specialized for the final denoising steps. Note that the base model can be used as a standalone module.

Alternatively, we can use a two-stage pipeline as follows: First, the base model is used to generate latents of the desired output size. In the second step, we use a specialized high-resolution model and apply a technique called SDEdit ([https://arxiv.org/abs/2108.01073](https://arxiv.org/abs/2108.01073), also known as "img2img") to the latents generated in the first step, using the same prompt. This technique is slightly slower than the first one, as it requires more function evaluations.
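
The claim that the SDEdit variant needs more function evaluations can be illustrated with a little bookkeeping. This is a simplified sketch, not how any particular library schedules its steps: the `handoff` and `strength` fractions and the step count are assumed example values, and classifier-free guidance (which would multiply every count equally) is ignored.

```python
def unet_evaluations(mode, num_steps=40, handoff=0.8, strength=0.3):
    """Count denoiser calls for one image under two ways of combining base and refiner.

    mode="ensemble": base and refiner share ONE schedule of num_steps steps;
        the base handles the first `handoff` fraction, the refiner the rest.
    mode="sdedit": the base runs the full schedule, then the refiner re-noises
        the result and denoises it for a `strength` fraction of a second schedule.
    """
    if mode == "ensemble":
        base = round(num_steps * handoff)
        refiner = num_steps - base
    elif mode == "sdedit":
        base = num_steps
        refiner = round(num_steps * strength)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"base": base, "refiner": refiner, "total": base + refiner}

ensemble = unet_evaluations("ensemble")
sdedit = unet_evaluations("sdedit")
```

With these assumed settings the ensemble variant never exceeds `num_steps` evaluations in total, while the SDEdit variant adds the refiner's steps on top of a complete base run.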

Source code is available at [https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models) .

### [](#model-description)Model Description

*   **Developed by:** Stability AI
*   **Model type:** Diffusion-based text-to-image generative model
*   **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/LICENSE.md)
*   **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses two fixed, pretrained text encoders ([OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main)).
*   **Resources for more information:** Check out our [GitHub Repository](https://github.com/Stability-AI/generative-models) and the [SDXL report on arXiv](https://arxiv.org/abs/2307.01952).

### [](#model-sources)Model Sources

For research purposes, we recommend our `generative-models` GitHub repository ([https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)), which implements the most popular diffusion frameworks (both training and inference) and for which new functionalities like distillation will be added over time. [Clipdrop](https://clipdrop.co/stable-diffusion) provides free SDXL inference.

*   **Repository:** [https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)
*   **Demo:** [https://clipdrop.co/stable-diffusion](https://clipdrop.co/stable-diffusion)

[](#evaluation)Evaluation
-------------------------

[![comparison](/stabilityai/stable-diffusion-xl-refiner-1.0/resolve/main/comparison.png)](/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/comparison.png) The chart above evaluates user preference for SDXL (with and without refinement) over SDXL 0.9 and Stable Diffusion 1.5 and 2.1. The SDXL base model performs significantly better than the previous variants, and the model combined with the refinement module achieves the best overall performance.

### [](#-diffusers) Diffusers

Make sure to upgrade diffusers to >= 0.18.0:

    pip install diffusers --upgrade
    

In addition make sure to install `transformers`, `safetensors`, `accelerate` as well as the invisible watermark:

    pip install invisible_watermark transformers accelerate safetensors
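
Before loading a pipeline it can help to confirm that the packages above are actually present. The helper below is a sketch using only the standard library; `parse_release` and `check_requirements` are hypothetical names, and the version parsing is deliberately naive (it keeps only the leading numeric components).

```python
from importlib.metadata import PackageNotFoundError, version

def parse_release(v):
    """Turn '0.18.2' or '0.19.0.dev0' into a 3-tuple of leading integers."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple((parts + [0, 0, 0])[:3])  # pad so '0.18' compares equal to '0.18.0'

def check_requirements(minimums):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if minimum is not None and parse_release(installed) < parse_release(minimum):
            problems.append(f"{name}: found {installed}, need >= {minimum}")
    return problems

# Minimum for diffusers comes from the text above; the others have no stated minimum.
REQUIRED = {
    "diffusers": "0.18.0",
    "transformers": None,
    "accelerate": None,
    "safetensors": None,
}

for problem in check_requirements(REQUIRED):
    print(problem)
```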
    

You can then use the refiner to improve images.

    import torch
    from diffusers import StableDiffusionXLImg2ImgPipeline
    from diffusers.utils import load_image
    
    pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
    )
    pipe = pipe.to("cuda")
    url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"
    
    init_image = load_image(url).convert("RGB")
    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt, image=init_image).images
    

When using `torch >= 2.0`, you can improve the inference speed by 20-30% with `torch.compile`. Simply wrap the UNet with `torch.compile` before running the pipeline:

    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    

If you are limited by GPU VRAM, you can enable _cpu offloading_ by calling `pipe.enable_model_cpu_offload` instead of `.to("cuda")`:

    - pipe.to("cuda")
    + pipe.enable_model_cpu_offload()
    

For more advanced use cases, please have a look at [the docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl).

[](#uses)Uses
-------------

### [](#direct-use)Direct Use

The model is intended for research purposes only. Possible research areas and tasks include

*   Generation of artworks and use in design and other artistic processes.
*   Applications in educational or creative tools.
*   Research on generative models.
*   Safe deployment of models which have the potential to generate harmful content.
*   Probing and understanding the limitations and biases of generative models.

Excluded uses are described below.

### [](#out-of-scope-use)Out-of-Scope Use

The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.

[](#limitations-and-bias)Limitations and Bias
---------------------------------------------

### [](#limitations)Limitations

*   The model does not achieve perfect photorealism
*   The model cannot render legible text
*   The model struggles with more difficult tasks which involve compositionality, such as rendering an image corresponding to "A red cube on top of a blue sphere"
*   Faces and people in general may not be generated properly.
*   The autoencoding part of the model is lossy.

### [](#bias)Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.

## Model Overview

The `stable-diffusion-xl-refiner-1.0` model is a diffusion-based text-to-image generative model developed by [Stability AI](https://aimodels.fyi/creators/huggingFace/stabilityai). It is part of the SDXL model family, which consists of an ensemble of experts pipeline for latent diffusion. The base model is used to generate initial latents, which are then further processed by a specialized refinement model to produce the final high-quality image.

The model can be used in two ways - either through a single-stage pipeline that uses the base and refiner models together, or a two-stage pipeline that first generates latents with the base model and then applies the refiner model. The two-stage approach is slightly slower but can produce even higher quality results.

Similar models in the SDXL family include the [sdxl-turbo](https://aimodels.fyi/models/huggingFace/sdxl-turbo-stabilityai) and [sdxl](https://aimodels.fyi/models/huggingFace/sdxl-asiryan) models, which offer different trade-offs in terms of speed, quality, and ease of use.

## Model Inputs and Outputs

### Inputs
- **Text prompt**: A natural language description of the desired image.

### Outputs
- **Image**: A high-quality generated image matching the provided text prompt.

## Capabilities

The `stable-diffusion-xl-refiner-1.0` model can generate photorealistic images from text prompts covering a wide range of subjects and styles. It excels at producing detailed, visually striking images that closely align with the provided description.

## What Can I Use It For?

The `stable-diffusion-xl-refiner-1.0` model is intended for both non-commercial and commercial usage. Possible applications include:

- **Research on generative models**: Studying the model's capabilities, limitations, and biases can provide valuable insights for the field of AI-generated content.
- **Creative and artistic processes**: The model can be used to generate unique and inspiring images for use in design, illustration, and other artistic endeavors.
- **Educational tools**: The model could be integrated into educational applications to foster creativity and visual learning.

For commercial use, please refer to the [Stability AI membership page](https://stability.ai/membership).

## Things to Try

One interesting aspect of the `stable-diffusion-xl-refiner-1.0` model is its ability to produce high-quality images through a two-stage process. Try experimenting with both the single-stage and two-stage pipelines to see how the results differ in terms of speed, quality, and other characteristics. You may find that the two-stage approach is better suited for certain types of prompts or use cases.

Additionally, explore how the model handles more complex or abstract prompts, such as those involving multiple objects, scenes, or concepts. The model's performance on these types of prompts can provide insights into its understanding of language and compositional reasoning.

[](#sd-xl-09-base-model-card)SD-XL 0.9-base Model Card
======================================================

[![row01](/stabilityai/stable-diffusion-xl-base-0.9/media/main/01.png)](/stabilityai/stable-diffusion-xl-base-0.9/blob/main/01.png)

[](#model)Model
---------------

[![pipeline](/stabilityai/stable-diffusion-xl-base-0.9/media/main/pipeline.png)](/stabilityai/stable-diffusion-xl-base-0.9/blob/main/pipeline.png)

SDXL consists of a two-step pipeline for latent diffusion: First, we use a base model to generate latents of the desired output size. In the second step, we use a specialized high-resolution model and apply a technique called SDEdit ([https://arxiv.org/abs/2108.01073](https://arxiv.org/abs/2108.01073), also known as "img2img") to the latents generated in the first step, using the same prompt.

### [](#model-description)Model Description

*   **Developed by:** Stability AI
*   **Model type:** Diffusion-based text-to-image generative model
*   **License:** [SDXL 0.9 Research License](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9/blob/main/LICENSE.md)
*   **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses two fixed, pretrained text encoders ([OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main)).
*   **Resources for more information:** [GitHub Repository](https://github.com/Stability-AI/generative-models) and the [SDXL paper on arXiv](https://arxiv.org/abs/2307.01952).

### [](#model-sources)Model Sources

*   **Repository:** [https://github.com/Stability-AI/generative-models](https://github.com/Stability-AI/generative-models)
*   **Demo:** [https://clipdrop.co/stable-diffusion](https://clipdrop.co/stable-diffusion)

### [](#-diffusers) Diffusers

Make sure to upgrade diffusers to >= 0.18.0:

    pip install diffusers --upgrade
    

In addition make sure to install `transformers`, `safetensors`, `accelerate` as well as the invisible watermark:

    pip install invisible_watermark transformers accelerate safetensors
    

You can then use the model as follows:

    from diffusers import DiffusionPipeline
    import torch
    
    pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
    pipe.to("cuda")
    
    # if using torch < 2.0
    # pipe.enable_xformers_memory_efficient_attention()
    
    prompt = "An astronaut riding a green horse"
    
    # .images is a list of PIL images; take the first (and only) one
    image = pipe(prompt=prompt).images[0]
    

When using `torch >= 2.0`, you can improve the inference speed by 20-30% with `torch.compile`. Simply wrap the UNet with `torch.compile` before running the pipeline:

    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    

If you are limited by GPU VRAM, you can enable _cpu offloading_ by calling `pipe.enable_model_cpu_offload` instead of `.to("cuda")`:

    - pipe.to("cuda")
    + pipe.enable_model_cpu_offload()
    

[](#uses)Uses
-------------

### [](#direct-use)Direct Use

The model is intended for research purposes only. Possible research areas and tasks include

*   Generation of artworks and use in design and other artistic processes.
*   Applications in educational or creative tools.
*   Research on generative models.
*   Safe deployment of models which have the potential to generate harmful content.
*   Probing and understanding the limitations and biases of generative models.

Excluded uses are described below.

### [](#out-of-scope-use)Out-of-Scope Use

The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.

[](#limitations-and-bias)Limitations and Bias
---------------------------------------------

### [](#limitations)Limitations

*   The model does not achieve perfect photorealism
*   The model cannot render legible text
*   The model struggles with more difficult tasks which involve compositionality, such as rendering an image corresponding to "A red cube on top of a blue sphere"
*   Faces and people in general may not be generated properly.
*   The autoencoding part of the model is lossy.

### [](#bias)Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.

[](#evaluation)Evaluation
-------------------------

[![comparison](/stabilityai/stable-diffusion-xl-base-0.9/media/main/comparison.png)](/stabilityai/stable-diffusion-xl-base-0.9/blob/main/comparison.png) The chart above evaluates user preference for SDXL (with and without refinement) over Stable Diffusion 1.5 and 2.1. The SDXL base model performs significantly better than the previous variants, and the model combined with the refinement module achieves the best overall performance.

## Model overview

The `stable-diffusion-xl-base-0.9` model is a text-to-image generative model developed by [Stability AI](https://aimodels.fyi/creators/huggingFace/stabilityai). It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses two fixed, pretrained text encoders ([OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main)). The model consists of a two-step pipeline for latent diffusion - first generating latents of the desired output size, then refining them using a specialized high-resolution model and a technique called SDEdit ([https://arxiv.org/abs/2108.01073](https://arxiv.org/abs/2108.01073)). This model builds upon the capabilities of previous Stable Diffusion models, improving image quality and prompt following.

## Model inputs and outputs

### Inputs
- **Prompt**: A text description of the desired image to generate.

### Outputs
- **Image**: An image generated from the input prompt, 1024x1024 pixels by default.

## Capabilities

The `stable-diffusion-xl-base-0.9` model can generate a wide variety of images based on text prompts, from realistic scenes to fantastical creations. It performs significantly better than previous Stable Diffusion models in terms of image quality and prompt following, as demonstrated by user preference evaluations. The model can be particularly useful for tasks like artwork generation, creative design, and educational applications.

## What can I use it for?

The `stable-diffusion-xl-base-0.9` model is intended for research purposes, such as generation of artworks, applications in educational or creative tools, research on generative models, and probing the limitations and biases of the model. While the model is not suitable for generating factual or true representations of people or events, it can be a powerful tool for artistic expression and exploration. For commercial use, please refer to Stability AI's [membership options](https://stability.ai/membership).

## Things to try

One interesting aspect of the `stable-diffusion-xl-base-0.9` model is its ability to generate high-quality images using a two-step pipeline. Try experimenting with different combinations of the base model and refinement model to see how the results vary in terms of image quality, detail, and prompt following. You can also explore the model's capabilities in generating specific types of imagery, such as surreal or fantastical scenes, and see how it handles more complex prompts involving compositional elements.

[](#improved-autoencoders)Improved Autoencoders
===============================================

[](#utilizing)Utilizing
-----------------------

These weights are intended to be used with the original [CompVis Stable Diffusion codebase](https://github.com/CompVis/stable-diffusion). If you are looking for the model to use with the diffusers library, [come here](https://huggingface.co/CompVis/stabilityai/sd-vae-ft-ema).

[](#decoder-finetuning)Decoder Finetuning
-----------------------------------------

We publish two kl-f8 autoencoder versions, finetuned from the original [kl-f8 autoencoder](https://github.com/CompVis/latent-diffusion#pretrained-autoencoding-models) on a 1:1 ratio of [LAION-Aesthetics](https://laion.ai/blog/laion-aesthetics/) and LAION-Humans, an unreleased subset containing only SFW images of humans. The intent was to fine-tune on the Stable Diffusion training set (the autoencoder was originally trained on OpenImages) but also enrich the dataset with images of humans to improve the reconstruction of faces. The first, _ft-EMA_, was resumed from the original checkpoint, trained for 313198 steps and uses EMA weights. It uses the same loss configuration as the original checkpoint (L1 + LPIPS). The second, _ft-MSE_, was resumed from _ft-EMA_, uses EMA weights, and was trained for another 280k steps using a different loss with more emphasis on MSE reconstruction (MSE + 0.1 \* LPIPS). It produces somewhat "smoother" outputs. The batch size for both versions was 192 (16 A100s, batch size 12 per GPU). To keep compatibility with existing models, only the decoder part was finetuned; the checkpoints can be used as a drop-in replacement for the existing autoencoder.
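
The step counts and batch size quoted above can be cross-checked against the "train steps" values reported in the evaluation tables. The sketch below is plain arithmetic on numbers from this card; the variable names are illustrative.

```python
# Step counts taken from this card.
ORIGINAL_STEPS = 246_803   # "train steps" listed for the original kl-f8 checkpoint
FT_EMA_EXTRA = 313_198     # additional steps for ft-EMA
FT_MSE_EXTRA = 280_000     # "another 280k steps" for ft-MSE

ft_ema_total = ORIGINAL_STEPS + FT_EMA_EXTRA
ft_mse_total = ft_ema_total + FT_MSE_EXTRA

# Global batch size: 16 GPUs with 12 samples each.
NUM_GPUS = 16
PER_GPU_BATCH = 12
global_batch = NUM_GPUS * PER_GPU_BATCH

# Rough number of training samples processed during each finetuning run.
ft_ema_samples = FT_EMA_EXTRA * global_batch
ft_mse_samples = FT_MSE_EXTRA * global_batch

print(ft_ema_total, ft_mse_total, global_batch)
```

The totals agree with the 560001 and 840001 step counts listed for ft-EMA and ft-MSE, and the global batch matches the stated 192.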

_Original kl-f8 VAE vs f8-ft-EMA vs f8-ft-MSE_

[](#evaluation)Evaluation
-------------------------

### [](#coco-2017-256x256-val-5000-images)COCO 2017 (256x256, val, 5000 images)

| Model | train steps | rFID | PSNR | SSIM | PSIM | Link | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| original | 246803 | 4.99 | 23.4 +/- 3.8 | 0.69 +/- 0.14 | 1.01 +/- 0.28 | [https://ommer-lab.com/files/latent-diffusion/kl-f8.zip](https://ommer-lab.com/files/latent-diffusion/kl-f8.zip) | as used in SD |
| ft-EMA | 560001 | 4.42 | 23.8 +/- 3.9 | 0.69 +/- 0.13 | 0.96 +/- 0.27 | [https://huggingface.co/stabilityai/sd-vae-ft-ema-original/resolve/main/vae-ft-ema-560000-ema-pruned.ckpt](https://huggingface.co/stabilityai/sd-vae-ft-ema-original/resolve/main/vae-ft-ema-560000-ema-pruned.ckpt) | slightly better overall, with EMA |
| ft-MSE | 840001 | 4.70 | 24.5 +/- 3.7 | 0.71 +/- 0.13 | 0.92 +/- 0.27 | [https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt](https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt) | resumed with EMA from ft-EMA, emphasis on MSE (rec. loss = MSE + 0.1 \* LPIPS), smoother outputs |
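
The two reconstruction-loss configurations named in this card (L1 + LPIPS for ft-EMA, MSE + 0.1 \* LPIPS for ft-MSE) differ only in the pixel term and the perceptual weight. The sketch below shows that difference on flat lists of pixel values. It is not the training code: LPIPS needs a pretrained network, so its value is passed in as a precomputed number, and any other terms of the full training objective are left out.

```python
def l1(pred, target):
    """Mean absolute error over all pixels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse(pred, target):
    """Mean squared error over all pixels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def rec_loss_ft_ema(pred, target, lpips_value):
    """Reconstruction loss as described for the original and ft-EMA: L1 + LPIPS."""
    return l1(pred, target) + lpips_value

def rec_loss_ft_mse(pred, target, lpips_value):
    """Reconstruction loss as described for ft-MSE: MSE + 0.1 * LPIPS."""
    return mse(pred, target) + 0.1 * lpips_value

# Toy inputs (assumed): three pixels and a made-up perceptual distance.
pred = [0.0, 0.5, 1.0]
target = [0.0, 1.0, 1.0]
lpips_value = 0.2

loss_ema = rec_loss_ft_ema(pred, target, lpips_value)
loss_mse = rec_loss_ft_mse(pred, target, lpips_value)
```

Down-weighting the perceptual term and squaring the pixel error puts more of the gradient on per-pixel accuracy, which is consistent with the "smoother outputs" comment for ft-MSE.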

### [](#laion-aesthetics-5-256x256-subset-10000-images)LAION-Aesthetics 5+ (256x256, subset, 10000 images)

| Model | train steps | rFID | PSNR | SSIM | PSIM | Link | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| original | 246803 | 2.61 | 26.0 +/- 4.4 | 0.81 +/- 0.12 | 0.75 +/- 0.36 | [https://ommer-lab.com/files/latent-diffusion/kl-f8.zip](https://ommer-lab.com/files/latent-diffusion/kl-f8.zip) | as used in SD |
| ft-EMA | 560001 | 1.77 | 26.7 +/- 4.8 | 0.82 +/- 0.12 | 0.67 +/- 0.34 | [https://huggingface.co/stabilityai/sd-vae-ft-ema-original/resolve/main/vae-ft-ema-560000-ema-pruned.ckpt](https://huggingface.co/stabilityai/sd-vae-ft-ema-original/resolve/main/vae-ft-ema-560000-ema-pruned.ckpt) | slightly better overall, with EMA |
| ft-MSE | 840001 | 1.88 | 27.3 +/- 4.7 | 0.83 +/- 0.11 | 0.65 +/- 0.34 | [https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt](https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt) | resumed with EMA from ft-EMA, emphasis on MSE (rec. loss = MSE + 0.1 \* LPIPS), smoother outputs |
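
PSNR in the tables above is a direct function of mean squared error, which helps explain why the MSE-weighted ft-MSE checkpoint scores highest on it. A minimal sketch of the standard definition follows; the pixel range (`max_value=1.0`) and the per-image averaging used for the reported numbers are not stated in the card, so both are assumptions here.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    n = len(reference)
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / n
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)

# Toy example (assumed values): a constant error of 0.1 on every pixel.
reference = [0.0, 0.0, 0.0, 0.0]
reconstruction = [0.1, 0.1, 0.1, 0.1]
value_db = psnr(reference, reconstruction)
```

Because PSNR rises as MSE falls, a decoder trained with more weight on MSE is expected to gain PSNR even when a perceptual metric such as rFID moves the other way, as in the ft-EMA versus ft-MSE rows.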

### [](#visual)Visual

_Visualization of reconstructions on 256x256 images from the COCO2017 validation dataset._

  
**256x256: ft-EMA (left), ft-MSE (middle), original (right)**

![](https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00025_merged.png)

![](https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00011_merged.png)

![](https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00037_merged.png)

![](https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00043_merged.png)

![](https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00053_merged.png)

![](https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00029_merged.png)

## Model overview

The `sd-vae-ft-mse-original` model is an improved autoencoder developed by the Stability AI team. It is a fine-tuned version of the original [kl-f8 autoencoder](https://github.com/CompVis/latent-diffusion#pretrained-autoencoding-models) used in the Stable Diffusion model. The team fine-tuned the decoder on a 1:1 ratio of [LAION-Aesthetics](https://laion.ai/blog/laion-aesthetics/) and LAION-Humans datasets to improve the reconstruction of faces. Two versions were released - `ft-EMA` which uses exponential moving average (EMA) weights, and `ft-MSE` which emphasizes mean squared error (MSE) reconstruction over the original L1 and LPIPS loss.

The `sd-vae-ft-mse-original` model shows improvements over the original kl-f8 autoencoder in terms of PSNR, SSIM, and PSIM metrics on the COCO 2017 and LAION-Aesthetics datasets. The `ft-MSE` version in particular produces "smoother" outputs compared to the original.

## Model inputs and outputs

### Inputs
- Images of various sizes (originally trained on 256x256 but can handle higher resolutions)

### Outputs
- Reconstructed images from the model's latent representation
- Evaluation metrics like rFID, PSNR, SSIM, and PSIM to assess reconstruction quality

## Capabilities

The `sd-vae-ft-mse-original` model is an improved autoencoder that can be used as a drop-in replacement for the original kl-f8 autoencoder used in Stable Diffusion. It shows better performance on reconstruction tasks, especially for faces and human subjects, due to the fine-tuning on the LAION-Humans dataset.

## What can I use it for?

The `sd-vae-ft-mse-original` model can be used in the original [CompVis Stable Diffusion codebase](https://github.com/CompVis/stable-diffusion) as a replacement for the autoencoder. This can potentially improve the quality and realism of generated images, especially those involving human subjects. 

## Things to try

Researchers and developers can experiment with the different fine-tuned versions of the autoencoder (`ft-EMA` and `ft-MSE`) to see how they impact the performance and output quality of the Stable Diffusion model. The smoother outputs of the `ft-MSE` version may be beneficial for certain use cases.

[](#stable-cascade)Stable Cascade
=================================

![](/stabilityai/stable-cascade/resolve/main/figures/collage_1.jpg)

This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture, and its main difference from other models like Stable Diffusion is that it works in a much smaller latent space. Why is this important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes. How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a 1024x1024 image to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in the highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable Diffusion 1.5.
  
Therefore, this kind of model is well suited for usages where efficiency is important. Furthermore, all known extensions like finetuning, LoRA, ControlNet, IP-Adapter, LCM etc. are possible with this method as well.
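
The latent sizes quoted above follow from dividing the image side by the compression factor. The sketch below reproduces that arithmetic; rounding down is an assumption chosen to match the 24x24 figure, and the actual pipeline may round differently for other resolutions.

```python
def latent_side(image_side, compression_factor):
    """Side length of the latent grid for a square image (rounded down, by assumption)."""
    return int(image_side // compression_factor)

sd_latent = latent_side(1024, 8)        # Stable Diffusion, factor 8
cascade_latent = latent_side(1024, 42)  # Stable Cascade, factor 42

# Ratio of spatial positions the text-conditional model has to process.
position_ratio = (sd_latent ** 2) / (cascade_latent ** 2)
print(sd_latent, cascade_latent, round(position_ratio, 1))
```

The number of latent positions shrinks roughly with the square of the compression factor, which is where the inference and training savings come from.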

[](#model-details)Model Details
-------------------------------

### [](#model-description)Model Description

Stable Cascade is a diffusion model trained to generate images given a text prompt.

*   **Developed by:** Stability AI
*   **Funded by:** Stability AI
*   **Model type:** Generative text-to-image model

### [](#model-sources)Model Sources

For research purposes, we recommend our `StableCascade` Github repository ([https://github.com/Stability-AI/StableCascade](https://github.com/Stability-AI/StableCascade)).

*   **Repository:** [https://github.com/Stability-AI/StableCascade](https://github.com/Stability-AI/StableCascade)
*   **Paper:** [https://openreview.net/forum?id=gU58d5QeGv](https://openreview.net/forum?id=gU58d5QeGv)

### [](#model-overview)Model Overview

Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images, hence the name "Stable Cascade". Stage A & B are used to compress images, similar to what the job of the VAE is in Stable Diffusion. However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a spatial compression factor of 8, encoding an image with resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually.

![](/stabilityai/stable-cascade/resolve/main/figures/model-overview.jpg)

For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes with a 1 billion and 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was put into its finetuning. The two versions for Stage B amount to 700 million and 1.5 billion parameters. Both achieve great results, however the 1.5 billion excels at reconstructing small and fine details. Therefore, you will achieve the best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to its small size.
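
To compare the checkpoint combinations described above, the parameter counts can be summed and converted into an approximate weight footprint. This is a back-of-the-envelope sketch: the `large`/`lite` labels are shorthand introduced here, and the estimate covers the three stages' weights only (no text encoder, activations, or framework overhead).

```python
# Parameter counts as stated in this card.
PARAMS = {
    "stage_c": {"large": 3_600_000_000, "lite": 1_000_000_000},
    "stage_b": {"large": 1_500_000_000, "lite": 700_000_000},
    "stage_a": {"only": 20_000_000},
}

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def total_params(stage_c="large", stage_b="large"):
    """Total parameters across Stage A, B and C for a chosen combination."""
    return (
        PARAMS["stage_c"][stage_c]
        + PARAMS["stage_b"][stage_b]
        + PARAMS["stage_a"]["only"]
    )

def weight_gib(n_params, dtype="bfloat16"):
    """Approximate size of the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2 ** 30

for c in ("large", "lite"):
    for b in ("large", "lite"):
        n = total_params(c, b)
        print(f"stage_c={c:5s} stage_b={b:5s} params={n:,} bf16={weight_gib(n):.2f} GiB")
```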

[](#evaluation)Evaluation
-------------------------

![](/stabilityai/stable-cascade/resolve/main/figures/comparison.png) According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all comparisons. The above picture shows the results from a human evaluation using a mix of parti-prompts and aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).

[](#code-example)Code Example
-----------------------------

**Note:** In order to use the `torch.bfloat16` data type with the `StableCascadeDecoderPipeline` you need to have PyTorch 2.2.0 or higher installed. This also means that using the `StableCascadeCombinedPipeline` with `torch.bfloat16` requires PyTorch 2.2.0 or higher, since it calls the `StableCascadeDecoderPipeline` internally.

If it is not possible to install PyTorch 2.2.0 or higher in your environment, the `StableCascadeDecoderPipeline` can be used on its own with the `torch.float16` data type. You can download the full precision or bf16 variant weights for the pipeline and cast the weights to `torch.float16`.
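
The version rule in the note above can be turned into a small helper that picks the decoder data type from a PyTorch version string. This is a sketch with a hypothetical function name; it takes the version as a plain string so it runs without PyTorch installed, and it only inspects the major and minor components.

```python
def decoder_dtype_name(torch_version):
    """Return 'bfloat16' for PyTorch >= 2.2, otherwise 'float16'.

    Accepts strings such as '2.1.2+cu118' or '2.2.0'; the local build
    suffix after '+' is ignored.
    """
    release = torch_version.split("+")[0]
    major, minor = (int(part) for part in release.split(".")[:2])
    return "bfloat16" if (major, minor) >= (2, 2) else "float16"

# With PyTorch installed, the name can be mapped to the real dtype object:
#   import torch
#   dtype = getattr(torch, decoder_dtype_name(torch.__version__))
for example in ("1.13.1", "2.1.2+cu118", "2.2.0"):
    print(example, decoder_dtype_name(example))
```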

    pip install diffusers
    

    import torch
    from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline
    
    prompt = "an image of a shiba inu, donning a spacesuit and helmet"
    negative_prompt = ""
    
    prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16)
    decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16)
    
    prior.enable_model_cpu_offload()
    prior_output = prior(
        prompt=prompt,
        height=1024,
        width=1024,
        negative_prompt=negative_prompt,
        guidance_scale=4.0,
        num_images_per_prompt=1,
        num_inference_steps=20
    )
    
    decoder.enable_model_cpu_offload()
    decoder_output = decoder(
        image_embeddings=prior_output.image_embeddings.to(torch.float16),
        prompt=prompt,
        negative_prompt=negative_prompt,
        guidance_scale=0.0,
        output_type="pil",
        num_inference_steps=10
    ).images[0]
    decoder_output.save("cascade.png")
    

### [](#using-the-lite-version-of-the-stage-b-and-stage-c-models)Using the Lite Version of the Stage B and Stage C models

    import torch
    from diffusers import (
        StableCascadeDecoderPipeline,
        StableCascadePriorPipeline,
        StableCascadeUNet,
    )
    
    prompt = "an image of a shiba inu, donning a spacesuit and helmet"
    negative_prompt = ""
    
    prior_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior_lite")
    decoder_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder_lite")
    
    prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet)
    decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet)
    
    prior.enable_model_cpu_offload()
    prior_output = prior(
        prompt=prompt,
        height=1024,
        width=1024,
        negative_prompt=negative_prompt,
        guidance_scale=4.0,
        num_images_per_prompt=1,
        num_inference_steps=20
    )
    
    decoder.enable_model_cpu_offload()
    decoder_output = decoder(
        image_embeddings=prior_output.image_embeddings,
        prompt=prompt,
        negative_prompt=negative_prompt,
        guidance_scale=0.0,
        output_type="pil",
        num_inference_steps=10
    ).images[0]
    decoder_output.save("cascade.png")
    

### [](#loading-original-checkpoints-with-from_single_file)Loading original checkpoints with `from_single_file`

Loading original-format checkpoints is supported via the `from_single_file` method of `StableCascadeUNet`.

    import torch
    from diffusers import (
        StableCascadeDecoderPipeline,
        StableCascadePriorPipeline,
        StableCascadeUNet,
    )
    
    prompt = "an image of a shiba inu, donning a spacesuit and helmet"
    negative_prompt = ""
    
    prior_unet = StableCascadeUNet.from_single_file(
        "https://huggingface.co/stabilityai/stable-cascade/resolve/main/stage_c_bf16.safetensors",
        torch_dtype=torch.bfloat16
    )
    decoder_unet = StableCascadeUNet.from_single_file(
        "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors",
        torch_dtype=torch.bfloat16
    )
    
    prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16)
    decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet, torch_dtype=torch.bfloat16)
    
    prior.enable_model_cpu_offload()
    prior_output = prior(
        prompt=prompt,
        height=1024,
        width=1024,
        negative_prompt=negative_prompt,
        guidance_scale=4.0,
        num_images_per_prompt=1,
        num_inference_steps=20
    )
    
    decoder.enable_model_cpu_offload()
    decoder_output = decoder(
        image_embeddings=prior_output.image_embeddings,
        prompt=prompt,
        negative_prompt=negative_prompt,
        guidance_scale=0.0,
        output_type="pil",
        num_inference_steps=10
    ).images[0]
    decoder_output.save("cascade-single-file.png")
    

### [](#using-the-stablecascadecombinedpipeline)Using the `StableCascadeCombinedPipeline`

    import torch
    from diffusers import StableCascadeCombinedPipeline
    
    pipe = StableCascadeCombinedPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.bfloat16)
    
    prompt = "an image of a shiba inu, donning a spacesuit and helmet"
    pipe(
        prompt=prompt,
        negative_prompt="",
        num_inference_steps=10,
        prior_num_inference_steps=20,
        prior_guidance_scale=3.0,
        width=1024,
        height=1024,
    ).images[0].save("cascade-combined.png")
    

[](#uses)Uses
-------------

### [](#direct-use)Direct Use

The model is intended for research purposes for now. Possible research areas and tasks include:

*   Research on generative models.
*   Safe deployment of models which have the potential to generate harmful content.
*   Probing and understanding the limitations and biases of generative models.
*   Generation of artworks and use in design and other artistic processes.
*   Applications in educational or creative tools.

Excluded uses are described below.

### [](#out-of-scope-use)Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities. The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).

[](#limitations-and-bias)Limitations and Bias
---------------------------------------------

### [](#limitations)Limitations

*   Faces and people in general may not be generated properly.
*   The autoencoding part of the model is lossy.

### [](#recommendations)Recommendations

The model is intended for research purposes only.

[](#how-to-get-started-with-the-model)How to Get Started with the Model
-----------------------------------------------------------------------

Check out [https://github.com/Stability-AI/StableCascade](https://github.com/Stability-AI/StableCascade)

## Model overview

`Stable Cascade` is a diffusion model developed by Stability AI that generates images from text prompts. It is built on the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture and achieves a significantly higher compression factor than Stable Diffusion: where Stable Diffusion encodes a 1024x1024 image to 128x128, Stable Cascade encodes it to just 24x24 while maintaining crisp reconstructions. This allows for faster inference and cheaper training, making it well-suited for use cases where efficiency is important. The model consists of three stages, Stage A, Stage B and Stage C: Stage C (the prior) generates the small compressed latents from the text prompt, and Stages B and A (the decoder) turn those latents into the final image.
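
The compression factors implied by these sizes follow from dividing the image side length by the latent side length. A minimal sketch of that arithmetic (the helper name is illustrative only):

```python
def compression_factor(image_side: int, latent_side: int) -> float:
    """Spatial compression factor along one side of a square image."""
    return image_side / latent_side


# Sizes quoted in the overview above.
stable_diffusion = compression_factor(1024, 128)
stable_cascade = compression_factor(1024, 24)

print(f"Stable Diffusion: {stable_diffusion:.2f}x")  # 8.00x
print(f"Stable Cascade:   {stable_cascade:.2f}x")    # 42.67x
```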

## Model inputs and outputs

Stable Cascade is a generative text-to-image model. It takes a text prompt as input and generates a corresponding image as output.

### Inputs
- Text prompt describing the desired image

### Outputs
- An image generated based on the input text prompt

## Capabilities

Stable Cascade is capable of generating high-quality images from text prompts in a highly compressed latent space, allowing for faster and more cost-effective model inference compared to other text-to-image models like Stable Diffusion. The model is well-suited for use cases where efficiency is important, and can also be fine-tuned or extended using techniques like LoRA, ControlNet, and IP-Adapter.

## What can I use it for?

The `Stable Cascade` model can be used for a variety of applications where generating images from text prompts is useful, such as:

- Creative art and design projects
- Prototyping and visualization
- Educational and research purposes
- Development of real-time generative applications

Due to its efficient architecture, the model is particularly well-suited for use cases where processing speed and cost are important factors, such as in mobile or edge computing applications.

## Things to try

One interesting aspect of the `Stable Cascade` model is its highly compressed latent space representation. You could experiment with this by running the prior and the decoder separately, inspecting the small latent representation that the prior produces, and seeing how image quality and fidelity to the prompt change as you vary the settings of each stage. Additionally, you could explore how the model's performance and capabilities change when fine-tuned or extended using techniques like LoRA, ControlNet, and IP-Adapter, as the maintainers suggest these extensions are possible with the Stable Cascade architecture.

[](#stable-beluga-2)Stable Beluga 2
===================================

Use [Stable Chat (Research Preview)](https://chat.stability.ai/chat) to test Stability AI's best language models for free

[](#model-description)Model Description
---------------------------------------

`Stable Beluga 2` is a Llama2 70B model fine-tuned on an Orca-style dataset.

[](#usage)Usage
---------------

Start chatting with `Stable Beluga 2` using the following code snippet:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # Load the tokenizer and the model in half precision; device_map="auto" places the weights on the available devices.
    tokenizer = AutoTokenizer.from_pretrained("stabilityai/StableBeluga2", use_fast=False)
    model = AutoModelForCausalLM.from_pretrained("stabilityai/StableBeluga2", torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")
    system_prompt = "### System:\nYou are Stable Beluga, an AI that follows instructions extremely well. Help as much as you can. Remember, be safe, and don't do anything illegal.\n\n"
    
    # Build the full prompt from the system prompt and the user message.
    message = "Write me a poem please"
    prompt = f"{system_prompt}### User: {message}\n\n### Assistant:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Sample up to 256 new tokens with nucleus sampling (top_p=0.95); top_k=0 disables top-k filtering.
    output = model.generate(**inputs, do_sample=True, top_p=0.95, top_k=0, max_new_tokens=256)
    
    print(tokenizer.decode(output[0], skip_special_tokens=True))
    

Stable Beluga 2 should be used with this prompt format:

    ### System:
    This is a system prompt, please behave and help the user.
    
    ### User:
    Your prompt here
    
    ### Assistant:
    The output of Stable Beluga 2
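
The format above can be assembled programmatically. The helper below is a hypothetical sketch (its name and signature are not part of any library) that follows the layout shown: each header on its own line, a blank line between sections, and a trailing `### Assistant:` header for the model to continue from.

```python
def build_beluga_prompt(system_prompt: str, user_message: str) -> str:
    """Assemble a prompt in the System / User / Assistant layout."""
    return (
        f"### System:\n{system_prompt}\n\n"
        f"### User:\n{user_message}\n\n"
        "### Assistant:\n"
    )


prompt = build_beluga_prompt(
    "This is a system prompt, please behave and help the user.",
    "Your prompt here",
)
print(prompt)
```

Note that the usage snippet earlier places the message on the same line as `### User:`; this sketch follows the documented layout instead.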
    

[](#other-beluga-models)Other Beluga Models
-------------------------------------------

[StableBeluga 1 - Delta](https://huggingface.co/stabilityai/StableBeluga1-Delta)  
[StableBeluga 13B](https://huggingface.co/stabilityai/StableBeluga-13B)  
[StableBeluga 7B](https://huggingface.co/stabilityai/StableBeluga-7B)

[](#model-details)Model Details
-------------------------------

*   **Developed by**: [Stability AI](https://stability.ai/)
*   **Model type**: Stable Beluga 2 is an auto-regressive language model fine-tuned on Llama2 70B.
*   **Language(s)**: English
*   **Library**: [HuggingFace Transformers](https://github.com/huggingface/transformers)
*   **License**: Fine-tuned checkpoints (`Stable Beluga 2`) are licensed under the [STABLE BELUGA NON-COMMERCIAL COMMUNITY LICENSE AGREEMENT](https://huggingface.co/stabilityai/StableBeluga2/blob/main/LICENSE.txt)
*   **Contact**: For questions and comments about the model, please email `lm@stability.ai`

### [](#training-dataset)Training Dataset

`Stable Beluga 2` is trained on our internal Orca-style dataset.

### [](#training-procedure)Training Procedure

Models are learned via supervised fine-tuning on the aforementioned datasets, trained in mixed-precision (BF16), and optimized with AdamW. We outline the following hyperparameters:

| Dataset | Batch Size | Learning Rate | Learning Rate Decay | Warm-up | Weight Decay | Betas |
| --- | --- | --- | --- | --- | --- | --- |
| Orca pt1 packed | 256 | 3e-5 | Cosine to 3e-6 | 100 | 1e-6 | (0.9, 0.95) |
| Orca pt2 unpacked | 512 | 3e-5 | Cosine to 3e-6 | 100 | 1e-6 | (0.9, 0.95) |
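
To make the learning-rate hyperparameters concrete, the sketch below computes a learning rate with warm-up followed by cosine decay from 3e-5 to 3e-6. Several details are assumptions not stated in the source: that the warm-up value of 100 counts optimizer steps, that the warm-up is linear, and the total step count (`total_steps=1000` is purely illustrative).

```python
import math


def learning_rate(step: int, total_steps: int, peak_lr: float = 3e-5,
                  final_lr: float = 3e-6, warmup_steps: int = 100) -> float:
    """Linear warm-up to peak_lr, then cosine decay to final_lr.

    Assumed shape; the source only gives the peak, the floor and the
    warm-up length.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


# Print the rate at a few points of an illustrative 1000-step run.
for step in (0, 99, 100, 550, 1000):
    print(step, f"{learning_rate(step, total_steps=1000):.2e}")
```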

[](#ethical-considerations-and-limitations)Ethical Considerations and Limitations
---------------------------------------------------------------------------------

Beluga is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Beluga's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Beluga, developers should perform safety testing and tuning tailored to their specific applications of the model.

[](#how-to-cite)How to cite
---------------------------

    @misc{StableBelugaModels, 
          url={https://huggingface.co/stabilityai/StableBeluga2}, 
          title={Stable Beluga models}, 
          author={Mahan, Dakota and Carlow, Ryan and Castricato, Louis and Cooper, Nathan and Laforte, Christian}
    }
    

[](#citations)Citations
-----------------------

    @misc{touvron2023llama,
          title={Llama 2: Open Foundation and Fine-Tuned Chat Models}, 
          author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom},
          year={2023},
          eprint={2307.09288},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
    

    @misc{mukherjee2023orca,
          title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4}, 
          author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah},
          year={2023},
          eprint={2306.02707},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }

## Model overview

`Stable Beluga 2` is a Llama2 70B model finetuned by [Stability AI](https://stability.ai/) on an Orca-style dataset. It is part of a family of Beluga models, with other variants including [StableBeluga 1 - Delta](https://huggingface.co/stabilityai/StableBeluga1-Delta), [StableBeluga 13B](https://huggingface.co/stabilityai/StableBeluga-13B), and [StableBeluga 7B](https://huggingface.co/stabilityai/StableBeluga-7B). These models are designed to be highly capable language models that follow instructions well and provide helpful, safe, and unbiased assistance.

## Model inputs and outputs

`Stable Beluga 2` is an autoregressive language model that takes text as input and generates text as output. It can be used for a variety of natural language processing tasks, such as text generation, summarization, and question answering.

### Inputs
- Text prompts

### Outputs
- Generated text
- Responses to questions or instructions

## Capabilities

`Stable Beluga 2` is a highly capable language model that can engage in open-ended dialogue, answer questions, and assist with a variety of tasks. It has been trained to follow instructions carefully and provide helpful, safe, and unbiased responses. The model performs well on benchmarks for commonsense reasoning, world knowledge, and other important language understanding capabilities.

## What can I use it for?

`Stable Beluga 2` can be used for a variety of applications, such as:

- Building conversational AI assistants
- Generating creative writing or content
- Answering questions and providing information
- Summarizing text
- Providing helpful instructions and advice

The model's strong performance on safety and helpfulness benchmarks makes it well-suited for use cases that require a reliable and trustworthy AI assistant.

## Things to try

Some interesting things to try with `Stable Beluga 2` include:

- Engaging the model in open-ended dialogue to see the breadth of its conversational abilities
- Asking it to provide step-by-step instructions for completing a task
- Prompting it to generate creative stories or poems
- Evaluating its performance on specific language understanding benchmarks or tasks

The model's flexibility and focus on safety and helpfulness make it a compelling choice for a wide range of natural language processing applications.