![](https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/i-DYpDHw8Pwiy7QBKZVR5.jpeg)

Würstchen - Overview
---------------------------------------------

Würstchen is a diffusion model whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32 images. Other works typically use a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme: through its novel design, we achieve a 42x spatial compression. This was previously unseen, because common methods fail to faithfully reconstruct detailed images beyond 16x spatial compression. Würstchen employs a two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)). A third model, Stage C, is learned in that highly compressed latent space. This training requires a fraction of the compute used for current top-performing models, and also allows cheaper and faster inference.
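
To make the compression figures above concrete, the following minimal sketch compares the latent grid size of a 1024x1024 image at 8x and at 42x spatial compression. The helper `latent_side` and the choice to round up are illustrative assumptions, not the exact rule used by the model.

```python
import math


def latent_side(image_side: int, compression: float) -> int:
    """Side length of the latent grid for a given spatial compression factor.

    Rounding up is an assumption made for illustration only.
    """
    return math.ceil(image_side / compression)


for factor in (8, 42):
    side = latent_side(1024, factor)
    print(f"{factor}x compression: {side}x{side} latent grid, {side * side} positions")
```

Under these assumptions, the 42x latent grid has roughly 26 times fewer spatial positions than the 8x grid, which is where the savings in training and inference come from.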

Würstchen - Decoder
-------------------------------------------

The Decoder is what we refer to as "Stage A" and "Stage B". The decoder takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image, and decodes those latents back into pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN latent space, and Stage A (which is a VQGAN) then decodes those latents into pixel space. Together, they achieve a 42x spatial compression.

**Note:** The reconstruction is lossy, so some image information is lost. The current Stage B often lacks detail in its reconstructions, which is especially noticeable to humans in faces, hands, etc. We are working on making these reconstructions even better in the future!
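
The two-step flow described above (the Prior produces image embeddings, the Decoder turns them into pixels) can be sketched with the separate prior and decoder pipelines in `diffusers`. This is a minimal sketch rather than the authors' reference code: it assumes the checkpoints are published as `warp-ai/wuerstchen-prior` and `warp-ai/wuerstchen`, that a CUDA GPU is available, and the function name `generate` is hypothetical.

```python
def generate(caption: str, height: int = 1024, width: int = 1024):
    """Run the Prior (Stage C), then the Decoder (Stages B and A)."""
    # Imported lazily so that defining this function needs neither a GPU nor diffusers.
    import torch
    from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline

    device = "cuda"
    dtype = torch.float16

    # Repository names are assumptions; adjust them to the checkpoints you use.
    prior = WuerstchenPriorPipeline.from_pretrained(
        "warp-ai/wuerstchen-prior", torch_dtype=dtype
    ).to(device)
    decoder = WuerstchenDecoderPipeline.from_pretrained(
        "warp-ai/wuerstchen", torch_dtype=dtype
    ).to(device)

    # Stage C: text prompt -> highly compressed image embeddings.
    prior_output = prior(
        prompt=caption, height=height, width=width, guidance_scale=4.0
    )

    # Stages B and A: image embeddings -> VQGAN latents -> pixels.
    images = decoder(
        image_embeddings=prior_output.image_embeddings,
        prompt=caption,
        guidance_scale=0.0,
        output_type="pil",
    ).images
    return images
```

Calling `generate("Anthropomorphic cat dressed as a fire fighter")` would return a list of PIL images. Keeping the two stages separate makes it possible to inspect or reuse the image embeddings before decoding.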

### Image Sizes

Würstchen was trained on image resolutions between 1024x1024 and 1536x1536. We sometimes also observe good outputs at resolutions like 1024x2048, so feel free to try it out. We also observed that the Prior (Stage C) adapts extremely fast to new resolutions, so finetuning it at 2048x2048 should be computationally cheap.

![](https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/5pA5KUfGmvsObqiIjdGY1.jpeg)

How to run
-------------------------

This pipeline should be run together with the prior, [warp-ai/wuerstchen-prior](https://huggingface.co/warp-ai/wuerstchen-prior):

    import torch
    from diffusers import AutoPipelineForText2Image

    device = "cuda"
    dtype = torch.float16

    # Load the text-to-image pipeline in half precision and move it to the GPU.
    pipeline = AutoPipelineForText2Image.from_pretrained(
        "warp-ai/wuerstchen", torch_dtype=dtype
    ).to(device)

    caption = "Anthropomorphic cat dressed as a fire fighter"

    # `.images` holds the generated images, one per requested sample.
    output = pipeline(
        prompt=caption,
        height=1024,
        width=1024,
        prior_guidance_scale=4.0,
        decoder_guidance_scale=0.0,
    ).images

### Image Sampling Times

The figure below shows inference times (on an A100) for different batch sizes (`num_images_per_prompt`) for Würstchen compared to [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) (without refiner). The left panel shows inference times using torch > 2.0, whereas the right panel applies `torch.compile` to both pipelines in advance.

[![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/UPhsIH2f079ZuTA_sLdVe.jpeg)](https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/UPhsIH2f079ZuTA_sLdVe.jpeg)
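
Measurements of this kind can be approximated with a small timing helper. The sketch below is an assumption about how one might benchmark a pipeline call, not the script used for the figure; `time_call` and its parameters are hypothetical names. Warm-up runs matter in particular with `torch.compile`, because the first calls include compilation time.

```python
import time
from statistics import mean
from typing import Callable, Optional


def time_call(
    fn: Callable[[], object],
    warmup: int = 1,
    repeats: int = 3,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the mean wall-clock time in seconds of `fn` over `repeats` runs.

    `sync` is called before reading the clock; on a GPU, pass
    `torch.cuda.synchronize` so that queued kernels are included.
    """
    for _ in range(warmup):
        fn()  # Warm-up runs are not timed.
    if sync is not None:
        sync()

    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

For example, `time_call(lambda: pipeline(prompt=caption, num_images_per_prompt=4), sync=torch.cuda.synchronize)` would time a batch of four images, assuming `pipeline` and `caption` are defined as in the "How to run" snippet.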

Model Details
-------------------------------

*   **Developed by:** Pablo Pernias, Dominic Rampas
    
*   **Model type:** Diffusion-based text-to-image generation model
    
*   **Language(s):** English
    
*   **License:** MIT
    
*   **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a Diffusion model in the style of Stage C from the [Würstchen paper](https://arxiv.org/abs/2306.00637) that uses a fixed, pretrained text encoder ([CLIP ViT-bigG/14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)).
    
*   **Resources for more information:** [GitHub Repository](https://github.com/dome272/Wuerstchen), [Paper](https://arxiv.org/abs/2306.00637).
    
*   **Cite as:**
    
        @inproceedings{
              pernias2024wrstchen,
              title={W\"urstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models},
              author={Pablo Pernias and Dominic Rampas and Mats Leon Richter and Christopher Pal and Marc Aubreville},
              booktitle={The Twelfth International Conference on Learning Representations},
              year={2024},
              url={https://openreview.net/forum?id=gU58d5QeGv}
        }
        
    

Environmental Impact
---------------------------------------------

**Würstchen v2** **Estimated Emissions:** We estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region listed below were used to estimate the carbon impact.

*   **Hardware Type:** A100 PCIe 40GB
*   **Hours used:** 24602
*   **Cloud Provider:** AWS
*   **Compute Region:** US-east
*   **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 2275.68 kg CO2 eq.
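
The reported figure follows the formula in the last bullet (power consumption x time x carbon intensity). As a rough check, the sketch below reproduces it under two assumptions that are not stated in this card: a power draw of 250 W per GPU and a grid carbon intensity of 0.37 kg CO2 eq. per kWh. Both values were chosen here because they are consistent with the reported total.

```python
# Value taken from the list above.
hours_used = 24602

# Assumptions (not stated in this card): per-GPU power draw and grid carbon intensity.
power_kw = 0.250            # assumed 250 W
carbon_kg_per_kwh = 0.37    # assumed carbon intensity of the power grid

energy_kwh = power_kw * hours_used
emissions_kg = energy_kwh * carbon_kg_per_kwh

print(f"Energy: {energy_kwh:.1f} kWh")
print(f"Emissions: {emissions_kg:.2f} kg CO2 eq.")
```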

## Model overview

Würstchen is a diffusion model that compresses images to a highly compact latent space, reducing computational costs for both training and inference. Unlike other models, which use a relatively small compression, Würstchen achieves a 42x spatial compression through a novel two-stage compression process. The first stage is a VQGAN, and the second stage is a Diffusion Autoencoder. This allows the model to train and run much more efficiently than top-performing diffusion models.

The [Würstchen-v2 model](https://aimodels.fyi/models/huggingFace/wuerstchen-v2-pagebrain) is a fast version of Würstchen that can generate images in around 3 seconds, while the [original Würstchen model](https://aimodels.fyi/models/huggingFace/wuerstchen-cjwbw) focuses on efficient pretraining of text-to-image models. The [Stable Cascade model](https://aimodels.fyi/models/huggingFace/stable-cascade-stabilityai) also builds on the Würstchen architecture, achieving a 42x compression factor to enable faster and cheaper training and inference.

## Model inputs and outputs

Würstchen is a text-conditional image generation model. It takes in text prompts and generates corresponding images.

### Inputs
- **Text prompt**: A description of the image to be generated, such as "a photo of an astronaut riding a horse on mars".

### Outputs
- **Generated image**: An image that corresponds to the input text prompt. The output image is generated in a highly compressed latent space and then decoded back to pixel space.

## Capabilities

Würstchen demonstrates impressive capabilities in generating visually coherent and detailed images from text prompts, despite its highly compact internal representation. The model can handle a wide range of subject matter, from landscapes to portraits, and is able to incorporate specific details as requested in the prompt.

Due to its efficient design, Würstchen can generate images much more quickly and at a lower computational cost than other top-performing diffusion models. This makes it well-suited for applications where efficiency is important, such as interactive creative tools or real-time generation.

## What can I use it for?

The Würstchen model is well-suited for research and experimental applications in the field of generative AI. Potential use cases include:

- **Art and design**: Generating conceptual artwork, illustrations, or visual assets for design projects based on textual descriptions.
- **Creative tools**: Building interactive applications that allow users to generate images by describing them in natural language.
- **Research**: Studying the capabilities and limitations of highly compressed diffusion models, and exploring techniques for improving their performance and efficiency.

## Things to try

One interesting aspect of Würstchen is its ability to generate detailed images from highly compressed latent representations. You could experiment with providing the model with different levels of compression and observe how the output quality and fidelity are affected.

Another area to explore is the model's performance on more complex or compositional prompts, which often pose challenges for text-to-image models. Trying to generate images that combine multiple elements or require specific spatial relationships could reveal interesting insights about Würstchen's strengths and weaknesses.