Stability AI
Models by this creator
🛠️
stable-diffusion-xl-base-1.0
5.3K
The stable-diffusion-xl-base-1.0 model is a text-to-image generative AI model developed by Stability AI. It is a Latent Diffusion Model that uses two fixed, pretrained text encoders (OpenCLIP-ViT/G and CLIP-ViT/L). The model sits at the front of an ensemble-of-experts pipeline: the base model generates latents that are then further processed by a specialized refinement model. Alternatively, the base model can be used on its own to generate latents, which can then be processed using a high-resolution model and the SDEdit technique for image-to-image generation. Similar models include the stable-diffusion-xl-refiner-1.0 and stable-diffusion-xl-refiner-0.9 models, which serve as the refinement modules for the base stable-diffusion-xl-base-1.0 model.

Model inputs and outputs

Inputs
- **Text prompt**: A natural language description of the desired image to generate.

Outputs
- **Generated image**: An image generated from the input text prompt.

Capabilities

The stable-diffusion-xl-base-1.0 model can generate a wide variety of images based on text prompts, ranging from photorealistic scenes to more abstract and stylized imagery. The model performs particularly well on tasks like generating artworks, fantasy scenes, and conceptual designs. However, it struggles with more complex compositional tasks, such as rendering an image of a red cube on top of a blue sphere.

What can I use it for?

The stable-diffusion-xl-base-1.0 model is intended for research purposes, such as:
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models and their limitations and biases.
- Safe deployment of models with the potential to generate harmful content.

For commercial use, Stability AI provides a membership program, as detailed on their website.

Things to try

One interesting aspect of the stable-diffusion-xl-base-1.0 model is its ability to generate high-quality images with relatively few inference steps. By using the specialized refinement model or the SDEdit technique, users can achieve impressive results with a more efficient inference process. Additionally, the model's performance can be further optimized by utilizing techniques like CPU offloading or torch.compile, as mentioned in the provided documentation.
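A minimal sketch of using the base model on its own with the Hugging Face diffusers library is shown below; the prompt, fp16 settings, output filename, and CUDA device are illustrative assumptions, and the commented-out offloading call is one of the memory-saving options mentioned above.

```python
import torch
from diffusers import DiffusionPipeline

# Load the SDXL base pipeline in half precision (assumes a CUDA GPU with enough VRAM).
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.to("cuda")
# pipe.enable_model_cpu_offload()  # optional: trade speed for lower GPU memory use

# Example prompt (illustrative).
image = pipe(prompt="An astronaut riding a green horse").images[0]
image.save("sdxl_base_sample.png")
```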
Updated 5/28/2024
⚙️
stable-diffusion-2-1
3.7K
The stable-diffusion-2-1 model is a text-to-image generation model developed by Stability AI. It is a fine-tuned version of the stable-diffusion-2 model, with an additional 55k steps on the same dataset and then a further 155k steps with adjusted "unsafety" settings. Similar models include the stable-diffusion-2-1-base which fine-tunes the stable-diffusion-2-base model. Model inputs and outputs The stable-diffusion-2-1 model is a diffusion-based text-to-image generation model that takes text prompts as input and generates corresponding images as output. The text prompts are encoded using a fixed, pre-trained text encoder, and the generated images are 768x768 pixels in size. Inputs Text prompt**: A natural language description of the desired image. Outputs Image**: A 768x768 pixel image generated based on the input text prompt. Capabilities The stable-diffusion-2-1 model can generate a wide variety of images based on text prompts, from realistic scenes to fantastical creations. It demonstrates impressive capabilities in areas like generating detailed and complex images, rendering different styles and artistic mediums, and combining diverse visual elements. However, the model still has limitations in terms of generating fully photorealistic images, rendering legible text, and handling more complex compositional tasks. What can I use it for? The stable-diffusion-2-1 model is intended for research purposes only. Possible use cases include generating artworks and designs, creating educational or creative tools, and probing the limitations and biases of generative models. The model should not be used to intentionally create or disseminate images that could be harmful, offensive, or propagate stereotypes. Things to try One interesting aspect of the stable-diffusion-2-1 model is its ability to generate images with different styles and artistic mediums based on the text prompt. For example, you could try prompts that combine realistic elements with more fantastical or stylized components, or experiment with prompts that evoke specific artistic movements or genres. The model's performance may also vary depending on the language and cultural context of the prompt, so exploring prompts in different languages could yield interesting results.
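A minimal text-to-image sketch with diffusers follows; the DPM-Solver scheduler swap, prompt, and output filename are illustrative assumptions rather than the only supported configuration.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

model_id = "stabilityai/stable-diffusion-2-1"

# Load the pipeline in half precision and swap in a multistep DPM solver scheduler.
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# The checkpoint generates 768x768 images by default; the prompt is illustrative.
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("sd21_sample.png")
```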
Updated 5/28/2024
✨
stable-video-diffusion-img2vid-xt
2.3K
The stable-video-diffusion-img2vid-xt model is a diffusion-based generative model developed by Stability AI that takes in a still image and generates a short video clip from it. It is an extension of the SVD Image-to-Video model, generating 25 frames at a resolution of 576x1024 compared to the 14 frames of the earlier model. This model was trained on a large dataset and fine-tuned to improve temporal consistency and video quality.

Model inputs and outputs

The stable-video-diffusion-img2vid-xt model takes in a single image as input and generates a short video clip as output. The input image must be 576x1024 pixels in size.

Inputs
- **Image**: A 576x1024 pixel image that serves as the conditioning frame for the video generation.

Outputs
- **Video**: A 25-frame video clip at 576x1024 resolution, generated from the input image.

Capabilities

The stable-video-diffusion-img2vid-xt model is capable of generating short, high-quality video clips from a single input image. It is able to capture movement, action, and dynamic scenes based on the content of the conditioning image. While it does not achieve perfect photorealism, the generated videos demonstrate impressive temporal consistency and visual fidelity.

What can I use it for?

The stable-video-diffusion-img2vid-xt model is intended for research purposes, such as exploring generative models, probing the limitations of video generation, and developing artistic or creative applications. It could be used to generate dynamic visual content for design, educational, or entertainment purposes. However, the model should not be used to generate content that is harmful, misleading, or in violation of Stability AI's Acceptable Use Policy.

Things to try

One interesting aspect of the stable-video-diffusion-img2vid-xt model is its ability to generate video from a single image, capturing a sense of motion and dynamism that goes beyond the static source. Experimenting with different types of input images, such as landscapes, portraits, or abstract compositions, could lead to a diverse range of video outputs that showcase the model's flexibility and creativity. Additionally, you could try varying conditioning parameters such as the motion bucket or the amount of noise augmentation to see how the model responds and explore the limits of its capabilities.
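A minimal image-to-video sketch with diffusers is shown below; the input URL is a placeholder, and the seed, chunk size, frame rate, and output filename are illustrative assumptions.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # keeps peak GPU memory usage manageable

# Conditioning image (placeholder URL), resized to the expected 576x1024 resolution,
# i.e. (1024, 576) in PIL's (width, height) order.
image = load_image("https://example.com/input.jpg")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)  # fixed seed for reproducibility
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```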
Updated 5/28/2024
🔍
sdxl-turbo
2.1K
sdxl-turbo is a fast generative text-to-image model developed by Stability AI. It is a distilled version of the SDXL 1.0 Base model, trained using a novel technique called Adversarial Diffusion Distillation (ADD) to enable high-quality image synthesis in just 1-4 steps. This approach leverages a large-scale off-the-shelf image diffusion model as a teacher signal and combines it with an adversarial loss to ensure high fidelity even with fewer sampling steps.

Model Inputs and Outputs

sdxl-turbo is a text-to-image generative model. It takes a text prompt as input and generates a corresponding photorealistic image as output. The model is optimized for real-time synthesis, allowing for fast image generation from a text description.

Inputs
- Text prompt describing the desired image

Outputs
- Photorealistic image generated based on the input text prompt

Capabilities

sdxl-turbo is capable of generating high-quality, photorealistic images from text prompts in a single network evaluation. This makes it suitable for real-time, interactive applications where fast image synthesis is required.

What Can I Use It For?

With sdxl-turbo's fast and high-quality image generation capabilities, you can explore a variety of applications, such as interactive art tools, visual storytelling platforms, or even prototyping and visualization for product design. The model's real-time performance also makes it well-suited for use in live demos or AI-powered creative assistants. For commercial use, please refer to Stability AI's membership options.

Things to Try

One interesting aspect of sdxl-turbo is its ability to generate images with a high degree of fidelity using just 1-4 sampling steps. This makes it possible to experiment with rapid image synthesis, where the user can quickly generate and iterate on visual ideas. Try exploring different text prompts and observe how the model's output changes with the number of sampling steps.
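A minimal single-step sketch with diffusers follows; the prompt and output filename are illustrative. Classifier-free guidance is disabled because the distilled model is designed to run without it.

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
)
pipe.to("cuda")

# guidance_scale=0.0 disables classifier-free guidance; try num_inference_steps from 1 to 4.
image = pipe(
    prompt="A cinematic shot of a raccoon wearing an intricate Italian priest robe",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("sdxl_turbo_sample.png")
```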
Updated 5/28/2024
🏋️
stable-diffusion-2
1.8K
The stable-diffusion-2 model is a diffusion-based text-to-image generation model developed by Stability AI. It is an improved version of the original Stable Diffusion model, trained for 150k steps using a v-objective on the same dataset as the base model. The model is capable of generating high-resolution images (768x768) from text prompts, and can be used with the stablediffusion repository or the diffusers library. Similar models include the SDXL-Turbo and Stable Cascade models, which are also developed by Stability AI. The SDXL-Turbo model is a distilled version of the SDXL 1.0 model, optimized for real-time synthesis, while the Stable Cascade model uses a novel multi-stage architecture to achieve high-quality image generation with a smaller latent space.

Model inputs and outputs

Inputs
- **Text prompt**: A text description of the desired image, which the model uses to generate the corresponding image.

Outputs
- **Image**: The generated image based on the input text prompt, with a resolution of 768x768 pixels.

Capabilities

The stable-diffusion-2 model can be used to generate a wide variety of images from text prompts, including photorealistic scenes, imaginative concepts, and abstract compositions. The model has been trained on a large and diverse dataset, allowing it to handle a broad range of subject matter and styles. Some example use cases for the model include:
- Creating original artwork and illustrations
- Generating concept art for games, films, or other media
- Experimenting with different visual styles and aesthetics
- Assisting with visual brainstorming and ideation

What can I use it for?

The stable-diffusion-2 model is intended for both non-commercial and commercial usage. For non-commercial or research purposes, you can use the model under the CreativeML Open RAIL++-M License. Possible research areas and tasks include:
- Research on generative models
- Research on the impact of real-time generative models
- Probing and understanding the limitations and biases of generative models
- Generation of artworks and use in design and other artistic processes
- Applications in educational or creative tools

For commercial use, please refer to https://stability.ai/membership.

Things to try

One interesting aspect of the stable-diffusion-2 model is its ability to generate highly detailed and photorealistic images, even for complex scenes and concepts. Try experimenting with detailed prompts that describe intricate settings, characters, or objects, and see the model's ability to bring those visions to life. Additionally, you can explore the model's versatility by generating images in a variety of styles, from realism to surrealism, impressionism to expressionism. Experiment with different artistic styles and see how the model interprets and renders them.
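A minimal diffusers sketch for this checkpoint is shown below; the Euler scheduler pairing, prompt, and output filename are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

model_id = "stabilityai/stable-diffusion-2"

# Pair the v-prediction checkpoint with an Euler discrete scheduler loaded from the same repo.
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Native resolution for this checkpoint is 768x768; the prompt is illustrative.
image = pipe("a professional photograph of an astronaut riding a horse", height=768, width=768).images[0]
image.save("sd2_sample.png")
```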
Updated 5/28/2024
⛏️
stable-diffusion-xl-refiner-1.0
1.5K
The stable-diffusion-xl-refiner-1.0 model is a diffusion-based text-to-image generative model developed by Stability AI. It is part of the SDXL model family, which consists of an ensemble-of-experts pipeline for latent diffusion. The base model is used to generate initial latents, which are then further processed by a specialized refinement model to produce the final high-quality image. The model can be used in two ways: either through a single-stage pipeline that uses the base and refiner models together, or a two-stage pipeline that first generates latents with the base model and then applies the refiner model. The two-stage approach is slightly slower but can produce even higher quality results. Similar models in the SDXL family include the sdxl-turbo and sdxl models, which offer different trade-offs in terms of speed, quality, and ease of use.

Model Inputs and Outputs

Inputs
- **Text prompt**: A natural language description of the desired image.

Outputs
- **Image**: A high-quality generated image matching the provided text prompt.

Capabilities

The stable-diffusion-xl-refiner-1.0 model can generate photorealistic images from text prompts covering a wide range of subjects and styles. It excels at producing detailed, visually striking images that closely align with the provided description.

What Can I Use It For?

The stable-diffusion-xl-refiner-1.0 model is intended for both non-commercial and commercial usage. Possible applications include:
- **Research on generative models**: Studying the model's capabilities, limitations, and biases can provide valuable insights for the field of AI-generated content.
- **Creative and artistic processes**: The model can be used to generate unique and inspiring images for use in design, illustration, and other artistic endeavors.
- **Educational tools**: The model could be integrated into educational applications to foster creativity and visual learning.

For commercial use, please refer to the Stability AI membership page.

Things to Try

One interesting aspect of the stable-diffusion-xl-refiner-1.0 model is its ability to produce high-quality images through a two-stage process. Try experimenting with both the single-stage and two-stage pipelines to see how the results differ in terms of speed, quality, and other characteristics. You may find that the two-stage approach is better suited for certain types of prompts or use cases. Additionally, explore how the model handles more complex or abstract prompts, such as those involving multiple objects, scenes, or concepts. The model's performance on these types of prompts can provide insights into its understanding of language and compositional reasoning.
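A hedged sketch of the ensemble-of-experts base-plus-refiner workflow with diffusers follows; the prompt, step count, 0.8 split fraction, and output filename are illustrative assumptions you would tune per use case.

```python
import torch
from diffusers import DiffusionPipeline

# Base model handles the high-noise portion of the denoising schedule.
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

# Refiner shares the second text encoder and VAE with the base to save memory.
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2, vae=base.vae,
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

prompt = "A majestic lion jumping from a big stone at night"  # illustrative
n_steps, high_noise_frac = 40, 0.8  # assumed split point between base and refiner

# Base stops at the split point and hands raw latents to the refiner.
latents = base(
    prompt=prompt, num_inference_steps=n_steps,
    denoising_end=high_noise_frac, output_type="latent",
).images
image = refiner(
    prompt=prompt, num_inference_steps=n_steps,
    denoising_start=high_noise_frac, image=latents,
).images[0]
image.save("sdxl_refined_sample.png")
```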
Updated 5/28/2024
🤯
stable-diffusion-xl-base-0.9
1.4K
The stable-diffusion-xl-base-0.9 model is a text-to-image generative model developed by Stability AI. It is a Latent Diffusion Model that uses two fixed, pretrained text encoders (OpenCLIP-ViT/G and CLIP-ViT/L). The model consists of a two-step pipeline for latent diffusion: first generating latents of the desired output size, then refining them using a specialized high-resolution model and a technique called SDEdit (https://arxiv.org/abs/2108.01073). This model builds upon the capabilities of previous Stable Diffusion models, improving image quality and prompt following.

Model inputs and outputs

Inputs
- **Prompt**: A text description of the desired image to generate.

Outputs
- **Image**: A 512x512 pixel image generated based on the input prompt.

Capabilities

The stable-diffusion-xl-base-0.9 model can generate a wide variety of images based on text prompts, from realistic scenes to fantastical creations. It performs significantly better than previous Stable Diffusion models in terms of image quality and prompt following, as demonstrated by user preference evaluations. The model can be particularly useful for tasks like artwork generation, creative design, and educational applications.

What can I use it for?

The stable-diffusion-xl-base-0.9 model is intended for research purposes, such as generation of artworks, applications in educational or creative tools, research on generative models, and probing the limitations and biases of the model. While the model is not suitable for generating factual or true representations of people or events, it can be a powerful tool for artistic expression and exploration. For commercial use, please refer to Stability AI's membership options.

Things to try

One interesting aspect of the stable-diffusion-xl-base-0.9 model is its ability to generate high-quality images using a two-step pipeline. Try experimenting with different combinations of the base model and refinement model to see how the results vary in terms of image quality, detail, and prompt following. You can also explore the model's capabilities in generating specific types of imagery, such as surreal or fantastical scenes, and see how it handles more complex prompts involving compositional elements.
Updated 4/29/2024
🤖
sd-vae-ft-mse-original
1.3K
The sd-vae-ft-mse-original model is an improved autoencoder developed by the Stability AI team. It is a fine-tuned version of the original kl-f8 autoencoder used in the Stable Diffusion model. The team fine-tuned the decoder on a 1:1 ratio of LAION-Aesthetics and LAION-Humans datasets to improve the reconstruction of faces. Two versions were released: ft-EMA, which uses exponential moving average (EMA) weights, and ft-MSE, which emphasizes mean squared error (MSE) reconstruction over the original L1 and LPIPS loss.

The sd-vae-ft-mse-original model shows improvements over the original kl-f8 autoencoder in terms of PSNR, SSIM, and PSIM metrics on the COCO 2017 and LAION-Aesthetics datasets. The ft-MSE version in particular produces "smoother" outputs compared to the original.

Model inputs and outputs

Inputs
- Images of various sizes (originally trained on 256x256 but can handle higher resolutions)

Outputs
- Reconstructed images from the model's latent representation
- Evaluation metrics like rFID, PSNR, SSIM, and PSIM to assess reconstruction quality

Capabilities

The sd-vae-ft-mse-original model is an improved autoencoder that can be used as a drop-in replacement for the original kl-f8 autoencoder used in Stable Diffusion. It shows better performance on reconstruction tasks, especially for faces and human subjects, due to the fine-tuning on the LAION-Humans dataset.

What can I use it for?

The sd-vae-ft-mse-original model can be used in the original CompVis Stable Diffusion codebase as a replacement for the autoencoder. This can potentially improve the quality and realism of generated images, especially those involving human subjects.

Things to try

Researchers and developers can experiment with the different fine-tuned versions of the autoencoder (ft-EMA and ft-MSE) to see how they impact the performance and output quality of the Stable Diffusion model. The smoother outputs of the ft-MSE version may be beneficial for certain use cases.
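While this entry covers the checkpoint packaged for the original CompVis codebase, a quick way to try the same fine-tuned decoder is through its diffusers-format release; the sketch below assumes that release (stabilityai/sd-vae-ft-mse), an SD 1.x base checkpoint, and an illustrative prompt.

```python
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

# Diffusers-format release of the fine-tuned ft-MSE autoencoder.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)

# Attach it to a Stable Diffusion 1.x pipeline in place of the original kl-f8 autoencoder.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", vae=vae, torch_dtype=torch.float16
).to("cuda")

image = pipe("a portrait photo of a smiling person").images[0]  # illustrative prompt
image.save("sd_ft_mse_vae_sample.png")
```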
Updated 5/28/2024
🔍
stable-cascade
1.2K
Stable Cascade is a diffusion model developed by Stability AI that is capable of generating images from text prompts. It is built upon the Würstchen architecture and achieves a significantly higher compression factor compared to Stable Diffusion. While Stable Diffusion encodes a 1024x1024 image to 128x128, Stable Cascade is able to encode it to just 24x24 while maintaining crisp reconstructions. This allows for faster inference and cheaper training, making it well-suited for use cases where efficiency is important. The model consists of three stages, Stage A, Stage B, and Stage C: Stage C generates the compressed latent representation from the text prompt, while Stages B and A handle the compression and decode that latent back into the final image.

Model inputs and outputs

Stable Cascade is a generative text-to-image model. It takes a text prompt as input and generates a corresponding image as output.

Inputs
- Text prompt describing the desired image

Outputs
- An image generated based on the input text prompt

Capabilities

Stable Cascade is capable of generating high-quality images from text prompts in a highly compressed latent space, allowing for faster and more cost-effective model inference compared to other text-to-image models like Stable Diffusion. The model is well-suited for use cases where efficiency is important, and can also be fine-tuned or extended using techniques like LoRA, ControlNet, and IP-Adapter.

What can I use it for?

The Stable Cascade model can be used for a variety of applications where generating images from text prompts is useful, such as:
- Creative art and design projects
- Prototyping and visualization
- Educational and research purposes
- Development of real-time generative applications

Due to its efficient architecture, the model is particularly well-suited for use cases where processing speed and cost are important factors, such as in mobile or edge computing applications.

Things to try

One interesting aspect of the Stable Cascade model is its highly compressed latent space representation. You could experiment with this by trying to generate images from prompts using only the small 24x24 latent representations, and see how the image quality and fidelity to the prompt compare to using the full-resolution input. Additionally, you could explore how the model's performance and capabilities change when fine-tuned or extended using techniques like LoRA, ControlNet, and IP-Adapter, as the maintainers suggest these extensions are possible with the Stable Cascade architecture.
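A hedged sketch of the two-pipeline workflow in diffusers follows; the prompt, dtype/variant choices, step counts, and output filename are illustrative assumptions that may need adjusting for your hardware and diffusers version.

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

prompt = "a shiba inu wearing a spacesuit and helmet"  # illustrative

# Stage C: generate the highly compressed latent (image embedding) from the text prompt.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16
).to("cuda")
prior_output = prior(prompt=prompt, height=1024, width=1024, guidance_scale=4.0, num_inference_steps=20)

# Stages B and A: decode the compressed latent back into a full-resolution image.
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16
).to("cuda")
image = decoder(
    image_embeddings=prior_output.image_embeddings.to(torch.float16),
    prompt=prompt, guidance_scale=0.0, num_inference_steps=10, output_type="pil",
).images[0]
image.save("stable_cascade_sample.png")
```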
Updated 5/28/2024
🎲
StableBeluga2
884
Stable Beluga 2 is a Llama2 70B model fine-tuned by Stability AI on an Orca-style dataset. It is part of a family of Beluga models, with other variants including StableBeluga 1 - Delta, StableBeluga 13B, and StableBeluga 7B. These models are designed to be highly capable language models that follow instructions well and provide helpful, safe, and unbiased assistance.

Model inputs and outputs

Stable Beluga 2 is an autoregressive language model that takes text as input and generates text as output. It can be used for a variety of natural language processing tasks, such as text generation, summarization, and question answering.

Inputs
- Text prompts

Outputs
- Generated text
- Responses to questions or instructions

Capabilities

Stable Beluga 2 is a highly capable language model that can engage in open-ended dialogue, answer questions, and assist with a variety of tasks. It has been trained to follow instructions carefully and provide helpful, safe, and unbiased responses. The model performs well on benchmarks for commonsense reasoning, world knowledge, and other important language understanding capabilities.

What can I use it for?

Stable Beluga 2 can be used for a variety of applications, such as:
- Building conversational AI assistants
- Generating creative writing or content
- Answering questions and providing information
- Summarizing text
- Providing helpful instructions and advice

The model's strong performance on safety and helpfulness benchmarks makes it well-suited for use cases that require a reliable and trustworthy AI assistant.

Things to try

Some interesting things to try with Stable Beluga 2 include:
- Engaging the model in open-ended dialogue to see the breadth of its conversational abilities
- Asking it to provide step-by-step instructions for completing a task
- Prompting it to generate creative stories or poems
- Evaluating its performance on specific language understanding benchmarks or tasks

The model's flexibility and focus on safety and helpfulness make it a compelling choice for a wide range of natural language processing applications.
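A minimal generation sketch with the Hugging Face transformers library follows; the system prompt wording, user message, and sampling settings are illustrative, and the 70B model requires substantial GPU memory (or quantization) to run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto"
)

# The model expects an Orca-style "### System / ### User / ### Assistant" prompt layout.
system_prompt = "### System:\nYou are Stable Beluga, a helpful and harmless AI assistant.\n\n"
message = "Write me a short poem about the ocean."  # illustrative request
prompt = f"{system_prompt}### User: {message}\n\n### Assistant:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=True, top_p=0.95, top_k=0, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```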
Updated 5/27/2024