AltDiffusion-m9

Maintainer: BAAI

Total Score: 69

Last updated 5/21/2024


  • Model Link: View on HuggingFace
  • API Spec: View on HuggingFace
  • Github Link: No Github link provided
  • Paper Link: No paper link provided


Model overview

AltDiffusion-m9 is a multimodal, multilingual diffusion model developed by BAAI. It is based on the Stable Diffusion architecture and was trained on the WuDao and LAION datasets. The model uses the multilingual AltCLIP-m9 text encoder, which covers nine languages, allowing it to generate high-quality images from prompts in any of those languages. Compared to the original Stable Diffusion model, AltDiffusion-m9 retains most of its capabilities while showing improved performance in some areas, particularly in aligning non-English prompts with the generated images.
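As a quick illustration of basic usage, here is a minimal sketch. It assumes a diffusers release that still ships AltDiffusionPipeline and that the checkpoint is published on HuggingFace as BAAI/AltDiffusion-m9; recent diffusers versions may have deprecated or removed this pipeline, in which case an older release would be needed.

```python
import torch
from diffusers import AltDiffusionPipeline

# Load the multilingual checkpoint (assumed repo id: BAAI/AltDiffusion-m9).
pipe = AltDiffusionPipeline.from_pretrained(
    "BAAI/AltDiffusion-m9", torch_dtype=torch.float16
).to("cuda")

# Prompts can be written in any of the supported languages.
image = pipe(
    "dark elf princess, highly detailed, digital painting",
    guidance_scale=7.5,
    num_inference_steps=25,
).images[0]
image.save("altdiffusion_m9.png")
```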

Model inputs and outputs

Inputs

  • Text prompts: AltDiffusion-m9 accepts text prompts in any of its nine supported languages. The model uses the multilingual AltCLIP-m9 text encoder to process the prompts and generate corresponding images.

Outputs

  • Generated images: The model outputs high-quality, photorealistic images based on the input text prompts. The images can depict a wide range of subjects, from realistic scenes to more abstract and imaginative compositions.

Capabilities

AltDiffusion-m9 is a powerful text-to-image generation model that can create detailed and visually striking images from a variety of prompts. The model's multilingual capabilities allow it to generate high-quality images from prompts in languages other than English, making it a valuable tool for users with diverse linguistic backgrounds.

What can I use it for?

The versatility of AltDiffusion-m9 makes it suitable for a wide range of applications, including:

  • Creative projects: Designers, artists, and content creators can use the model to generate unique and inspiring visuals for their work.
  • Multilingual applications: The model's support for prompts in multiple languages makes it useful for developing applications that cater to global audiences.
  • Educational tools: Educators can leverage the model to create engaging educational materials and visualizations for their students.
  • Research and development: Researchers working on generative AI models or image synthesis can use AltDiffusion-m9 as a baseline or starting point for their experiments.

Things to try

One interesting aspect of AltDiffusion-m9 is its ability to generate high-quality images from prompts in multiple languages. Try experimenting with prompts in different languages, such as Chinese, Japanese, or Spanish, and observe how the model responds. You can also try combining the model with other tools, such as text-to-speech or natural language processing, to create more immersive and interactive experiences.
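A small sketch of such a multilingual experiment is shown below, reusing the same assumed BAAI/AltDiffusion-m9 checkpoint and AltDiffusionPipeline as in the earlier example; the prompts are purely illustrative.

```python
import torch
from diffusers import AltDiffusionPipeline

# Load the multilingual checkpoint (assumed repo id: BAAI/AltDiffusion-m9).
pipe = AltDiffusionPipeline.from_pretrained(
    "BAAI/AltDiffusion-m9", torch_dtype=torch.float16
).to("cuda")

# The same concept expressed in different languages; compare the outputs.
prompts = {
    "en": "a red fox in a snowy forest, ultra detailed",
    "zh": "雪地森林里的红色狐狸，超精细",
    "es": "un zorro rojo en un bosque nevado, muy detallado",
    "ja": "雪の森にいる赤いキツネ、超精細",
}
for lang, prompt in prompts.items():
    image = pipe(prompt, num_inference_steps=25).images[0]
    image.save(f"fox_{lang}.png")
```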

Another interesting approach is to use AltDiffusion-m9 for image-to-image translation tasks, where you can provide the model with an existing image and a text prompt to generate a new, transformed image. This could be useful for tasks like photo editing, artistic style transfer, or even image-based storytelling.
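diffusers has also shipped an image-to-image variant of this pipeline. The sketch below assumes AltDiffusionImg2ImgPipeline is available in your diffusers version and that a local file named photo.jpg exists; both are assumptions rather than guarantees.

```python
import torch
from PIL import Image
from diffusers import AltDiffusionImg2ImgPipeline

pipe = AltDiffusionImg2ImgPipeline.from_pretrained(
    "BAAI/AltDiffusion-m9", torch_dtype=torch.float16
).to("cuda")

# Start from an existing image and steer it with a text prompt.
init_image = Image.open("photo.jpg").convert("RGB").resize((512, 512))

# strength controls how far the result may drift from the input image.
result = pipe(
    prompt="the same scene as a watercolor painting",
    image=init_image,
    strength=0.6,
    guidance_scale=7.5,
).images[0]
result.save("watercolor.png")
```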



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


AltDiffusion

Maintainer: BAAI

Total Score: 57

The AltDiffusion model is a multimodal AI model developed by BAAI (Beijing Academy of Artificial Intelligence). It is a bilingual text-to-image generation model based on the Stable Diffusion architecture, able to generate high-quality images from both Chinese and English prompts. The model uses the AltCLIP text encoder, a bilingual CLIP model that improves alignment between text and images in both languages. The training data includes the WuDao and LAION datasets. Compared to the original Stable Diffusion model, AltDiffusion retains most of its capabilities while demonstrating improved performance on certain tasks, especially in aligning Chinese and English concepts with the generated images.

Model inputs and outputs

Inputs

  • Text prompt: A text description of the desired image to be generated.

Outputs

  • Generated image: A high-quality, photorealistic image that matches the provided text prompt.

Capabilities

The AltDiffusion model can generate a wide variety of images, from realistic scenes to fantastical and imaginative creations. It handles prompts in both Chinese and English, and the generated images show strong alignment between the text and visual content. Key capabilities include:

  • Generating high-quality, photorealistic images from text prompts
  • Handling both Chinese and English prompts with equal proficiency
  • Demonstrating improved text-image alignment compared to the original Stable Diffusion model
  • Retaining most of the original Stable Diffusion model's capabilities, such as generating diverse and compelling images

What can I use it for?

The AltDiffusion model can be used for a variety of applications, such as:

  • Creative content generation: Generate unique, compelling images for art, design, and other creative projects.
  • Education and research: Explore the model's capabilities and limitations, and use it to further the development of text-to-image generation technologies.
  • Multimodal applications: Integrate the model into applications that require both text and image processing, such as language learning, image captioning, and visual question answering.

Things to try

Here are some ideas for things you can try with the AltDiffusion model:

  • Experiment with different prompts: Generate images from a wide range of prompts, in both English and Chinese, to probe the model's capabilities and limitations.
  • Combine the model with other AI tools: Explore how AltDiffusion can be integrated with language models or image editing software to create more sophisticated applications.
  • Analyze the model's performance: Run your own evaluations, for example by comparing it to the original Stable Diffusion model or other text-to-image generation models.
  • Contribute to the model's development: If you're a developer or researcher, consider contributing to the FlagAI project, which provides the AltDiffusion model, to help improve its capabilities and expand its applications.
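To illustrate the bilingual prompting described above, here is a minimal sketch. It assumes a diffusers version that still includes AltDiffusionPipeline and that the bilingual checkpoint is published on HuggingFace as BAAI/AltDiffusion; the prompts are only examples.

```python
import torch
from diffusers import AltDiffusionPipeline

pipe = AltDiffusionPipeline.from_pretrained(
    "BAAI/AltDiffusion", torch_dtype=torch.float16
).to("cuda")

# The bilingual encoder lets Chinese and English prompts target the same concept.
image_zh = pipe("黑暗精灵公主，非常详细，幻想艺术").images[0]
image_en = pipe("dark elf princess, highly detailed, fantasy art").images[0]
image_zh.save("altdiffusion_zh.png")
image_en.save("altdiffusion_en.png")
```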



stable-diffusion

Maintainer: stability-ai

Total Score: 107.9K

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from any text input. Developed by Stability AI, it can create striking visuals from simple text prompts. The model has several versions, with each newer version trained for longer and producing higher-quality images than the previous ones. The main advantage of Stable Diffusion is its ability to generate highly detailed and realistic images from a wide range of textual descriptions, making it a powerful tool for creative applications and allowing users to visualize their ideas and concepts in a photorealistic way. The model has been trained on a large and diverse dataset, enabling it to handle a broad spectrum of subjects and styles.

Model inputs and outputs

Inputs

  • Prompt: The text prompt that describes the desired image. This can be a simple description or a more detailed, creative prompt.
  • Seed: An optional random seed value to control the randomness of the image generation process.
  • Width and Height: The desired dimensions of the generated image, which must be multiples of 64.
  • Scheduler: The algorithm used to generate the image, with options like DPMSolverMultistep.
  • Num Outputs: The number of images to generate (up to 4).
  • Guidance Scale: The scale for classifier-free guidance, which controls the trade-off between image quality and faithfulness to the input prompt.
  • Negative Prompt: Text that specifies things the model should avoid including in the generated image.
  • Num Inference Steps: The number of denoising steps to perform during the image generation process.

Outputs

  • Array of image URLs: The generated images are returned as an array of URLs pointing to the created images.

Capabilities

Stable Diffusion is capable of generating a wide variety of photorealistic images from text prompts. It can create images of people, animals, landscapes, architecture, and more, with a high level of detail and accuracy. The model is particularly skilled at rendering complex scenes and capturing the essence of the input prompt. One of its key strengths is its ability to handle diverse prompts, from simple descriptions to more creative and imaginative ideas, generating fantastical creatures, surreal landscapes, and even abstract concepts with impressive results.

What can I use it for?

Stable Diffusion can be used for a variety of creative applications, such as:

  • Visualizing ideas and concepts for art, design, or storytelling
  • Generating images for use in marketing, advertising, or social media
  • Aiding in the development of games, movies, or other visual media
  • Exploring and experimenting with new ideas and artistic styles

The model's versatility and high-quality output make it a valuable tool for anyone looking to bring their ideas to life through visual art. By combining the power of AI with human creativity, Stable Diffusion opens up new possibilities for visual expression and innovation.

Things to try

One interesting aspect of Stable Diffusion is its ability to generate images with a high level of detail and realism. Try prompts that combine specific elements, such as "a steam-powered robot exploring a lush, alien jungle," to see how the model handles complex and imaginative scenes. The model's support for different image sizes and resolutions also lets you explore the limits of its capabilities: by generating images at various scales, you can see how it handles the level of detail and complexity required for different use cases, such as high-resolution artwork or smaller social media graphics. Experimenting with different prompts, settings, and output formats is the best way to unlock the model's full potential.
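The input parameters listed above correspond to the hosted version of the model. A minimal sketch of calling it through the replicate Python client might look like the following; it assumes the client is installed, the REPLICATE_API_TOKEN environment variable is set, and that the plain model identifier resolves to a current version of stability-ai/stable-diffusion.

```python
import replicate

# Run the hosted text-to-image model with the parameters described above.
output = replicate.run(
    "stability-ai/stable-diffusion",
    input={
        "prompt": "a steam-powered robot exploring a lush, alien jungle",
        "width": 768,                      # must be a multiple of 64
        "height": 512,                     # must be a multiple of 64
        "num_outputs": 1,
        "guidance_scale": 7.5,
        "num_inference_steps": 50,
        "scheduler": "DPMSolverMultistep",
        "negative_prompt": "blurry, low quality",
    },
)
print(output)  # typically a list of URLs pointing to the generated images
```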



EimisAnimeDiffusion_1.0v

Maintainer: eimiss

Total Score: 401

The EimisAnimeDiffusion_1.0v is a diffusion model trained by eimiss on high-quality, detailed anime images. It is capable of generating anime-style artwork from text prompts. The model builds upon the capabilities of similar anime text-to-image models like waifu-diffusion and Animagine XL 3.0, offering enhancements in areas such as hand anatomy, prompt interpretation, and overall image quality.

Model inputs and outputs

Inputs

  • Textual prompts: The model takes in text prompts that describe the desired anime-style artwork, such as "1girl, Phoenix girl, fluffy hair, war, a hell on earth, Beautiful and detailed explosion".

Outputs

  • Generated images: The model outputs high-quality, detailed anime-style images that match the provided text prompts. The generated images can depict a wide range of scenes, characters, and environments.

Capabilities

The EimisAnimeDiffusion_1.0v model demonstrates strong capabilities in generating anime-style artwork. It can create detailed and aesthetically pleasing images of anime characters, landscapes, and scenes. The model handles a variety of prompts well, from character descriptions to complex scenes with multiple elements.

What can I use it for?

The EimisAnimeDiffusion_1.0v model can be a valuable tool for artists, designers, and hobbyists looking to create anime-inspired artwork. It can be used to generate concept art, character designs, or illustrations for personal projects, games, or animations. The model's ability to produce high-quality images from text prompts makes it accessible to users with varying levels of artistic skill.

Things to try

One interesting aspect of the EimisAnimeDiffusion_1.0v model is its ability to generate images with different art styles and moods through specific prompts. For example, adding tags like "masterpiece" or "best quality" can steer the model towards more polished, high-quality artwork, while negative prompts like "lowres" or "bad anatomy" help avoid undesirable artifacts. Experimenting with prompt engineering and understanding the model's strengths and limitations can lead to unique and captivating anime-style images.
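To illustrate the tag-based prompting and negative prompts mentioned above, here is a minimal sketch. It assumes the checkpoint is published on HuggingFace as eimiss/EimisAnimeDiffusion_1.0v and ships in diffusers format that loads with the standard Stable Diffusion pipeline; if only a .ckpt file is provided, it would need to be converted first.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "eimiss/EimisAnimeDiffusion_1.0v", torch_dtype=torch.float16
).to("cuda")

# Quality tags steer toward polished output; negative prompts suppress artifacts.
image = pipe(
    prompt="1girl, Phoenix girl, fluffy hair, beautiful and detailed explosion, "
           "masterpiece, best quality",
    negative_prompt="lowres, bad anatomy, bad hands, blurry",
    guidance_scale=7.0,
    num_inference_steps=28,
).images[0]
image.save("phoenix_girl.png")
```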



Taiyi-Stable-Diffusion-XL-3.5B

Maintainer: IDEA-CCNL

Total Score: 53

The Taiyi-Stable-Diffusion-XL-3.5B is a powerful text-to-image model developed by IDEA-CCNL that builds upon the foundations of models like Google's Imagen and OpenAI's DALL-E 3. Unlike previous Chinese text-to-image models, which had moderate effectiveness, Taiyi-XL focuses on enhancing Chinese text-to-image generation while retaining English proficiency, addressing the unique challenges of bilingual language processing.

The training of the Taiyi-Diffusion-XL model involved several key stages. First, a high-quality dataset of image-text pairs was created, with advanced vision-language models generating accurate captions to enrich the dataset. Then, the vocabulary and position encoding of a pre-trained English CLIP model were expanded to better support Chinese and longer texts. Finally, based on Stable-Diffusion-XL, the text encoder was replaced and multi-resolution, aspect-ratio-variant training was conducted on the prepared dataset.

Similar models include Taiyi-Stable-Diffusion-1B-Chinese-v0.1, the first open-source Chinese Stable Diffusion model, and AltDiffusion, a bilingual text-to-image diffusion model developed by BAAI.

Model inputs and outputs

Inputs

  • Prompt: A text description of the desired image, which can be in English or Chinese.

Outputs

  • Image: A visually compelling image generated based on the input prompt.

Capabilities

The Taiyi-Stable-Diffusion-XL-3.5B model excels at generating high-quality, detailed images from both English and Chinese text prompts. It can create a wide range of content, from realistic scenes to fantastical illustrations. The model's bilingual capabilities make it a valuable tool for artists and creators working in both languages.

What can I use it for?

The Taiyi-Stable-Diffusion-XL-3.5B model can be used for a variety of creative and professional applications. Artists and designers can leverage it to generate concept art, illustrations, and other digital assets. Educators and researchers can use it to explore the capabilities of text-to-image generation and its applications in areas like art, design, and language learning. Developers can integrate the model into creative tools and applications to give users powerful image generation capabilities.

Things to try

One interesting aspect of the Taiyi-Stable-Diffusion-XL-3.5B model is its ability to generate high-resolution, long-form images. Try experimenting with prompts that describe complex scenes or panoramic views to see the model's capabilities in this area. You can also explore the model's performance on specific types of images, such as portraits, landscapes, or fantasy scenes, to understand its strengths and limitations.
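A minimal usage sketch follows, assuming the checkpoint is published on HuggingFace as IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B and loads through the standard SDXL pipeline in diffusers; the Chinese prompt is only an example, and English prompts should work as well.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B", torch_dtype=torch.float16
).to("cuda")

# "A Chinese landscape painting, swirling mist, ink-wash style"
image = pipe(
    prompt="一幅中国山水画，云雾缭绕，水墨风格",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("taiyi_xl.png")
```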
