Introduction
============

*   Tech report: [https://arxiv.org/abs/2401.14688](https://arxiv.org/abs/2401.14688)
*   Demo: [https://huggingface.co/spaces/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B](https://huggingface.co/spaces/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B)
*   Training code: [https://github.com/IDEA-CCNL/Taiyi-Diffusion-XL](https://github.com/IDEA-CCNL/Taiyi-Diffusion-XL)
*   Deployment WebUI: [https://github.com/IDEA-CCNL/Fooocus-Taiyi-XL](https://github.com/IDEA-CCNL/Fooocus-Taiyi-XL)

[![prompt](/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/resolve/main/imgs/high-resolution.jpg)](/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/blob/main/imgs/high-resolution.jpg)

The surge of text-to-image models such as Google's Imagen, OpenAI's DALL-E 3, and Stability AI's Stable Diffusion has transformed digital art creation. However, existing Chinese text-to-image models, such as Taiyi-Diffusion-v0.1 and Alt-Diffusion based on SD v1.5, remain only moderately effective. Many AI art platforms in China support only English or rely on Chinese-to-English translation tools, and current open-source text-to-image models predominantly support English, with limited bilingual capabilities. Our work, Taiyi-Diffusion-XL (Taiyi-XL), builds on these developments: it focuses on improving Chinese text-to-image generation while retaining English proficiency, addressing the challenges of bilingual language processing.

Model Training
==============

[![Taiyi-Diffusion-XL](/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/resolve/main/imgs/overview_00.png)](/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/blob/main/imgs/overview_00.png)

Training the Taiyi-Diffusion-XL text-to-image model involved three main stages. First, we created a high-quality dataset of image-text pairs in which each image is accompanied by a detailed descriptive text. To overcome the limitations of web-crawled data, we used large vision-language models to generate captions that accurately describe the images, which made the dataset more relevant and detailed. Second, we started from a pre-trained English CLIP model and expanded its vocabulary and position encoding to better support Chinese and longer texts. This expansion was achieved by training on a large-scale bilingual dataset with a contrastive loss function and a memory-efficient approach. Finally, starting from Stable-Diffusion-XL, we replaced its text encoder with the one obtained in the second stage and conducted multi-resolution, aspect-ratio-variant training of the diffusion model on the dataset prepared in the first stage. This training process allows Taiyi-Diffusion-XL to support bilingual text-to-image generation across diverse linguistic and visual requirements.
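The second stage relies on a contrastive loss. As a rough illustration only, the following is a minimal pure-Python sketch of the symmetric CLIP-style contrastive objective, assuming L2-normalized embeddings and a fixed temperature of 0.07. The function names and the temperature value are illustrative assumptions, not taken from the Taiyi-XL training code, and the memory-efficient training approach mentioned above is not modeled here.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_row(logits, target_index):
    """Cross-entropy of one row of logits against the index of the true match."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch where pair i is (image i, text i).

    The temperature is a hypothetical fixed value; CLIP-style models
    typically learn it during training.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should pick its own column.
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should pick its own row.
    text_to_image = sum(
        cross_entropy_row([logits[r][j] for r in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)
```

Correctly paired embeddings yield a loss near zero, while a batch whose embeddings are all identical yields `log(n)`, the loss of a uniform guess over `n` candidates.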

Model Evaluation
================

Machine Evaluation
------------------

Our machine evaluation compared several models using CLIP Similarity (CLIP Sim), Inception Score (IS), and Fréchet Inception Distance (FID), which together assess image quality, diversity, and alignment with the text prompt. On the English dataset (COCO), Taiyi-XL performed best on all three metrics, with the highest CLIP Sim and IS and the lowest FID (lower is better). This indicates that Taiyi-XL generates images closely aligned with English prompts while maintaining high image quality and diversity. On the Chinese dataset (COCO-CN), Taiyi-XL likewise outperformed the other models on all three metrics, demonstrating its bilingual capabilities.
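For reference, the three metrics are commonly defined as follows. These are the standard textbook forms; the exact CLIP checkpoint, Inception network, and sample counts used in this evaluation are not stated here, so the symbols below are generic. $f_I$ and $f_T$ denote the image and text encoders of a CLIP model, $p(y \mid x)$ is the class distribution predicted by an Inception classifier for a generated image $x$, and $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images.

$$
\text{CLIP Sim}(x, t) = \frac{f_I(x) \cdot f_T(t)}{\lVert f_I(x) \rVert \, \lVert f_T(t) \rVert}
$$

$$
\text{IS} = \exp\left( \mathbb{E}_{x \sim p_g} \left[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \right] \right)
$$

$$
\text{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Higher CLIP Sim and IS are better, whereas FID is a distance between feature distributions, so lower is better.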

#### Table: Comparison of different models based on CLIP Sim, IS, and FID across English (COCO) and Chinese (COCO-CN) datasets

| Model | CLIP Sim ($\uparrow$) | FID ($\downarrow$) | IS ($\uparrow$) |
| --- | --- | --- | --- |
| **English Dataset (COCO)** | | | |
| Alt-Diffusion | 0.220 | 27.600 | 31.577 |
| SD-v1.5 | 0.225 | 25.342 | 32.876 |
| SD-XL | 0.231 | 23.887 | 33.793 |
| Taiyi-XL | **0.254** | **22.543** | **35.465** |
| **Chinese Dataset (COCO-CN)** | | | |
| Taiyi-v0.1 | 0.197 | 69.226 | 21.060 |
| Alt-Diffusion | 0.220 | 68.488 | 22.126 |
| Pai-Diffusion | 0.196 | 72.572 | 19.145 |
| Taiyi-XL | **0.225** | **67.675** | **22.965** |

_The best results are marked in **bold**._

Human Preference Evaluation
---------------------------

The figures below compare different models on Chinese and English text-to-image generation. The XL models, such as SD-XL and Taiyi-XL, show significant improvements over the 1.5-generation models such as SD-v1.5 and Alt-Diffusion. DALL-E 3 is known for its vibrant colors and its ability to closely follow text prompts, setting a high standard. Our Taiyi-XL model, with its photographic style, closely matches the performance of Midjourney and excels in bilingual (Chinese and English) text-to-image generation.

Although Taiyi-XL may not yet rival commercial models, it excels among current bilingual open-source models. The gap with commercial models is mainly due to differences in the quantity, quality, and diversity of training data. Our model is trained exclusively on copyright-compliant image-text data; we do not use AI-generated images, such as those from Midjourney or DALL-E 3. As is well known, copyright remains the biggest challenge for text-to-image and AI-generated content (AIGC) models.

[![](/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/resolve/main/imgs/zh_compare.png)](/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/blob/main/imgs/zh_compare.png)

[![](/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/resolve/main/imgs/en_compare.png)](/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/blob/main/imgs/en_compare.png)

We also evaluated the impact of using Latent Consistency Models (LCM) to accelerate the image generation process. The tests showed that as the number of inference steps decreases, the image quality declines. Extending the generation process to 8 steps generally ensures the quality of the generated images; when limited to a single step, the images mainly display basic outlines and lack finer details. This finding suggests that while LCM can effectively speed up the generation process, a balance must be struck between the number of steps and the desired image quality.

[![Taiyi-XL with LCM](/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/resolve/main/imgs/lcm.png)](/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/blob/main/imgs/lcm.png)

Citation
--------



If you use this resource in your work, please cite our paper:

    @misc{wu2024taiyidiffusionxl,
          title={Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support}, 
          author={Xiaojun Wu and Dixiang Zhang and Ruyi Gan and Junyu Lu and Ziwei Wu and Renliang Sun and Jiaxing Zhang and Pingjian Zhang and Yan Song},
          year={2024},
          eprint={2401.14688},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
    

    @article{fengshenbang,
      author    = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
      title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
      journal   = {CoRR},
      volume    = {abs/2209.02970},
      year      = {2022}
    }

## Model overview

`Taiyi-Stable-Diffusion-XL-3.5B` is a bilingual text-to-image model developed by IDEA-CCNL and built on Stable-Diffusion-XL. Earlier Chinese text-to-image models were only moderately effective; Taiyi-XL focuses on improving Chinese text-to-image generation while retaining English proficiency, addressing the challenges of bilingual language processing.

Training the Taiyi-Diffusion-XL model involved three stages. First, a high-quality dataset of image-text pairs was created, with large vision-language models generating accurate captions to enrich it. Second, the vocabulary and position encoding of a pre-trained English CLIP model were expanded to better support Chinese and longer texts, using contrastive training on a large-scale bilingual dataset. Finally, the text encoder of Stable-Diffusion-XL was replaced with this bilingual encoder, and multi-resolution, aspect-ratio-variant training was conducted on the prepared dataset.
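Multi-resolution, aspect-ratio-variant training is often implemented with aspect-ratio bucketing: images are grouped into a small set of resolutions with roughly equal pixel counts, so each batch shares one shape. The sketch below illustrates that general idea only; whether Taiyi-XL uses this exact scheme is not stated here, and the bucket parameters (target area, 64-pixel step, side limits) are hypothetical.

```python
import math


def make_buckets(target_area=1024 * 1024, step=64, min_side=512, max_side=2048):
    """Enumerate (width, height) buckets whose area stays at or below target_area.

    All parameter defaults are illustrative assumptions.
    """
    buckets = []
    for width in range(min_side, max_side + 1, step):
        # Largest height (a multiple of `step`) that keeps the area within budget.
        height = (target_area // width) // step * step
        if min_side <= height <= max_side:
            buckets.append((width, height))
    return buckets


def nearest_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's, in log space."""
    log_ratio = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - log_ratio))


# Example: assign a 1920x1080 image to a bucket before resizing and cropping.
buckets = make_buckets()
assigned = nearest_bucket(1920, 1080, buckets)
```

Comparing ratios in log space treats "twice as wide" and "twice as tall" symmetrically, which avoids a bias toward landscape buckets.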

Similar models include the [Taiyi-Stable-Diffusion-1B-Chinese-v0.1](https://aimodels.fyi/models/huggingFace/taiyi-stable-diffusion-1b-chinese-v01-idea-ccnl), which was the first open-source Chinese Stable Diffusion model, and [AltDiffusion](https://aimodels.fyi/models/huggingFace/altdiffusion-baai), a bilingual text-to-image diffusion model developed by BAAI.

## Model inputs and outputs

### Inputs
- **Prompt**: A text description of the desired image, which can be in English or Chinese.

### Outputs
- **Image**: An image generated from the input prompt.
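The prompt-in, image-out interface above might be exercised as in the following sketch. It assumes that the checkpoint can be loaded with the `diffusers` library through `DiffusionPipeline.from_pretrained`, that a CUDA GPU is available, and that half precision is acceptable; none of these are confirmed here. The helper name, step count, and default sizes are illustrative.

```python
def snap_to_multiple_of_8(value: int) -> int:
    """Round a side length down to a multiple of 8, as SDXL-style pipelines require."""
    return max(8, (value // 8) * 8)


def generate(prompt: str, width: int = 1024, height: int = 1024, steps: int = 30,
             model_id: str = "IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B"):
    """Generate one image from an English or Chinese prompt.

    Assumption: `model_id` resolves to a diffusers-compatible repository.
    Heavy imports are kept local so the helper above works without a GPU stack.
    """
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    result = pipe(
        prompt=prompt,
        width=snap_to_multiple_of_8(width),
        height=snap_to_multiple_of_8(height),
        num_inference_steps=steps,
    )
    return result.images[0]  # a PIL image


# Usage (not run here; it downloads the model weights):
# image = generate("一只戴着墨镜的猫")  # "a cat wearing sunglasses"
# image.save("cat.png")
```

If the repository turns out not to be in a diffusers layout, the deployment WebUI linked in the introduction is the documented route.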

## Capabilities

The `Taiyi-Stable-Diffusion-XL-3.5B` model excels at generating high-quality, detailed images from both English and Chinese text prompts. It can create a wide range of content, from realistic scenes to fantastical illustrations. The model's bilingual capabilities make it a valuable tool for artists and creators working with both languages.

## What can I use it for?

The `Taiyi-Stable-Diffusion-XL-3.5B` model can be used for a variety of creative and professional applications. Artists and designers can leverage the model to generate concept art, illustrations, and other digital assets. Educators and researchers can use it to explore the capabilities of text-to-image generation and its applications in areas like art, design, and language learning. Developers can integrate the model into creative tools and applications to empower users with powerful image generation capabilities.

## Things to try

Because the `Taiyi-Stable-Diffusion-XL-3.5B` model was trained at multiple resolutions and aspect ratios, it is worth testing high-resolution outputs in non-square formats. Try prompts that describe complex scenes or panoramic views to see how the model handles them. You can also probe specific image types, such as portraits, landscapes, or fantasy scenes, and compare Chinese and English versions of the same prompt to understand the model's strengths and limitations.