HunyuanDiT

Maintainer: Tencent-Hunyuan - Last updated 6/13/2024


Model overview

HunyuanDiT is a multi-resolution diffusion transformer from Tencent-Hunyuan with fine-grained Chinese language understanding. It builds on the DialogGen multi-modal interactive dialogue system to enable text-to-image generation from Chinese prompts.

The model outperforms similar open-source Chinese text-to-image models like Taiyi-Stable-Diffusion-XL-3.5B and AltDiffusion on key evaluation metrics such as CLIP similarity, Inception Score, and FID. It generates high-quality, diverse images that are well-aligned with Chinese text prompts.

Model inputs and outputs

Inputs

  • Text Prompts: Creative, open-ended text descriptions that express the desired image to generate.

Outputs

  • Generated Images: Visually compelling, high-resolution images that correspond to the given text prompt.
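As a rough illustration of this prompt-in, image-out interface, the sketch below uses the `HunyuanDiTPipeline` class from the Hugging Face `diffusers` library. The checkpoint name `Tencent-Hunyuan/HunyuanDiT-Diffusers`, the default size and step count, and the helper names `build_request` and `generate` are assumptions chosen for illustration and are not taken from this page. Running `generate` needs a CUDA GPU plus the `torch` and `diffusers` packages.

```python
from typing import Any, Dict

# Assumed checkpoint name on the Hugging Face Hub; not stated on this page.
MODEL_ID = "Tencent-Hunyuan/HunyuanDiT-Diffusers"


def build_request(prompt: str, height: int = 1024, width: int = 1024,
                  steps: int = 50) -> Dict[str, Any]:
    """Collect the keyword arguments for one text-to-image call."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": steps,
    }


def generate(prompt: str, out_path: str = "output.png") -> str:
    """Run the pipeline once and save the image (needs a GPU and model weights)."""
    # Heavy imports stay local so build_request works without them installed.
    import torch
    from diffusers import HunyuanDiTPipeline

    pipe = HunyuanDiTPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
    pipe.to("cuda")
    image = pipe(**build_request(prompt)).images[0]
    image.save(out_path)
    return out_path
```

With the weights available, a call such as `generate("一个宇航员在骑马")` (Chinese for "an astronaut riding a horse") would write the result to `output.png`.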

Capabilities

The HunyuanDiT model demonstrates impressive capabilities in Chinese text-to-image generation. It can handle a wide range of prompts, from simple object and scene descriptions to more complex, creative prompts involving fantasy elements, styles, and artistic references. The generated images exhibit detailed, photorealistic rendering as well as vivid, imaginative styles.

What can I use it for?

With its strong performance on Chinese prompts, the HunyuanDiT model opens up exciting possibilities for creative applications targeting Chinese-speaking audiences. Content creators, designers, and AI enthusiasts can leverage this model to generate custom artwork, concept designs, and visualizations for a variety of use cases, such as:

  • Illustrations for publications, websites, and social media
  • Concept art for games, films, and other media
  • Product and packaging design mockups
  • Generative art and experimental digital experiences

The model's multi-resolution capabilities also make it well-suited for use cases requiring different image sizes and aspect ratios.

Things to try

Some interesting things to explore with the HunyuanDiT model include:

  • Experimenting with prompts that combine Chinese and English text to see how the model handles bilingual inputs.
  • Trying out prompts that reference specific artistic styles, genres, or creators to see the model's versatility in emulating different visual aesthetics.
  • Comparing the model's performance to other open-source Chinese text-to-image models, such as the Taiyi-Stable-Diffusion-XL-3.5B and AltDiffusion models.
  • Exploring the potential of the model's multi-resolution capabilities for generating images at different scales and aspect ratios to suit various creative needs.
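The experiments listed above can be organized as a small grid of prompts and image sizes. The following sketch is only one way to lay that out: the prompts, the size list, and the `make_experiments` helper are all hypothetical, and the sizes would need to be checked against what the model release actually supports.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical prompts: Chinese only, English only, and mixed.
PROMPTS: Dict[str, str] = {
    "zh": "水墨画风格的山间小屋",
    "en": "a mountain cabin in ink-wash painting style",
    "mixed": "a cyberpunk city, 水墨画风格",
}

# Hypothetical (height, width) pairs: square, landscape, and portrait.
SIZES: List[Tuple[int, int]] = [(1024, 1024), (768, 1280), (1280, 768)]


def make_experiments(prompts: Dict[str, str],
                     sizes: List[Tuple[int, int]],
                     seed: int = 0) -> List[dict]:
    """Return one run description per (prompt, size) combination."""
    runs = []
    for (lang, prompt), (height, width) in product(prompts.items(), sizes):
        runs.append({
            "name": f"{lang}_{height}x{width}",
            "prompt": prompt,
            "height": height,
            "width": width,
            "seed": seed,  # fixed seed so only prompt and size vary
        })
    return runs


runs = make_experiments(PROMPTS, SIZES)
for run in runs:
    print(run["name"], "->", run["prompt"])
```

Keeping the seed fixed across the grid makes it easier to attribute differences between images to the prompt language or the aspect ratio rather than to sampling noise.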


This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Total Score

349


Related Models

Total Score

48

HunyuanDiT-v1.1

Tencent-Hunyuan

HunyuanDiT-v1.1 is a powerful multi-resolution diffusion transformer developed by Tencent-Hunyuan that demonstrates fine-grained understanding of both English and Chinese. It builds upon the latent diffusion model architecture, using a pre-trained VAE to compress images into a low-dimensional latent space and training a transformer-based diffusion model to generate images from text prompts. The model utilizes a combination of pre-trained bilingual CLIP and multilingual T5 encoders to effectively process text input in both English and Chinese.

Similar models like HunyuanDiT and HunyuanCaptioner also leverage Tencent-Hunyuan's expertise in Chinese language understanding and multi-modal generation. However, HunyuanDiT-v1.1 stands out with its improved image quality, reduced watermarking, and accelerated generation speed.

Model inputs and outputs

Inputs

  • Text prompt: A natural language description of the desired image, which can include details about objects, scenes, styles, and other attributes.

Outputs

  • Generated image: A high-quality, photorealistic image that matches the provided text prompt.

Capabilities

HunyuanDiT-v1.1 demonstrates impressive capabilities in generating diverse and detailed images from text prompts, with a strong understanding of both English and Chinese. It can render a wide range of subjects, from realistic scenes to fantastical concepts, and adapts well to various artistic styles, including photographic, painterly, and abstract. The model's advanced language understanding also allows it to process complex, multi-sentence prompts and maintain image-text consistency across multiple generations.

What can I use it for?

HunyuanDiT-v1.1 can be a powerful tool for a variety of creative and professional applications. Artists and designers can use it to quickly generate concept art, prototypes, or illustrations based on their ideas. Content creators can leverage the model to produce visuals for stories, games, or social media posts. Businesses can explore its potential in areas like product visualization, architectural design, and digital marketing.

Things to try

One interesting aspect of HunyuanDiT-v1.1 is its ability to handle long, detailed text prompts and maintain a strong level of coherence in the generated images. Try providing the model with prompts that describe complex scenes or narratives, and observe how it translates those ideas into visuals. You can also experiment with incorporating Chinese language elements or blending different styles to see the model's versatility.


Updated 9/6/2024

Text-to-Image


Total Score

46

HunyuanDiT-v1.2

Tencent-Hunyuan

HunyuanDiT-v1.2 is a powerful text-to-image diffusion transformer developed by Tencent-Hunyuan. It builds upon their previous HunyuanDiT-v1.1 model, incorporating fine-grained understanding of both English and Chinese language. The model was carefully designed with a novel transformer structure, text encoder, and positional encoding to enable high-quality bilingual image generation. Compared to similar models like Taiyi-Stable-Diffusion-1B-Chinese-EN-v0.1 and Taiyi-Stable-Diffusion-XL-3.5B, HunyuanDiT-v1.2 demonstrates superior performance in a comprehensive human evaluation, setting a new state-of-the-art in Chinese-to-image generation.

Model inputs and outputs

Inputs

  • Text prompt: A textual description of the desired image, which can be in either English or Chinese.

Outputs

  • Generated image: A high-quality image that visually represents the provided text prompt.

Capabilities

HunyuanDiT-v1.2 excels at generating photorealistic images from a wide range of textual prompts, including those containing Chinese elements and long-form descriptions. The model also supports multi-turn text-to-image generation, allowing users to iteratively refine and build upon the initial image.

What can I use it for?

With its advanced bilingual capabilities, HunyuanDiT-v1.2 is well-suited for a variety of applications, such as:

  • Creative content generation: Produce unique, photographic-style artwork and illustrations to enhance creative projects.
  • Localized marketing and advertising: Generate images tailored to Chinese-speaking audiences for more targeted and effective campaigns.
  • Educational and research applications: Leverage the model's fine-grained understanding of language to create visual aids and learning materials.

Things to try

Experiment with HunyuanDiT-v1.2 by generating images from a diverse set of prompts, such as:

  • Prompts that combine Chinese and English elements, like "a cyberpunk-style sports car in the style of traditional Chinese painting"
  • Longer, more detailed prompts that describe complex scenes or narratives
  • Iterative prompts that build upon the previous image, allowing you to refine and expand the generated content

By exploring the model's capabilities with a range of input styles, you can unlock its full potential and uncover novel applications for your projects.


Updated 9/6/2024

Text-to-Image


Total Score

67

HunyuanCaptioner

Tencent-Hunyuan

The HunyuanCaptioner model is an image captioning model developed by Tencent-Hunyuan. It builds upon the LLaVA implementation to generate high-quality image descriptions from a variety of angles, including object description, object relationships, background information, and image style. The model maintains a high degree of image-text consistency, making it well-suited for text-to-image techniques.

Model Inputs and Outputs

The HunyuanCaptioner model takes image files as inputs and generates textual descriptions of the image content. The model supports different prompt templates for generating captions in either Chinese or English, as well as the ability to insert specific knowledge into the captions.

Inputs

  • Image files

Outputs

  • Textual descriptions of the image content
  • Captions in Chinese or English
  • Captions with inserted knowledge

Capabilities

The HunyuanCaptioner model demonstrates strong capabilities in generating detailed and consistent image captions. It can describe the objects in an image, their relationships, the background, and the overall style of the image. The model's performance has been evaluated and compared to other open-source text-to-image models, showing it sets a new state-of-the-art in Chinese-to-image generation.

What Can I Use It For?

The HunyuanCaptioner model can be used in a variety of applications that require generating textual descriptions of images, such as:

  • Automated image captioning for social media or e-commerce platforms
  • Enhancing the accessibility of visual content for visually impaired users
  • Generating captions for educational or training materials
  • Integrating image captioning capabilities into chatbots or virtual assistants

HunyuanDiT, another model developed by Tencent-Hunyuan, is a powerful multi-resolution diffusion transformer that can be used for text-to-image generation.

Things to Try

Some ideas for experimenting with the HunyuanCaptioner model include:

  • Trying different prompt templates to generate captions in various styles or with inserted knowledge
  • Comparing the model's performance on a diverse set of images, including those with complex scenes or unusual subjects
  • Exploring how the model handles multi-turn interactions, where the user can refine or build upon the initial caption
  • Integrating the HunyuanCaptioner into a larger application or system to enhance its capabilities, such as combining it with a DialogGen model for more advanced text-to-image generation


Updated 7/31/2024

Text-to-Image


Total Score

662

HunyuanVideo

tencent

HunyuanVideo: A Systematic Framework For Large Video Generation Model Training

This repo contains PyTorch model definitions, pre-trained weights and inference/sampling code for our paper exploring HunyuanVideo. You can find more visualizations on our project page.

News!!

  • Dec 3, 2024: We release the inference code and model weights of HunyuanVideo.

Open-source Plan

  • HunyuanVideo (Text-to-Video Model)
      • Inference
      • Checkpoints
      • Penguin Video Benchmark
      • Web Demo (Gradio)
      • ComfyUI
      • Diffusers
  • HunyuanVideo (Image-to-Video Model)
      • Inference
      • Checkpoints

Contents

  • HunyuanVideo: A Systematic Framework For Large Video Generation Model Training
  • News!!
  • Open-source Plan
  • Contents
  • Abstract
  • HunyuanVideo Overall Architecture
  • HunyuanVideo Key Features
  • Unified Image and Video Generative Architecture
  • MLLM Text Encoder
  • 3D VAE
  • Prompt Rewrite
  • Comparisons
  • Requirements
  • Dependencies and Installation
  • Installation Guide for Linux
  • Download Pretrained Models
  • Inference
  • Using Command Line
  • More Configurations
  • BibTeX
  • Acknowledgements

Abstract

We present HunyuanVideo, a novel open-source video foundation model that exhibits performance in video generation that is comparable to, if not superior to, leading closed-source models. HunyuanVideo features a comprehensive framework that integrates several key contributions, including data curation, image-video joint model training, and an efficient infrastructure designed to facilitate large-scale model training and inference. Additionally, through an effective strategy for scaling model architecture and dataset, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion diversity, text-video alignment, and generation stability.
According to professional human evaluation results, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and 3 top performing Chinese video generative models. By releasing the code and weights of the foundation model and its applications, we aim to bridge the gap between closed-source and open-source video foundation models. This initiative will empower everyone in the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem.

HunyuanVideo Overall Architecture

HunyuanVideo is trained on a spatially and temporally compressed latent space, which is produced by a Causal 3D VAE. Text prompts are encoded using a large language model and used as the condition. Taking Gaussian noise and the condition as input, our generative model produces an output latent, which is decoded to images or videos through the 3D VAE decoder.

HunyuanVideo Key Features

Unified Image and Video Generative Architecture

HunyuanVideo introduces the Transformer design and employs a Full Attention mechanism for unified image and video generation. Specifically, we use a "Dual-stream to Single-stream" hybrid model design for video generation. In the dual-stream phase, video and text tokens are processed independently through multiple Transformer blocks, enabling each modality to learn its own appropriate modulation mechanisms without interference. In the single-stream phase, we concatenate the video and text tokens and feed them into subsequent Transformer blocks for effective multimodal information fusion. This design captures complex interactions between visual and semantic information, enhancing overall model performance.

MLLM Text Encoder

Some previous text-to-video models typically use pretrained CLIP and T5-XXL as text encoders, where CLIP uses a Transformer Encoder and T5 uses an Encoder-Decoder structure.
In contrast, we utilize a pretrained Multimodal Large Language Model (MLLM) with a Decoder-Only structure as our text encoder, which has the following advantages: (i) Compared with T5, MLLM after visual instruction finetuning has better image-text alignment in the feature space, which alleviates the difficulty of instruction following in diffusion models; (ii) Compared with CLIP, MLLM has demonstrated superior ability in image detail description and complex reasoning; (iii) MLLM can act as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information. In addition, MLLM is based on causal attention, while T5-XXL utilizes bidirectional attention, which produces better text guidance for diffusion models. Therefore, we introduce an extra bidirectional token refiner to enhance the text features.

3D VAE

HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios of video length, space and channel to 4, 8 and 16 respectively. This can significantly reduce the number of tokens for the subsequent diffusion transformer model, allowing us to train on videos at the original resolution and frame rate.

Prompt Rewrite

To address the variability in linguistic style and length of user-provided prompts, we fine-tune the Hunyuan-Large model as our prompt rewrite model to adapt the original user prompt to a model-preferred prompt. We provide two rewrite modes: Normal mode and Master mode, which can be called using different prompts. The Normal mode is designed to enhance the video generation model's comprehension of user intent, facilitating a more accurate interpretation of the instructions provided. The Master mode enhances the description of aspects such as composition, lighting, and camera movement, which leans towards generating videos with a higher visual quality.
However, this emphasis may occasionally result in the loss of some semantic details. The Prompt Rewrite Model can be directly deployed and inferred using the Hunyuan-Large original code. We release the weights of the Prompt Rewrite Model here.

Comparisons

To evaluate the performance of HunyuanVideo, we selected five strong baselines from closed-source video generation models. In total, we utilized 1,533 text prompts, generating an equal number of video samples with HunyuanVideo in a single run. For a fair comparison, we conducted inference only once, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models, ensuring consistent video resolution. Videos were assessed based on three criteria: Text Alignment, Motion Quality and Visual Quality. More than 60 professional evaluators performed the evaluation. Notably, HunyuanVideo demonstrated the best overall performance, particularly excelling in motion quality.

Model                 Open Source  Duration  Text Alignment  Motion Quality  Visual Quality  Overall  Ranking
HunyuanVideo (Ours)   Yes          5s        61.8%           66.5%           95.7%           41.3%    1
CNTopA (API)          No           5s        62.6%           61.7%           95.6%           37.7%    2
CNTopB (Web)          No           5s        60.1%           62.9%           97.7%           37.5%    3
GEN-3 alpha (Web)     No           6s        47.7%           54.7%           97.5%           27.4%    4
Luma1.6 (API)         No           5s        57.6%           44.2%           94.1%           24.8%    6
CNTopC (Web)          No           5s        48.4%           47.2%           96.3%           24.6%    5

Requirements

The following table shows the requirements for running the HunyuanVideo model (batch size = 1) to generate videos:

Model         GPU   Setting (height/width/frame)  Denoising step  GPU Peak Memory
HunyuanVideo  H800  720px1280px129f               30              60G
HunyuanVideo  H800  544px960px129f                30              45G
HunyuanVideo  H20   720px1280px129f               30              60G
HunyuanVideo  H20   544px960px129f                30              45G

  • An NVIDIA GPU with CUDA support is required.
      • We have tested on a single H800/H20 GPU.
      • Minimum: The minimum GPU memory required is 60GB for 720px1280px129f and 45GB for 544px960px129f.
      • Recommended: We recommend using a GPU with 80GB of memory for better generation quality.
  • Tested operating system: Linux

Dependencies and Installation

Begin by cloning the repository:

    git clone https://github.com/tencent/HunyuanVideo
    cd HunyuanVideo

Installation Guide for Linux

We provide an environment.yml file for setting up a Conda environment. Conda's installation instructions are available here.

We recommend CUDA versions 11.8 and 12.0+.

1. Prepare conda environment

    conda env create -f environment.yml

2. Activate the environment

    conda activate HunyuanVideo

3. Install pip dependencies

    python -m pip install -r requirements.txt

4. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)

    python -m pip install git+https://github.com/Dao-AILab/[email protected]

Additionally, HunyuanVideo also provides a pre-built Docker image: docker_hunyuanvideo.

1. Use the following link to download the docker image tar file (For CUDA 12).

    wget https://aivideo.hunyuan.tencent.com/download/HunyuanVideo/hunyuan_video_cu12.tar

2. Import the docker tar file and show the image meta information (For CUDA 12).

    docker load -i hunyuan_video_cu12.tar
    docker image ls

3. Run the container based on the image

    docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged docker_image_tag

Download Pretrained Models

Details on downloading the pretrained models are shown here.

Inference

We list the height/width/frame settings we support in the following table.

Resolution          h/w=9:16         h/w=16:9         h/w=4:3          h/w=3:4          h/w=1:1
540p                544px960px129f   960px544px129f   624px832px129f   832px624px129f   720px720px129f
720p (recommended)  720px1280px129f  1280px720px129f  1104px832px129f  832px1104px129f  960px960px129f

Using Command Line

    cd HunyuanVideo

    python3 sample_video.py \
        --video-size 720 1280 \
        --video-length 129 \
        --infer-steps 30 \
        --prompt "a cat is running, realistic." \
        --flow-reverse \
        --seed 0 \
        --use-cpu-offload \
        --save-path ./results

More Configurations

We list some more useful configurations for easy usage:

Argument              Default    Description
--prompt              None       The text prompt for video generation
--video-size          720 1280   The size of the generated video
--video-length        129        The length of the generated video
--infer-steps         30         The number of steps for sampling
--embedded-cfg-scale  6.0        Embedded classifier-free guidance scale
--flow-shift          9.0        Shift factor for flow matching schedulers
--flow-reverse        False      If reverse, learning/sampling from t=1 -> t=0
--neg-prompt          None       The negative prompt for video generation
--seed                0          The random seed for generating video
--use-cpu-offload     False      Use CPU offload for the model load to save more memory, necessary for high-res video generation
--save-path           ./results  Path to save the generated video

BibTeX

If you find HunyuanVideo useful for your research and applications, please cite using this BibTeX:

    @misc{kong2024hunyuanvideo,
      title={HunyuanVideo: A Systematic Framework For Large Video Generative Models},
      author={Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Junkun Yuan, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yanxin Long, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Yangyu Tao, Qinglin Lu, Songtao Liu, Dax Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, and Jie Jiang, along with Caesar Zhong},
      year={2024},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }

Acknowledgements

We would like to thank the contributors to the SD3, FLUX, Llama, LLaVA, Xtuner, diffusers and HuggingFace repositories, for their open research and exploration.
Additionally, we also thank the Tencent Hunyuan Multimodal team for their help with the text encoder.
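
As a rough illustration of the Inference and 3D VAE sections above, the sketch below checks a requested size against the supported settings, assembles the sample_video.py argument list from the documented flags, and estimates the latent shape from the stated compression ratios (4, 8 and 16). The helper names `latent_shape` and `build_command` are hypothetical, and the `(frames - 1) // 4 + 1` temporal formula is an assumption about how the causal VAE treats the first frame; the text above does not spell it out.

```python
import shlex
from typing import List, Tuple

# Supported (height, width) pairs from the Inference table; all use 129 frames.
SUPPORTED_SIZES = {
    "540p": {(544, 960), (960, 544), (624, 832), (832, 624), (720, 720)},
    "720p": {(720, 1280), (1280, 720), (1104, 832), (832, 1104), (960, 960)},
}

# Compression ratios quoted in the 3D VAE section.
TIME_RATIO, SPACE_RATIO, LATENT_CHANNELS = 4, 8, 16


def latent_shape(frames: int, height: int, width: int) -> Tuple[int, int, int, int]:
    """Estimate the latent shape as (channels, frames, height, width).

    Assumption: the causal VAE encodes the first frame on its own, so the
    temporal length is (frames - 1) // 4 + 1 rather than frames // 4.
    """
    if (frames - 1) % TIME_RATIO or height % SPACE_RATIO or width % SPACE_RATIO:
        raise ValueError("size is not aligned with the compression ratios")
    return (LATENT_CHANNELS, (frames - 1) // TIME_RATIO + 1,
            height // SPACE_RATIO, width // SPACE_RATIO)


def build_command(prompt: str, height: int, width: int, frames: int = 129,
                  steps: int = 30, seed: int = 0,
                  save_path: str = "./results") -> List[str]:
    """Assemble the sample_video.py argument list from the documented flags."""
    if not any((height, width) in sizes for sizes in SUPPORTED_SIZES.values()):
        raise ValueError(f"{height}x{width} is not in the supported table")
    return [
        "python3", "sample_video.py",
        "--video-size", str(height), str(width),
        "--video-length", str(frames),
        "--infer-steps", str(steps),
        "--prompt", prompt,
        "--flow-reverse",
        "--seed", str(seed),
        "--use-cpu-offload",
        "--save-path", save_path,
    ]


cmd = build_command("a cat is running, realistic.", 720, 1280)
print(shlex.join(cmd))               # shell-quoted form of the command
print(latent_shape(129, 720, 1280))  # (16, 33, 90, 160)
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from inside the cloned repository would start a run. The list form avoids shell-quoting problems with prompts that contain spaces or commas.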


Updated 12/7/2024

Video-to-Video