Video-to-Video
=================================

**MS-Vid2Vid-XL** aims to improve the spatiotemporal continuity and resolution of generated videos. It serves as the second stage of I2VGen-XL, producing 720P output, and can also be used for tasks such as text-to-video synthesis and high-quality video transfer. The training data is a large collection of high-definition videos and images whose shortest side is at least 720 pixels, which lets the model upscale low-resolution videos to 1280 \* 720. It accepts videos of almost any resolution, although a 16:9 aspect ratio is preferred.
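
As a rough illustration of the resolution guidance above, the following sketch reports an input's aspect ratio and how far it is from the 1280 \* 720 target. The helper name `describe_input` is hypothetical and not part of ModelScope; it only does arithmetic on the numbers stated in this section.

```python
from fractions import Fraction

# Target output resolution stated in the model description.
TARGET_W, TARGET_H = 1280, 720


def describe_input(width: int, height: int) -> dict:
    """Summarize how an input resolution relates to the 1280x720 target.

    Hypothetical helper for illustration only; it does not call the model.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    aspect = Fraction(width, height)  # reduced to lowest terms automatically
    return {
        "aspect_ratio": f"{aspect.numerator}:{aspect.denominator}",
        "is_16_9": aspect == Fraction(16, 9),
        "scale_w": TARGET_W / width,   # horizontal upscaling factor
        "scale_h": TARGET_H / height,  # vertical upscaling factor
    }


if __name__ == "__main__":
    print(describe_input(640, 360))
    print(describe_input(448, 256))
```

An input whose `scale_w` and `scale_h` differ is not 16:9, so some stretching or cropping would be needed to reach the target; how the pipeline itself handles that case is not specified here.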

![](https://huggingface.co/damo-vilab/MS-Vid2Vid-XL/resolve/main/assets/images/Fig_1.png)  
Fig.1 MS-Vid2Vid-XL

Online demo: [https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary](https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary)

Introduction
-----------------------------------------

**MS-Vid2Vid-XL** and the first stage of I2VGen-XL share the same underlying video latent diffusion model (VLDM). Both use a spatiotemporal UNet (ST-UNet) with the same structure, which is based on our in-house [VideoComposer](https://videocomposer.github.io). For details, see the VideoComposer technical report.

### Code example

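    from modelscope.outputs import OutputKeys
    from modelscope.pipelines import pipeline

    # Placeholders: replace with your own input video and English description.
    VID_PATH = 'path/to/input.mp4'
    TEXT = 'A panda eating bamboo on a rock.'

    # Build the video-to-video pipeline from the published model.
    pipe = pipeline(task='video-to-video', model='damo/Video-to-Video')
    p_input = {
        'video_path': VID_PATH,
        'text': TEXT,
    }

    # Run inference and read the generated video's path from the result.
    output_video_path = pipe(p_input, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
    print(output_video_path)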

### Limitations

**MS-Vid2Vid-XL** has the following known limitations:

*   Distant targets may appear blurry. Providing a descriptive text prompt can mitigate this.
*   Computation is expensive because the model generates 720P videos. The latent space size is 160 \* 90, and a single video takes more than 2 minutes to generate.
*   Only English text input is currently supported, because the training data is limited to English.
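
The latent size quoted above follows from the output resolution by simple arithmetic. A minimal sketch, assuming a spatial downsampling factor of 8 (inferred from 1280 / 160 = 720 / 90 = 8; the factor itself is not stated in this document, and `latent_size` is a hypothetical helper):

```python
# Assumed spatial downsampling factor, inferred from the numbers in the text:
# 1280 / 160 == 720 / 90 == 8.
DOWNSAMPLE = 8


def latent_size(width: int, height: int, factor: int = DOWNSAMPLE) -> tuple:
    """Return the (width, height) of the latent grid for a pixel resolution."""
    if width % factor or height % factor:
        raise ValueError("resolution must be divisible by the downsampling factor")
    return width // factor, height // factor


def latent_positions(width: int, height: int, factor: int = DOWNSAMPLE) -> int:
    """Number of spatial latent positions per frame."""
    w, h = latent_size(width, height, factor)
    return w * h


if __name__ == "__main__":
    print(latent_size(1280, 720))       # (160, 90)
    print(latent_positions(1280, 720))  # 14400
```

Each frame therefore carries 14400 latent positions, which gives a sense of why 720P generation is slow.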

Reference
-----------------------------------------------

    @article{videocomposer2023,
      title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
      author={Wang, Xiang* and Yuan, Hangjie* and Zhang, Shiwei* and Chen, Dayou* and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
      journal={arXiv preprint arXiv:2306.02018},
      year={2023}
    }
    
    @inproceedings{videofusion2023,
      title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
      author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year={2023}
    }
    

License Agreement
---------------------------------------------------

Our code and model weights are available only for personal and academic research use. Commercial use is not currently supported.

## Model Overview

The **MS-Vid2Vid-XL** model aims to improve the spatiotemporal continuity and resolution of generated videos. It serves as the second stage of the I2VGen-XL model, producing 720P videos, and can also be used for tasks such as text-to-video synthesis and high-quality video transfer. MS-Vid2Vid-XL shares its underlying video latent diffusion model (VLDM) with the first stage of I2VGen-XL, and both stages use a spatiotemporal UNet (ST-UNet) of the same structure, which is based on VideoComposer.

## Model Inputs and Outputs

### Inputs
- **Video Path** (`video_path`): Path to the input video to be processed.
- **Text** (`text`): An English text description that guides the video generation.

### Outputs
- **Output Video**: The generated high-resolution video.
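
To make the input contract concrete, here is a small sketch that assembles the dictionary expected by the pipeline call in the code example. The keys `video_path` and `text` come from that example; the helper `build_vid2vid_input` and its checks are hypothetical, and the ASCII test is only a crude stand-in for the English-only limitation.

```python
import os


def build_vid2vid_input(video_path, text: str) -> dict:
    """Assemble the input dict with the 'video_path' and 'text' keys.

    Hypothetical convenience helper; it does not load or run the model.
    """
    prompt = text.strip()
    if not prompt:
        raise ValueError("text must be a non-empty description")
    # Crude heuristic: the model is documented as English-only, so reject
    # prompts containing non-ASCII characters. This is an assumption, not
    # a check performed by the model itself.
    if not prompt.isascii():
        raise ValueError("text should be English (ASCII-only heuristic)")
    return {"video_path": os.fspath(video_path), "text": prompt}


if __name__ == "__main__":
    print(build_vid2vid_input("input.mp4", "  A panda eating bamboo  "))
```

The returned dictionary can be passed as the first argument to the pipeline object, in place of a hand-written `p_input`.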

## Capabilities

MS-Vid2Vid-XL can generate high-definition (720P) and widescreen (16:9 aspect ratio) videos with improved spatiotemporal continuity and texture compared to existing open-source video generation models. The model has been trained on a large dataset of high-quality videos and images, allowing it to produce videos with good semantic consistency, temporal stability, and realistic textures.

## What Can I Use It For?

The **MS-Vid2Vid-XL** model can be used for a variety of applications, such as:

- [Text-to-Video Synthesis](https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary): Generate videos based on text descriptions.
- [High-Quality Video Transfer](https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary): Enhance the resolution and quality of existing low-resolution videos.
- [Video Generation for Media and Entertainment](https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary): Create high-quality video content for films, TV shows, and other media.

## Things to Try

The **MS-Vid2Vid-XL** model generates 720P videos, but two limitations are worth probing. Distant targets can come out blurry, so try a more detailed text description and compare the results. Generating a single video takes more than 2 minutes because of the large latent space (160 \* 90), and a more detailed prompt does not reduce that cost, so plan experiments accordingly.