Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
- URL: http://arxiv.org/abs/2501.08453v1
- Date: Tue, 14 Jan 2025 21:53:11 GMT
- Title: Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
- Authors: Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, Yi Wang, Yuming Jiang, Yaohui Wang, Peng Gao, Xinyuan Chen, Hengjie Li, Dahua Lin, Yu Qiao, Ziwei Liu
- Abstract summary: Vchitect-2.0 is a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation.
By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames.
To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework.
- Score: 89.79067761383855
- License:
- Abstract: We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework that incorporates hybrid parallelism and other memory reduction techniques, enabling efficient training of long video sequences on distributed systems. (3) Additionally, our enhanced data processing pipeline ensures the creation of Vchitect T2V DataVerse, a high-quality million-scale training dataset through rigorous annotation and aesthetic evaluation. Extensive benchmarking demonstrates that Vchitect-2.0 outperforms existing methods in video quality, training efficiency, and scalability, serving as a suitable base for high-fidelity video generation.
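The abstract does not detail the internals of the Multimodal Diffusion Block, so the snippet below is only a minimal PyTorch sketch of the kind of joint text-video attention such a block could use; the module name, dimensions, and layout are illustrative assumptions rather than the released Vchitect-2.0 architecture.
```python
# Minimal sketch of a joint text-video attention block in a DiT-style stack.
# NOT the official Vchitect-2.0 code; names, shapes, and layout are assumptions.
import torch
import torch.nn as nn


class JointTextVideoBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.norm_video = nn.LayerNorm(dim)
        self.norm_text = nn.LayerNorm(dim)
        # Self-attention over the concatenated text+video sequence, so every
        # video token can attend to the prompt and to tokens from other frames.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # video_tokens: (B, T*H*W, dim) flattened spatio-temporal latents
        # text_tokens:  (B, L, dim) encoded prompt tokens
        joint = torch.cat([self.norm_text(text_tokens),
                           self.norm_video(video_tokens)], dim=1)
        attn_out, _ = self.attn(joint, joint, joint, need_weights=False)
        x = torch.cat([text_tokens, video_tokens], dim=1) + attn_out
        x = x + self.mlp(x)
        # Hand only the updated video tokens to the next block.
        return x[:, text_tokens.shape[1]:]
```
In a full system, a stack of such blocks would also take timestep conditioning, and the memory-efficient training the abstract mentions (hybrid parallelism plus other memory-reduction techniques) would shard these long token sequences and the model weights across devices.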
Related papers
- DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training [85.04885553561164]
Diffusion Transformers (DiTs) have shown remarkable performance in modeling and generating high-quality videos.
This paper introduces DSV, a novel framework designed to accelerate and scale the training of video DiTs.
arXiv Detail & Related papers (2025-02-11T14:39:59Z)
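The summary above does not describe DSV's actual algorithm, so the snippet below only illustrates the general idea of exploiting attention sparsity, here with a hypothetical keep-top-k-scores-per-query rule; a real implementation would skip the pruned entries entirely rather than masking a dense score matrix.
```python
# Illustrative top-k sparse attention (hypothetical; not the DSV algorithm).
# Masking a dense score matrix does not save compute by itself; it only shows
# the sparsity pattern a training-time kernel would exploit.
import torch


def topk_sparse_attention(q, k, v, keep: int = 64):
    # q, k, v: (batch, seq_len, dim)
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale   # (B, S, S)
    keep = min(keep, scores.shape[-1])
    kth_best = scores.topk(keep, dim=-1).values[..., -1:]   # k-th largest score per query
    scores = scores.masked_fill(scores < kth_best, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    x = torch.randn(2, 256, 64)
    print(topk_sparse_attention(x, x, x, keep=32).shape)  # torch.Size([2, 256, 64])
```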
- VidTok: A Versatile and Open-Source Video Tokenizer [24.018360305535307]
VidTok is a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenizations.
By integrating these advancements, VidTok achieves substantial improvements over existing methods.
arXiv Detail & Related papers (2024-12-17T16:27:11Z)
- Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling [14.450847211200292]
Video understanding has become increasingly important with the rise of multi-modality applications.
We introduce a novel system, C-VUE, to overcome these issues through adaptive state modeling.
C-VUE has three key designs. The first is a long-range history modeling technique that uses a video-aware approach to retain historical video information.
The second is a spatial redundancy reduction technique, which enhances the efficiency of history modeling based on temporal relations.
arXiv Detail & Related papers (2024-10-19T05:50:00Z)
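C-VUE's concrete design is not given in the summary above; the toy sketch below merely illustrates the two listed ideas under assumed mechanics: an exponential-moving-average state as a stand-in for long-range history modeling, and dropping patches that barely changed since the previous frame as a stand-in for temporal-relation-based redundancy reduction.
```python
# Toy streaming-video state under assumed mechanics; not the actual C-VUE system.
import torch


class RunningVideoState:
    def __init__(self, dim: int, decay: float = 0.95, change_thresh: float = 0.1):
        self.state = torch.zeros(dim)   # running long-range history summary
        self.prev = None                # previous frame's patch features
        self.decay = decay
        self.change_thresh = change_thresh

    def update(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (num_patches, dim) features of the current frame
        if self.prev is not None:
            # Spatial redundancy reduction: keep only patches that changed.
            change = (patches - self.prev).norm(dim=-1)
            kept = patches[change > self.change_thresh]
        else:
            kept = patches
        self.prev = patches
        if kept.numel() > 0:
            # Exponential moving average as a cheap long-range history model.
            self.state = self.decay * self.state + (1 - self.decay) * kept.mean(dim=0)
        return self.state
```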
- T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design [79.7289790249621]
Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals.
We highlight the crucial importance of tailoring datasets to specific learning objectives.
We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver.
arXiv Detail & Related papers (2024-10-08T04:30:06Z)
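The summary above only says that motion guidance is folded into the ODE solver, without giving the formulation; the sketch below shows the generic shape of such a step, with the motion_guidance term and its weighting left as hypothetical placeholders.
```python
# Generic guided ODE sampling step (hypothetical; not the T2V-Turbo-v2 recipe).
import torch


def guided_euler_step(x, t, dt, velocity_model, motion_guidance, guidance_scale=1.0):
    """One Euler step of dx/dt = v(x, t), nudged by an auxiliary guidance term."""
    v = velocity_model(x, t)          # base probability-flow velocity
    g = motion_guidance(x, t)         # e.g. gradient of a motion prior (assumed)
    return x + dt * (v + guidance_scale * g)


if __name__ == "__main__":
    velocity = lambda x, t: -x                    # stand-in model
    guidance = lambda x, t: torch.zeros_like(x)   # stand-in guidance signal
    x = torch.randn(1, 4, 8, 8)
    x = guided_euler_step(x, 1.0, -0.1, velocity, guidance)
    print(x.shape)
```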
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
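To make the "reducing the length of visual tokens" point concrete, here is a back-of-the-envelope count with assumed downsampling factors; the actual VidVAE configuration is not given in the summary above.
```python
# Why compressing video before the DiT matters: token-count arithmetic with
# assumed (not xGen-VideoSyn-1's actual) patch size and downsampling factors.
frames, height, width = 64, 512, 512
patch = 2                        # DiT patch size (assumed)
spatial_ds, temporal_ds = 8, 4   # assumed VAE downsampling factors

raw_tokens = frames * (height // patch) * (width // patch)
latent_tokens = (frames // temporal_ds) \
    * (height // (spatial_ds * patch)) * (width // (spatial_ds * patch))

print(raw_tokens)                   # 4194304 patch tokens with no compression
print(latent_tokens)                # 16384 tokens after spatio-temporal compression
print(raw_tokens // latent_tokens)  # 256x shorter sequence for self-attention
```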
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such a multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z)
- Boost Video Frame Interpolation via Motion Adaptation [73.42573856943923]
Video frame interpolation (VFI) is a challenging task that aims to generate intermediate frames between two consecutive frames in a video.
Existing learning-based VFI methods have achieved great success, but they still suffer from limited generalization ability.
We propose a novel optimization-based VFI method that can adapt to unseen motions at test time.
arXiv Detail & Related papers (2023-06-24T10:44:02Z)
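The summary above says the method optimizes at test time to adapt to unseen motions but does not spell out the objective; the sketch below uses a generic self-supervised proxy (observed middle frames of consecutive triplets) that stands in for whatever loss the cited paper actually uses.
```python
# Generic test-time adaptation loop for frame interpolation (illustrative loss,
# step count, and learning rate; not the exact recipe of the cited paper).
import torch


def adapt_and_interpolate(model, frames, steps: int = 10, lr: float = 1e-5):
    """frames: list of at least three consecutive observed frames (tensors)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = 0.0
        for i in range(1, len(frames) - 1):
            # The observed middle frame of each triplet gives free supervision
            # for adapting the model to this clip's motion.
            pred = model(frames[i - 1], frames[i + 1])
            loss = loss + (pred - frames[i]).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        # Synthesize an unseen frame between the first two observed frames.
        return model(frames[0], frames[1])
```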
- Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions [31.4943447481144]
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream tasks.
Our model achieves new state-of-the-art results on 10 video-language understanding tasks and 2 novel text-to-visual generation tasks.
arXiv Detail & Related papers (2021-11-19T17:36:01Z)