ContentV: Efficient Training of Video Generation Models with Limited Compute
- URL: http://arxiv.org/abs/2506.05343v2
- Date: Wed, 11 Jun 2025 15:48:38 GMT
- Title: ContentV: Efficient Training of Video Generation Models with Limited Compute
- Authors: Wenfeng Lin, Renjie Chen, Boyuan Liu, Shiyue Yan, Ruoyu Feng, Jiangchuan Wei, Yichen Zhang, Yimeng Zhou, Chao Feng, Jiao Ran, Qi Wu, Zuotao Liu, Mingyu Guo,
- Abstract summary: ContentV is a text-to-video model that generates diverse, high-quality videos across multiple resolutions and durations from text prompts.<n>It achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks.
- Score: 16.722018026516867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io.
Related papers
- AMD-Hummingbird: Towards an Efficient Text-to-Video Model [12.09360569154206]
Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions.<n>Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment.<n>We propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning.
arXiv Detail & Related papers (2025-03-24T11:13:33Z) - Pretrained Image-Text Models are Secretly Video Captioners [38.66202065611397]
We find that an image-based model can be repurposed to outperform several specialised video captioning systems.<n>Our adapted model demonstrates top tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX.<n>From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning.
arXiv Detail & Related papers (2025-02-19T01:53:03Z) - Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models [89.79067761383855]
Vchitect-2.0 is a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation.<n>By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames.<n>To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework.
arXiv Detail & Related papers (2025-01-14T21:53:11Z) - Movie Gen: A Cast of Media Foundation Models [133.41504332082667]
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio.<n>We show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image.
arXiv Detail & Related papers (2024-10-17T16:22:46Z) - SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large
Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature generativearity and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z) - UniVL: A Unified Video and Language Pre-Training Model for Multimodal
Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone.
We develop two pre-training strategies, stage by stage pre-training (StagedP) and enhanced video representation (EnhancedV) to make the training process of the UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.