ContentV: Efficient Training of Video Generation Models with Limited Compute
        - URL: http://arxiv.org/abs/2506.05343v2
 - Date: Wed, 11 Jun 2025 15:48:38 GMT
 - Title: ContentV: Efficient Training of Video Generation Models with Limited Compute
 - Authors: Wenfeng Lin, Renjie Chen, Boyuan Liu, Shiyue Yan, Ruoyu Feng, Jiangchuan Wei, Yichen Zhang, Yimeng Zhou, Chao Feng, Jiao Ran, Qi Wu, Zuotao Liu, Mingyu Guo, 
 - Abstract summary: ContentV is a text-to-video model that generates diverse, high-quality videos across multiple resolutions and durations from text prompts. It achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks.
 - Score: 16.722018026516867
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io. 
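
The report credits much of this efficiency to training with a flow-matching objective. As a rough illustration only (the released code at the link above is authoritative), the following is a minimal PyTorch sketch of one flow-matching training step for a text-conditioned video latent model; the function name, tensor shapes, and `model` call signature are assumptions, not ContentV's implementation.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, text_emb):
    """One flow-matching training step (illustrative sketch, not ContentV's exact recipe).

    model:    predicts a velocity field v(x_t, t, text) for video latents x_t
    x1:       clean video latents, assumed shape (B, C, T, H, W)
    text_emb: text-prompt embeddings used as conditioning
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)            # Gaussian noise endpoint of the path
    t = torch.rand(b, device=x1.device)  # uniform timesteps in [0, 1]
    t_ = t.view(b, 1, 1, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * x1       # linear interpolation between noise and data
    target_v = x1 - x0                   # constant velocity along that path
    pred_v = model(xt, t, text_emb)      # network predicts the velocity
    return F.mse_loss(pred_v, target_v)  # regress prediction onto the true velocity
```

Under this reading, the multi-stage strategy would reuse the same objective across stages (for example at growing resolutions and durations), with the RLHF stage applied afterwards; those specifics are described in the report rather than sketched here.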
 
       
      
        Related papers
        - AMD-Hummingbird: Towards an Efficient Text-to-Video Model [12.09360569154206]
Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. We propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning.
arXiv Detail & Related papers (2025-03-24T11:13:33Z)
- Pretrained Image-Text Models are Secretly Video Captioners [38.66202065611397]
We find that an image-based model can be repurposed to outperform several specialised video captioning systems. Our adapted model demonstrates top-tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX. From a resource-optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning.
arXiv Detail & Related papers (2025-02-19T01:53:03Z)
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models [89.79067761383855]
Vchitect-2.0 is a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames. To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework.
arXiv Detail & Related papers (2025-01-14T21:53:11Z)
- Movie Gen: A Cast of Media Foundation Models [133.41504332082667]
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image.
arXiv Detail & Related papers (2024-10-17T16:22:46Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all with a Transformer backbone (a minimal illustrative sketch of this layout follows the list).
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
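
As a rough illustration of the four-component layout the UniVL entry describes (two single-modal encoders, a cross encoder, and a decoder on a Transformer backbone), here is a minimal, hypothetical PyTorch sketch; the layer counts, dimensions, and use of torch.nn Transformer modules are assumptions, not UniVL's actual configuration.

```python
import torch
import torch.nn as nn

class UniVLStyleModel(nn.Module):
    """Minimal sketch of a UniVL-style layout: two single-modal encoders,
    a cross-modal encoder, and a decoder (all sizes are illustrative)."""

    def __init__(self, vocab_size=30522, d_model=768, nhead=12, video_feat_dim=1024):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # project pre-extracted video features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)   # single-modal text encoder
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # single-modal video encoder
        self.cross_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # joint cross encoder
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)        # generation decoder
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, video_feats, target_ids):
        t = self.text_encoder(self.text_embed(text_ids))          # encode text tokens
        v = self.video_encoder(self.video_proj(video_feats))      # encode video features
        joint = self.cross_encoder(torch.cat([t, v], dim=1))      # fuse the two modalities
        out = self.decoder(self.text_embed(target_ids), joint)    # decode conditioned on the fusion
        return self.lm_head(out)                                  # token logits for generation
```

The sketch only mirrors the component layout; it does not reproduce the specific encoders, features, or the StagedP/EnhancedV training strategies described in the paper.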