ContentV: Efficient Training of Video Generation Models with Limited Compute
        - URL: http://arxiv.org/abs/2506.05343v2
 - Date: Wed, 11 Jun 2025 15:48:38 GMT
 - Title: ContentV: Efficient Training of Video Generation Models with Limited Compute
 - Authors: Wenfeng Lin, Renjie Chen, Boyuan Liu, Shiyue Yan, Ruoyu Feng, Jiangchuan Wei, Yichen Zhang, Yimeng Zhou, Chao Feng, Jiao Ran, Qi Wu, Zuotao Liu, Mingyu Guo, 
 - Abstract summary: ContentV is a text-to-video model that generates diverse, high-quality videos across multiple resolutions and durations from text prompts. It achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks.
 - Score: 16.722018026516867
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io. 
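
The report credits much of this efficiency to training with a flow-matching objective. As a rough illustration only (the released code at the link above is authoritative), the following is a minimal PyTorch sketch of one flow-matching training step for a text-conditioned video latent model; the function name, tensor shapes, and `model` call signature are assumptions, not ContentV's implementation.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, text_emb):
    """One flow-matching training step (illustrative sketch, not ContentV's exact recipe).

    model:    predicts a velocity field v(x_t, t, text) for video latents x_t
    x1:       clean video latents, assumed shape (B, C, T, H, W)
    text_emb: text-prompt embeddings used as conditioning
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)            # Gaussian noise endpoint of the path
    t = torch.rand(b, device=x1.device)  # uniform timesteps in [0, 1]
    t_ = t.view(b, 1, 1, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * x1       # linear interpolation between noise and data
    target_v = x1 - x0                   # constant velocity along that path
    pred_v = model(xt, t, text_emb)      # network predicts the velocity
    return F.mse_loss(pred_v, target_v)  # regress prediction onto the true velocity
```

Under this reading, the multi-stage strategy would reuse the same objective across stages (for example at growing resolutions and durations), with the RLHF stage applied afterwards; those specifics are described in the report rather than sketched here.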
 
       
      
        Related papers
        - AMD-Hummingbird: Towards an Efficient Text-to-Video Model [12.09360569154206]
Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. We propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning.
arXiv Detail & Related papers (2025-03-24T11:13:33Z)
- Pretrained Image-Text Models are Secretly Video Captioners [38.66202065611397]
We find that an image-based model can be repurposed to outperform several specialised video captioning systems. Our adapted model demonstrates top-tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX. From a resource-optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning.
arXiv Detail & Related papers (2025-02-19T01:53:03Z)
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models [89.79067761383855]
Vchitect-2.0 is a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames. To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework.
arXiv Detail & Related papers (2025-01-14T21:53:11Z)
- Movie Gen: A Cast of Media Foundation Models [133.41504332082667]
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image.
arXiv Detail & Related papers (2024-10-17T16:22:46Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all with a Transformer backbone (a minimal illustrative sketch of this layout follows the list).
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
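
As a rough illustration of the four-component layout the UniVL entry describes (two single-modal encoders, a cross encoder, and a decoder on a Transformer backbone), here is a minimal, hypothetical PyTorch sketch; the layer counts, dimensions, and use of torch.nn Transformer modules are assumptions, not UniVL's actual configuration.

```python
import torch
import torch.nn as nn

class UniVLStyleModel(nn.Module):
    """Minimal sketch of a UniVL-style layout: two single-modal encoders,
    a cross-modal encoder, and a decoder (all sizes are illustrative)."""

    def __init__(self, vocab_size=30522, d_model=768, nhead=12, video_feat_dim=1024):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # project pre-extracted video features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)   # single-modal text encoder
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # single-modal video encoder
        self.cross_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # joint cross encoder
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)        # generation decoder
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, video_feats, target_ids):
        t = self.text_encoder(self.text_embed(text_ids))          # encode text tokens
        v = self.video_encoder(self.video_proj(video_feats))      # encode video features
        joint = self.cross_encoder(torch.cat([t, v], dim=1))      # fuse the two modalities
        out = self.decoder(self.text_embed(target_ids), joint)    # decode conditioned on the fusion
        return self.lm_head(out)                                  # token logits for generation
```

The sketch only mirrors the component layout; it does not reproduce the specific encoders, features, or the StagedP/EnhancedV training strategies described in the paper.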