VC4VG: Optimizing Video Captions for Text-to-Video Generation
- URL: http://arxiv.org/abs/2510.24134v2
- Date: Wed, 29 Oct 2025 19:17:39 GMT
- Title: VC4VG: Optimizing Video Captions for Text-to-Video Generation
- Authors: Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo Zheng, Qin Jin,
- Abstract summary: We introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/alimama-creative/VC4VG to support further research.
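The abstract describes decomposing caption content into multiple dimensions and scoring them with necessity-graded metrics. As an illustration only (the paper's actual metric and dimension names are not given here), a necessity-weighted coverage score over hypothetical dimensions might be sketched as:

```python
def caption_score(coverage, necessity):
    """Necessity-weighted average of per-dimension coverage scores.

    coverage:  dict mapping dimension name -> coverage in [0, 1]
    necessity: dict mapping dimension name -> necessity grade
               (higher = more essential for video reconstruction)
    """
    total_weight = sum(necessity.values())
    if total_weight == 0:
        raise ValueError("necessity grades must not all be zero")
    return sum(coverage[d] * necessity[d] for d in necessity) / total_weight

# Hypothetical dimensions and grades, for illustration only.
score = caption_score(
    coverage={"subject": 1.0, "motion": 0.5, "camera": 0.0},
    necessity={"subject": 3, "motion": 2, "camera": 1},
)
print(round(score, 3))  # weighted mean: (3.0 + 1.0 + 0.0) / 6 = 0.667
```

The weighting simply makes omissions of high-necessity dimensions cost more than omissions of optional ones; VC4VG-Bench's real metrics are defined in the released code.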
Related papers
- Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis [14.980220974022982]
We introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames. We also employ T2V backbones to ensure consistent motion dynamics.
arXiv Detail & Related papers (2025-07-18T08:59:02Z)
- Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization [63.37161241355025]
Video-MSG is a training-free method for T2V generation based on Multimodal planning and Structured noise initialization. It guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models.
arXiv Detail & Related papers (2025-04-11T15:41:43Z)
- VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation [44.05151169366881]
This paper introduces VidCapBench, a video caption evaluation scheme specifically designed for T2V generation. VidCapBench associates each collected video with key information spanning video aesthetics, content, motion, and physical laws. We demonstrate the superior stability and comprehensiveness of VidCapBench compared to existing video captioning evaluation approaches.
arXiv Detail & Related papers (2025-02-18T11:42:17Z)
- Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model [133.01510927611452]
We present Step-Video-T2V, a text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality.
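The stated compression ratios (16x16 spatial, 8x temporal) fix the latent grid size relative to the input video. A minimal sketch of that arithmetic, using hypothetical clip dimensions (the channel count is model-specific and omitted):

```python
def video_vae_latent_shape(frames, height, width, t_ratio=8, s_ratio=16):
    """Latent grid size (T, H, W) for a video VAE with the stated
    compression ratios: 8x temporal, 16x16 spatial.
    Assumes each dimension divides evenly by its ratio."""
    if frames % t_ratio or height % s_ratio or width % s_ratio:
        raise ValueError("dimensions must be divisible by the compression ratios")
    return frames // t_ratio, height // s_ratio, width // s_ratio

# Hypothetical 192-frame clip at 512x512 resolution:
print(video_vae_latent_shape(192, 512, 512))  # (24, 32, 32)
```

Each latent token thus summarizes an 8x16x16 spatio-temporal block of pixels, which is what makes long (up to 204-frame) generation tractable.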
arXiv Detail & Related papers (2025-02-14T15:58:10Z)
- T2VEval: Benchmark Dataset and Objective Evaluation Method for T2V-generated Videos [9.742383920787413]
T2VEval is a multi-branch fusion scheme for text-to-video quality evaluation. It assesses videos across three branches: text-video consistency, realness, and technical quality. T2VEval achieves state-of-the-art performance across multiple metrics.
arXiv Detail & Related papers (2025-01-15T03:11:33Z)
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.