VC4VG: Optimizing Video Captions for Text-to-Video Generation
- URL: http://arxiv.org/abs/2510.24134v2
- Date: Wed, 29 Oct 2025 19:17:39 GMT
- Title: VC4VG: Optimizing Video Captions for Text-to-Video Generation
- Authors: Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo Zheng, Qin Jin,
- Abstract summary: We introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/alimama-creative/VC4VG to support further research.
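The abstract describes decomposing caption content into multiple dimensions and scoring them with necessity-graded metrics. As an illustration only (the paper's actual metric and dimension names are not given here), a necessity-weighted coverage score over hypothetical dimensions might be sketched as:

```python
def caption_score(coverage, necessity):
    """Necessity-weighted average of per-dimension coverage scores.

    coverage:  dict mapping dimension name -> coverage in [0, 1]
    necessity: dict mapping dimension name -> necessity grade
               (higher = more essential for video reconstruction)
    """
    total_weight = sum(necessity.values())
    if total_weight == 0:
        raise ValueError("necessity grades must not all be zero")
    return sum(coverage[d] * necessity[d] for d in necessity) / total_weight

# Hypothetical dimensions and grades, for illustration only.
score = caption_score(
    coverage={"subject": 1.0, "motion": 0.5, "camera": 0.0},
    necessity={"subject": 3, "motion": 2, "camera": 1},
)
print(round(score, 3))  # weighted mean: (3.0 + 1.0 + 0.0) / 6 = 0.667
```

The weighting simply makes omissions of high-necessity dimensions cost more than omissions of optional ones; VC4VG-Bench's real metrics are defined in the released code.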
Related papers
- Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis [14.980220974022982]
We introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames. We also employ T2V backbones to ensure consistent motion dynamics.
arXiv Detail & Related papers (2025-07-18T08:59:02Z)
- Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization [63.37161241355025]
Video-MSG is a training-free method for T2V generation based on Multimodal planning and Structured noise initialization. It guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models.
arXiv Detail & Related papers (2025-04-11T15:41:43Z)
- VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation [44.05151169366881]
This paper introduces VidCapBench, a video caption evaluation scheme specifically designed for T2V generation. VidCapBench associates each collected video with key information spanning video aesthetics, content, motion, and physical laws. We demonstrate the superior stability and comprehensiveness of VidCapBench compared to existing video captioning evaluation approaches.
arXiv Detail & Related papers (2025-02-18T11:42:17Z)
- Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model [133.01510927611452]
We present Step-Video-T2V, a text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality.
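The stated compression ratios (16x16 spatial, 8x temporal) fix the latent grid size relative to the input video. A minimal sketch of that arithmetic, using hypothetical clip dimensions (the channel count is model-specific and omitted):

```python
def video_vae_latent_shape(frames, height, width, t_ratio=8, s_ratio=16):
    """Latent grid size (T, H, W) for a video VAE with the stated
    compression ratios: 8x temporal, 16x16 spatial.
    Assumes each dimension divides evenly by its ratio."""
    if frames % t_ratio or height % s_ratio or width % s_ratio:
        raise ValueError("dimensions must be divisible by the compression ratios")
    return frames // t_ratio, height // s_ratio, width // s_ratio

# Hypothetical 192-frame clip at 512x512 resolution:
print(video_vae_latent_shape(192, 512, 512))  # (24, 32, 32)
```

Each latent token thus summarizes an 8x16x16 spatio-temporal block of pixels, which is what makes long (up to 204-frame) generation tractable.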
arXiv Detail & Related papers (2025-02-14T15:58:10Z)
- T2VEval: Benchmark Dataset and Objective Evaluation Method for T2V-generated Videos [9.742383920787413]
T2VEval is a multi-branch fusion scheme for text-to-video quality evaluation. It assesses videos across three branches: text-video consistency, realness, and technical quality. T2VEval achieves state-of-the-art performance across multiple metrics.
arXiv Detail & Related papers (2025-01-15T03:11:33Z)
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.