CI-VID: A Coherent Interleaved Text-Video Dataset
- URL: http://arxiv.org/abs/2507.01938v1
- Date: Wed, 02 Jul 2025 17:48:01 GMT
- Title: CI-VID: A Coherent Interleaved Text-Video Dataset
- Authors: Yiming Ju, Jijin Hu, Zhengxiong Luo, Haoge Deng, Hanyu Zhao, Li Du, Chengwei Wu, Donglin Hao, Xinlong Wang, Tengfei Pan
- Abstract summary: CI-VID is a dataset that moves beyond isolated text-to-video (T2V) generation toward text-and-video-to-video (TV2V) generation. It contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions. We show that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences.
- Score: 23.93099552431937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to support the modeling of coherent multi-clip video sequences. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated text-to-video (T2V) generation toward text-and-video-to-video (TV2V) generation, enabling models to produce coherent, multi-scene video sequences. CI-VID contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions that capture both the individual content of each clip and the transitions between them, enabling visually and textually grounded generation. To further validate the effectiveness of CI-VID, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences. This facilitates the creation of story-driven content with smooth visual transitions and strong temporal coherence, underscoring the quality and practical utility of the CI-VID dataset. We release the CI-VID dataset and the accompanying code for data construction and evaluation at: https://github.com/ymju-BAAI/CI-VID
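To make the sample structure concrete, here is a minimal Python sketch of what one interleaved CI-VID-style sample could look like as a data structure. The field names and the TV2V helper are illustrative assumptions, not the dataset's actual schema; the released repository defines the real format.

```python
from dataclasses import dataclass, field

@dataclass
class ClipEntry:
    """One clip in an interleaved sample (all field names are assumed)."""
    video_path: str                          # path to the clip file
    clip_caption: str                        # what happens within this clip
    transition_caption: str | None = None    # how it follows the previous clip

@dataclass
class InterleavedSample:
    """A coherent multi-clip sequence with per-clip and transition captions."""
    sample_id: str
    clips: list[ClipEntry] = field(default_factory=list)

    def tv2v_context(self, k: int) -> tuple[list[str], str]:
        """TV2V setup: the first k clips (and their captions) condition
        generation; the caption of clip k prompts the clip to be generated."""
        history = [c.video_path for c in self.clips[:k]]
        next_prompt = self.clips[k].clip_caption
        return history, next_prompt
```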
Related papers
- Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency [32.16082566679126]
We present a unified framework for joint generation of driving videos and LiDAR sequences. We employ a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision.
arXiv Detail & Related papers (2025-06-09T07:20:49Z)
- HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
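"Hierarchically structured codebooks" suggests a multi-level quantization scheme. The sketch below shows generic residual vector quantization under that assumption; it is not HiTVideo's actual tokenizer.

```python
import numpy as np

def residual_vq(latent: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Quantize one latent vector against codebooks ordered coarse to fine.
    Each level encodes the residual left by the level above, yielding a
    hierarchy of discrete tokens per latent."""
    residual = latent.astype(float).copy()
    indices = []
    for codebook in codebooks:                       # codebook shape: (K, D)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                  # nearest code this level
        indices.append(idx)
        residual -= codebook[idx]                    # pass remainder downward
    return indices

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)), rng.normal(size=(512, 8))]
print(residual_vq(rng.normal(size=8), codebooks))    # two token ids, one per level
```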
arXiv Detail & Related papers (2025-03-14T15:36:39Z)
- Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos [15.781862060265519]
CFC-VIDS-1M is a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. We develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms.
arXiv Detail & Related papers (2025-02-28T18:56:35Z)
- Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content [35.02160595617654]
We introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. We employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus.
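A minimal sketch of a VTSS-style filter follows. The sub-metric names, weights, and threshold are illustrative assumptions, not Koala-36M's published recipe; the point is only that several per-video scores are integrated into one suitability score used for filtering.

```python
# Assumed sub-metrics and weights, for illustration only.
VTSS_WEIGHTS = {
    "aesthetic": 0.3,
    "motion": 0.3,
    "text_video_align": 0.4,
}

def vtss(sub_scores: dict[str, float]) -> float:
    """Weighted combination of per-video sub-metric scores in [0, 1]."""
    return sum(VTSS_WEIGHTS[k] * sub_scores[k] for k in VTSS_WEIGHTS)

def keep_video(sub_scores: dict[str, float], threshold: float = 0.6) -> bool:
    """Retain a clip only if its integrated suitability clears the threshold."""
    return vtss(sub_scores) >= threshold

print(keep_video({"aesthetic": 0.8, "motion": 0.5, "text_video_align": 0.7}))
```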
arXiv Detail & Related papers (2024-10-10T17:57:49Z)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
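As back-of-envelope arithmetic for why spatio-temporal compression matters, the sketch below counts visual tokens after a VAE-style downsampling plus DiT patchification. The strides and patch size are assumed, illustrative values, not xGen-VideoSyn-1's published configuration.

```python
def visual_token_count(frames: int, height: int, width: int,
                       t_stride: int = 4, s_stride: int = 8,
                       patch: int = 2) -> int:
    """Token count after VAE-style spatio-temporal compression followed by
    DiT patchification. All strides here are assumptions for illustration."""
    t = frames // t_stride                         # temporal downsampling
    h, w = height // s_stride, width // s_stride   # spatial downsampling
    return t * (h // patch) * (w // patch)         # patchify the latent grid

# 121 frames at 720x1280 -> 30 * 45 * 80 = 108,000 tokens under these
# assumptions, versus 121 * 360 * 640 ≈ 27.9M positions when patchifying
# raw frames directly at the same patch size.
print(visual_token_count(121, 720, 1280))          # 108000
```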
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation [55.57459883629706]
We conduct the first systematic study on compositional text-to-video generation. We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z)
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
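A structural sketch of the cascaded, two-stage idea described above: a base stage that prioritizes semantics at low resolution, followed by a refinement stage for detail and clarity. The stage bodies below are trivial numpy stand-ins (frame tiling and nearest-neighbor upsampling), not the paper's diffusion models; only the pipeline shape is the point.

```python
import numpy as np

def base_stage(image: np.ndarray, prompt: str) -> np.ndarray:
    """Stand-in for the low-resolution base model: tiles the conditioning
    image over time. The real stage is a diffusion model prioritizing
    semantic accuracy and global motion."""
    return np.repeat(image[None, ...], 16, axis=0)   # (T, H, W, C) draft

def refinement_stage(video: np.ndarray, prompt: str) -> np.ndarray:
    """Stand-in for the high-resolution refiner: 2x spatial upsampling.
    The real stage is a second diffusion model restoring detail and clarity."""
    return video.repeat(2, axis=1).repeat(2, axis=2)

def cascaded_generate(image: np.ndarray, prompt: str) -> np.ndarray:
    draft = base_stage(image, prompt)        # decouple: semantics first
    return refinement_stage(draft, prompt)   # then detail at higher resolution

video = cascaded_generate(np.zeros((256, 448, 3), dtype=np.uint8), "a cat")
print(video.shape)  # (16, 512, 896, 3)
```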
arXiv Detail & Related papers (2023-11-07T17:16:06Z)