Related papers: ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access

URL: http://arxiv.org/abs/2511.18382v1
Date: Sun, 23 Nov 2025 10:19:56 GMT
Title: ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access
Authors: Timing Yang, Sucheng Ren, Alan Yuille, Feng Wang
Abstract summary: ViMix-14M is a curated multi-source video-text dataset of around 14 million pairs. ViMix-14M is built by merging diverse open video sources, followed by unified de-duplication and quality filtering. We evaluate the dataset by multimodal retrieval, text-to-video generation, and video question answering tasks.
Score: 16.89068730775312
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-video generation has surged in interest since Sora, yet open-source models still face a data bottleneck: there is no large, high-quality, easily obtainable video-text corpus. Existing public datasets typically require manual YouTube crawling, which yields low usable volume due to link rot and access limits, and raises licensing uncertainty. This work addresses this challenge by introducing ViMix-14M, a curated multi-source video-text dataset of around 14 million pairs that provides crawl-free, download-ready access and long-form, high-quality captions tightly aligned to video. ViMix-14M is built by merging diverse open video sources, followed by unified de-duplication and quality filtering, and a multi-granularity, ground-truth-guided re-captioning pipeline that refines descriptions to better match actions, scenes, and temporal structure. We evaluate the dataset on multimodal retrieval, text-to-video generation, and video question answering tasks, observing consistent improvements over counterpart datasets. We hope this work can help remove the key barrier to training and fine-tuning open-source video foundation models, and provide insights into building high-quality and generalizable video-text datasets.
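The merge, de-duplication, and quality-filtering steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Clip` record, its `content_key` fingerprint, the `quality` score, the `min_quality` threshold, and the choice to filter before de-duplicating are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one video-text pair (not the dataset's real schema)."""
    source: str       # assumed name of the originating open dataset
    content_key: str  # assumed fingerprint used to detect duplicates
    quality: float    # assumed quality score in [0, 1]
    caption: str


def merge_dedup_filter(
    sources: Iterable[Iterable[Clip]], min_quality: float = 0.5
) -> List[Clip]:
    """Merge clips from all sources, drop low-quality clips, then keep
    only the first surviving clip for each content_key."""
    seen = set()
    kept = []
    for clips in sources:
        for clip in clips:
            # Quality filter (assumed threshold): skip clips below min_quality.
            if clip.quality < min_quality:
                continue
            # Unified de-duplication across sources: first occurrence wins.
            if clip.content_key in seen:
                continue
            seen.add(clip.content_key)
            kept.append(clip)
    return kept
```

Filtering before de-duplication means a low-quality copy never blocks a better copy of the same clip from a later source; a real pipeline may order or implement these steps differently.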

Related papers

Beyond Simple Edits: Composed Video Retrieval with Dense Modifications [96.46069692338645]
We introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments. Dense-WebVid-CoVR consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion.
arXiv Detail & Related papers (2025-08-19T17:59:39Z)
UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions [88.66676805439512]
The demand for video applications sets higher requirements for high-quality video generation models. We first propose a high-quality open-sourced UHD-4K text-to-video dataset named UltraVideo. Each video has 9 structured captions with one summarized caption (average of 824 words).
arXiv Detail & Related papers (2025-06-16T16:52:52Z)
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering [13.466266412068475]
We introduce the DocVideoQA task and dataset for the first time, comprising 1454 videos across 23 categories with a total duration of about 828 hours. The dataset is annotated with 154k question-answer pairs generated manually and via GPT, assessing models' comprehension, temporal awareness, and modality integration capabilities. Our method enhances unimodal feature extraction with diverse instruction-tuning data and employs contrastive learning to strengthen modality integration.
arXiv Detail & Related papers (2025-03-20T06:21:25Z)
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content [35.02160595617654]
We introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. We employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus.
arXiv Detail & Related papers (2024-10-10T17:57:49Z)
SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses [58.488812405557]
Video grounding aims to localize specific natural language queries in an untrimmed video. We present a large-scale video grounding dataset named SynopGround. We introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG).
arXiv Detail & Related papers (2024-08-03T05:35:13Z)
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR. The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video. A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description. The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.