VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
- URL: http://arxiv.org/abs/2503.01739v1
- Date: Mon, 03 Mar 2025 17:00:36 GMT
- Title: VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
- Authors: Wenhao Wang, Yi Yang
- Abstract summary: VideoUFO is the first Video dataset specifically curated to align with Users' FOcus in real-world scenarios. VideoUFO comprises over $1.09$ million video clips, each paired with both a brief and a detailed caption. Our experiments reveal that (1) the $16$ current text-to-video models do not achieve consistent performance across all user-focused topics; and (2) a simple model trained on VideoUFO outperforms others on the worst-performing topics.
- Score: 22.782099757385804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-video generative models convert textual prompts into dynamic visual content, offering wide-ranging applications in film production, gaming, and education. However, their real-world performance often falls short of user expectations. One key reason is that these models have not been trained on videos related to some topics users want to create. In this paper, we propose VideoUFO, the first Video dataset specifically curated to align with Users' FOcus in real-world scenarios. Beyond this, VideoUFO also features: (1) minimal ($0.29\%$) overlap with existing video datasets, and (2) videos searched exclusively via YouTube's official API under the Creative Commons license. These two attributes provide future researchers with greater freedom to broaden their training sources. VideoUFO comprises over $1.09$ million video clips, each paired with both a brief and a detailed caption (description). Specifically, through clustering, we first identify $1,291$ user-focused topics from the million-scale real text-to-video prompt dataset, VidProM. Then, we use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both brief and detailed captions for each clip. After verifying that the clips match their specified topics, we are left with about $1.09$ million video clips. Our experiments reveal that (1) the $16$ current text-to-video models do not achieve consistent performance across all user-focused topics; and (2) a simple model trained on VideoUFO outperforms others on the worst-performing topics. The dataset is publicly available at https://huggingface.co/datasets/WenhaoWang/VideoUFO under the CC BY 4.0 License.
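Since the dataset is distributed through the Hugging Face Hub, the snippet below is a minimal sketch of how one might preview its captions with the `datasets` library. The split name ("train") and the caption column names ("brief_caption", "detailed_caption") are illustrative assumptions, not details confirmed by the abstract; the actual schema published at the URL above may differ.

```python
# Minimal sketch: previewing VideoUFO captions via the Hugging Face `datasets` library.
# Assumptions (not confirmed by the abstract): a "train" split exists and the caption
# columns are named "brief_caption" and "detailed_caption"; check the dataset card at
# https://huggingface.co/datasets/WenhaoWang/VideoUFO for the real schema.
from datasets import load_dataset

# Stream records to avoid downloading all ~1.09M clip entries up front.
ds = load_dataset("WenhaoWang/VideoUFO", split="train", streaming=True)

for i, record in enumerate(ds):
    print(record.get("brief_caption"), "|", record.get("detailed_caption"))
    if i == 4:  # preview the first five records only
        break
```

Streaming keeps the preview lightweight; for training, one would typically materialize the split locally and pair each caption with its corresponding video clip.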
Related papers
- HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation [99.6653979969241]
We introduce HOIGen-1M, the first large-scale dataset for HOI Generation, consisting of over one million high-quality videos.
To guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using powerful multimodal large language models (MLLMs).
To obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy.
arXiv Detail & Related papers (2025-03-31T04:30:34Z) - MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation [62.85764872989189]
There is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models.
We present MovieBench: A Hierarchical Movie-Level dataset for Long Video Generation.
The dataset will be public and continuously maintained, aiming to advance the field of long video generation.
arXiv Detail & Related papers (2024-11-22T10:25:08Z) - Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of about 145 words, which is over 10x longer than most video-text datasets.
The accompanying captioning model trained on Vript is capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z) - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models [22.782099757385804]
VidProM is the first large-scale dataset comprising 1.67 Million unique text-to-Video Prompts from real users.
This dataset includes 6.69 million videos generated by four state-of-the-art diffusion models.
We suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models to develop better, more efficient, and safer models.
arXiv Detail & Related papers (2024-03-10T05:40:12Z) - Can Language Models Laugh at YouTube Short-form Videos? [40.47384055149102]
We curate a user-generated dataset of 10K multimodal funny videos from YouTube, called ExFunTube.
Using a video filtering pipeline with GPT-3.5, we verify both verbal and visual elements contributing to humor.
After filtering, we annotate each video with timestamps and text explanations for funny moments.
arXiv Detail & Related papers (2023-10-22T03:01:38Z) - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z) - Knowledge Enhanced Model for Live Video Comment Generation [40.762720398152766]
We propose a knowledge-enhanced generation model inspired by the divergent and informative nature of live video comments.
Our model adopts a pre-trained encoder-decoder framework and incorporates external knowledge.
The MovieLC dataset and our code will be released.
arXiv Detail & Related papers (2023-04-28T07:03:50Z) - Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
arXiv Detail & Related papers (2022-04-07T17:59:02Z) - Visual Semantic Role Labeling for Video Understanding [46.02181466801726]
We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling.
We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event.
We introduce the VidSitu benchmark, a large-scale video understanding data source with $29K$ $10$-second movie clips richly annotated with verbs and semantic roles every $2$ seconds.
arXiv Detail & Related papers (2021-04-02T11:23:22Z) - VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)