UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions
- URL: http://arxiv.org/abs/2506.13691v1
- Date: Mon, 16 Jun 2025 16:52:52 GMT
- Title: UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions
- Authors: Zhucun Xue, Jiangning Zhang, Teng Hu, Haoyang He, Yinan Chen, Yuxuan Cai, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao
- Abstract summary: The demand for video applications sets higher requirements for high-quality video generation models. We first propose a high-quality open-sourced UHD-4K text-to-video dataset named UltraVideo. Each video has 9 structured captions with one summarized caption (average of 824 words).
- Score: 88.66676805439512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of the video dataset (image quality, resolution, and fine-grained captions) greatly influences the performance of video generation models. The growing demand for video applications sets higher requirements for high-quality video generation models, for example, the generation of movie-level Ultra-High Definition (UHD) videos and the creation of 4K short video content. However, existing public datasets cannot support related research and applications. In this paper, we first propose a high-quality open-sourced UHD-4K (22.4% of which is 8K) text-to-video dataset named UltraVideo, which covers a wide range of topics (more than 100 kinds), and each video has 9 structured captions with one summarized caption (an average of 824 words). Specifically, we carefully design a highly automated curation process with four stages to obtain the final high-quality dataset: i) collection of diverse and high-quality video clips; ii) statistical data filtering; iii) model-based data purification; iv) generation of comprehensive, structured captions. In addition, we extend Wan to UltraWan-1K/-4K, which can natively generate high-quality 1K/4K videos with more consistent text controllability, demonstrating the effectiveness of our data curation. We believe this work can make a significant contribution to future research on UHD video generation. The UltraVideo dataset and UltraWan models are available at https://xzc-zju.github.io/projects/UltraVideo.
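The four-stage curation process described in the abstract (collect, filter statistically, purify with models, caption) can be sketched as a simple filter chain. Everything below is an illustrative assumption: the function names, thresholds, and toy quality model are hypothetical and do not come from the paper.

```python
# Hypothetical sketch of a four-stage video curation pipeline, loosely
# following the stages the abstract describes. All thresholds and field
# names are invented for illustration.

def statistical_filter(clips, min_duration=2.0, min_height=2160):
    """Stage ii: drop clips failing simple statistical thresholds
    (here, a minimum duration and UHD resolution, i.e. >= 2160p)."""
    return [c for c in clips
            if c["duration"] >= min_duration and c["height"] >= min_height]

def model_purify(clips, quality_score, threshold=0.8):
    """Stage iii: keep only clips a quality model scores above a threshold."""
    return [c for c in clips if quality_score(c) >= threshold]

def add_captions(clips, captioner):
    """Stage iv: attach structured captions (plus a summary) to each clip."""
    for c in clips:
        c["captions"] = captioner(c)
    return clips

# Toy run with stand-in data and a stand-in quality model.
clips = [
    {"duration": 5.0, "height": 2160, "sharpness": 0.9},
    {"duration": 1.0, "height": 2160, "sharpness": 0.9},  # too short
    {"duration": 6.0, "height": 1080, "sharpness": 0.9},  # not UHD
    {"duration": 8.0, "height": 4320, "sharpness": 0.5},  # low quality score
]
kept = model_purify(statistical_filter(clips), lambda c: c["sharpness"])
kept = add_captions(kept, lambda c: {"summary": "placeholder caption"})
print(len(kept))  # → 1
```

In the real pipeline each stage would involve far heavier machinery (shot detection, learned quality models, a captioning model producing 9 structured captions), but the control flow is the same successive-filtering pattern.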
Related papers
- VideoAuteur: Towards Long Narrative Video Generation [22.915448471769384]
We present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain.
We introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos.
Our method demonstrates substantial improvements in generating visually detailed and semantically aligned videos.
arXiv Detail & Related papers (2025-01-10T18:52:11Z)
- Movie Gen: A Cast of Media Foundation Models [133.41504332082667]
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio.
We show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image.
arXiv Detail & Related papers (2024-10-17T16:22:46Z)
- LVD-2M: A Long-take Video Dataset with Temporally Dense Captions [68.88624389174026]
We introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions.
Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality.
We curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions.
arXiv Detail & Related papers (2024-10-14T17:59:56Z)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- VidGen-1M: A Large-Scale Dataset for Text-to-video Generation [9.726156628112198]
We present VidGen-1M, a superior training dataset for text-to-video models.
This dataset guarantees high-quality videos and detailed captions with excellent temporal consistency.
When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.
arXiv Detail & Related papers (2024-08-05T16:53:23Z)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions [93.29360532845062]
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions.
The series comprises ShareGPT4Video, 40K GPT4V-annotated dense captions of videos with various lengths and sources, developed through a carefully designed data filtering and annotating strategy.
We further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos.
arXiv Detail & Related papers (2024-06-06T17:58:54Z)
- VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models [22.782099757385804]
VidProM is the first large-scale dataset comprising 1.67 million unique text-to-video prompts from real users.
This dataset includes 6.69 million videos generated by four state-of-the-art diffusion models.
We suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models to develop better, more efficient, and safer models.
arXiv Detail & Related papers (2024-03-10T05:40:12Z)
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers [93.65253661843145]
We propose an automatic approach to establish a video dataset with high-quality captions.
Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset.
We then apply multiple cross-modality teacher models to obtain captions for each video.
In this way, we get 70M videos paired with high-quality text captions.
arXiv Detail & Related papers (2024-02-29T18:59:50Z)
- Distilling Vision-Language Models on Millions of Videos [62.92789440875999]
We fine-tune a video-language model from a strong image-language baseline with synthesized instructional data.
The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions.
As a side product, we generate the largest video caption dataset to date.
arXiv Detail & Related papers (2024-01-11T18:59:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.