VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
- URL: http://arxiv.org/abs/2408.02629v1
- Date: Mon, 5 Aug 2024 16:53:23 GMT
- Title: VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
- Authors: Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Hao Li
- Abstract summary: We present VidGen-1M, a superior training dataset for text-to-video models.
This dataset guarantees high-quality videos and detailed captions with excellent temporal consistency.
When used to train a video generation model, this dataset led to experimental results that surpass those obtained with other models.
- Score: 9.726156628112198
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, incurs a high computational load and leaves behind unclean data. As a result, appropriate training datasets for text-to-video models are lacking. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train a video generation model, this dataset led to experimental results that surpass those obtained with other models.
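The abstract describes, but does not detail, a coarse-to-fine curation strategy. The following is a minimal, hypothetical sketch of what such a pipeline can look like; the function names, fields, thresholds, and stage ordering are illustrative assumptions, not VidGen-1M's actual implementation.

```python
# Hypothetical coarse-to-fine video curation sketch (not the paper's code).
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    aesthetic_score: float   # from a cheap image-level scorer (coarse stage)
    ocr_text_ratio: float    # fraction of frames dominated by on-screen text
    scene_cut_count: int     # detected shot cuts inside the clip
    caption: str = ""        # filled in by a captioner in the fine stage

def coarse_filter(clips, min_aesthetic=4.5, max_text_ratio=0.2):
    """Cheap image-model tagging: drop obviously low-quality clips early."""
    return [c for c in clips
            if c.aesthetic_score >= min_aesthetic
            and c.ocr_text_ratio <= max_text_ratio]

def fine_filter(clips, max_cuts=0):
    """More expensive checks on the survivors: keep single-scene clips
    so each clip is temporally consistent."""
    return [c for c in clips if c.scene_cut_count <= max_cuts]

def generate_captions(clips, captioner):
    """Run a (hypothetical) detailed video captioner only on clips that
    survived both filtering stages."""
    for c in clips:
        c.caption = captioner(c.path)
    return clips

def curate(clips, captioner):
    return generate_captions(fine_filter(coarse_filter(clips)), captioner)
```

The ordering reflects the point the abstract makes: cheap image-level tags prune most of the data before the expensive temporal checks and captioning passes run, which keeps the computational load manageable and the retained data clean.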
Related papers
- Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content [35.02160595617654]
Temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality.
We introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality.
arXiv Detail & Related papers (2024-10-10T17:57:49Z) - FaceVid-1K: A Large-Scale High-Quality Multiracial Human Face Video Dataset [15.917564646478628]
We create a high-quality multiracial face collection named FaceVid-1K.
We conduct experiments with several well-established video generation models, including text-to-video, image-to-video, and unconditional video generation.
We obtain the corresponding performance benchmarks and compare them with those of models trained on public datasets to demonstrate the superiority of our dataset.
arXiv Detail & Related papers (2024-09-23T07:27:02Z) - Distilling Vision-Language Models on Millions of Videos [62.92789440875999]
We fine-tune a video-language model from a strong image-language baseline with synthesized instructional data.
The resulting video model, obtained by video-instruction-tuning (VIIT), is then used to auto-label millions of videos, generating high-quality captions.
As a side product, we generate the largest video caption dataset to date.
arXiv Detail & Related papers (2024-01-11T18:59:53Z) - Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z) - In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, in which training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z) - Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z) - LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling [48.283659682112926]
We propose LiteVL, which adapts the pre-trained image-language model BLIP into a video-text model directly on downstream tasks.
We also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text (a hedged sketch of such pooling appears after this list).
arXiv Detail & Related papers (2022-10-21T13:03:49Z) - Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moiré patterns, which appear as color distortions, severely degrade image and video quality when a screen is filmed with a digital camera.
We study how to remove such undesirable moiré patterns in videos, a task known as video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z) - Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image-text and video-text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
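As referenced in the LiteVL entry above, here is a minimal sketch of what non-parametric, text-conditioned pooling over per-frame video embeddings could look like. The function name, cosine-similarity weighting, and temperature value are all assumptions made for illustration; this is not LiteVL's published formulation.

```python
import torch
import torch.nn.functional as F

def text_conditioned_pool(frame_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical non-parametric pooling: reweight per-frame video
    embeddings by their similarity to the text embedding, with no learned
    pooling parameters.

    frame_emb: (num_frames, dim) fine-grained video embeddings
    text_emb:  (dim,)            sentence-level text embedding
    returns:   (dim,)            pooled video embedding
    """
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity of each frame to the query text, turned into pooling weights.
    weights = F.softmax(frame_emb @ text_emb / temperature, dim=0)  # (num_frames,)
    return (weights.unsqueeze(-1) * frame_emb).sum(dim=0)
```

Because the weights come only from similarities between existing embeddings, the pooling adapts to each text query without adding trainable parameters.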