LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
- URL: http://arxiv.org/abs/2410.10816v1
- Date: Mon, 14 Oct 2024 17:59:56 GMT
- Title: LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
- Authors: Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, Xihui Liu
- Abstract summary: We introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions.
Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality.
We curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions.
- Score: 68.88624389174026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large amount of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally-dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.
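The abstract describes a filtering stage that scores candidate clips on scene cuts, dynamic degree, and semantic-level quality, then keeps only long-take clips passing every check. A minimal sketch of that idea is below; the class, function names, and threshold values are illustrative assumptions, not the paper's actual implementation or values.

```python
from dataclasses import dataclass

@dataclass
class ClipMetrics:
    """Precomputed quality metrics for one candidate video clip."""
    duration_s: float        # clip length in seconds
    num_scene_cuts: int      # hard cuts detected inside the clip
    dynamic_degree: float    # motion score, e.g. mean optical-flow magnitude
    semantic_quality: float  # semantic-level quality score in [0, 1]

def is_high_quality_long_take(m: ClipMetrics,
                              min_duration_s: float = 10.0,
                              max_scene_cuts: int = 0,
                              min_dynamic: float = 0.3,
                              min_semantic: float = 0.5) -> bool:
    """Keep clips that are long, cut-free, dynamic, and semantically sound."""
    return (m.duration_s >= min_duration_s
            and m.num_scene_cuts <= max_scene_cuts
            and m.dynamic_degree >= min_dynamic
            and m.semantic_quality >= min_semantic)

clips = [
    ClipMetrics(12.0, 0, 0.8, 0.9),  # long take with large motion -> keep
    ClipMetrics(12.0, 2, 0.8, 0.9),  # contains scene cuts -> drop
    ClipMetrics(6.0, 0, 0.8, 0.9),   # shorter than 10 s -> drop
    ClipMetrics(15.0, 0, 0.1, 0.9),  # nearly static -> drop
]
kept = [c for c in clips if is_high_quality_long_take(c)]
print(len(kept))  # 1
```

In the paper's pipeline the metrics would come from automated analysis (e.g. scene-cut detection and optical-flow-based motion estimation) run over a large pool of source videos; the filter itself reduces to threshold checks like the one above.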
Related papers
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [43.485687038460895]
This paper introduces a Hierarchical visual token Compression (HiCo) method designed for high-fidelity representation.
HiCo capitalizes on the redundancy of visual information in long videos to compress long video context from the clip-level to the video-level.
VideoChat-Flash shows the leading performance on both mainstream long and short video benchmarks at the 2B and 7B model scale.
arXiv Detail & Related papers (2024-12-31T18:01:23Z)
- Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation [2.4240014793575138]
As of this writing, OpenAI's Sora, the current state-of-the-art system, is still limited to producing videos that are up to one minute in length.
In this survey, we examine the current landscape of long video generation, covering techniques like GANs and diffusion models, video generation strategies, large-scale training datasets, quality metrics for evaluating long videos, and future research areas to address the limitations of the existing video generation capabilities.
arXiv Detail & Related papers (2024-12-24T21:24:41Z)
- Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation [16.80010133425332]
We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content.
Presto achieves 78.5% on the VBench Semantic Score and a 100% score on the Dynamic Degree, outperforming existing state-of-the-art video generation methods.
arXiv Detail & Related papers (2024-12-02T09:32:36Z) - MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation [62.85764872989189]
There is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models.
We present MovieBench: A Hierarchical Movie-Level dataset for Long Video Generation.
The dataset will be public and continuously maintained, aiming to advance the field of long video generation.
arXiv Detail & Related papers (2024-11-22T10:25:08Z) - xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z) - Multi-sentence Video Grounding for Long Video Generation [46.363084926441466]
We propose a new idea of Multi-sentence Video Grounding for Long Video Generation.
Our approach seamlessly extends developments in image/video editing, video morphing, personalized generation, and video grounding to long video generation.
arXiv Detail & Related papers (2024-07-18T07:05:05Z) - StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions.
Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z) - Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate clips that are short relative to the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
arXiv Detail & Related papers (2022-04-07T17:59:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.