Generating Long Videos of Dynamic Scenes
- URL: http://arxiv.org/abs/2206.03429v2
- Date: Thu, 9 Jun 2022 06:24:12 GMT
- Title: Generating Long Videos of Dynamic Scenes
- Authors: Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila,
Jaakko Lehtinen, Ming-Yu Liu, Alexei A. Efros, Tero Karras
- Abstract summary: We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
- Score: 66.56925105992472
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a video generation model that accurately reproduces object motion,
changes in camera viewpoint, and new content that arises over time. Existing
video generation methods often fail to produce new content as a function of
time while maintaining consistencies expected in real environments, such as
plausible dynamics and object persistence. A common failure case is for content
to never change due to over-reliance on inductive biases to provide temporal
consistency, such as a single latent code that dictates content for the entire
video. On the other extreme, without long-term consistency, generated videos
may morph unrealistically between different scenes. To address these
limitations, we prioritize the time axis by redesigning the temporal latent
representation and learning long-term consistency from data by training on
longer videos. To this end, we leverage a two-phase training strategy, where we
separately train using longer videos at a low resolution and shorter videos at
a high resolution. To evaluate the capabilities of our model, we introduce two
new benchmark datasets with explicit focus on long-term temporal dynamics.
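As a rough illustration of the two-phase strategy described in the abstract, the sketch below trains a low-resolution generator on long clips and a separate super-resolution network on short high-resolution clips. All module names, tensor sizes, and the reconstruction losses are hypothetical placeholders; the paper's actual training uses adversarial objectives and its own architectures.

```python
# Schematic two-phase training (hypothetical module/variable names; the
# paper's real training uses adversarial losses and different architectures).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowResVideoGenerator(nn.Module):
    """Maps a time-varying latent sequence to a low-resolution video."""
    def __init__(self, latent_dim=64, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(latent_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(128, channels, kernel_size=3, padding=1),
        )

    def forward(self, z):              # z: (B, latent_dim, T, H_lo, W_lo)
        return self.net(z)             # -> (B, 3, T, H_lo, W_lo)

class SuperResNetwork(nn.Module):
    """Upsamples a short low-res segment to a higher spatial resolution."""
    def __init__(self, channels=3, scale=4):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(
            nn.Conv3d(channels, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(128, channels, kernel_size=3, padding=1),
        )

    def forward(self, lo):             # lo: (B, 3, T, H_lo, W_lo)
        up = F.interpolate(lo, scale_factor=(1, self.scale, self.scale),
                           mode="trilinear", align_corners=False)
        return self.net(up)            # -> (B, 3, T, H_lo*scale, W_lo*scale)

# Phase 1: long clips at low resolution, so long-term dynamics are learned
# from data (tiny tensor sizes here just so the sketch runs).
g_lo = LowResVideoGenerator()
opt_lo = torch.optim.Adam(g_lo.parameters(), lr=2e-4)
z = torch.randn(2, 64, 128, 8, 8)                   # time-varying latent, 128 frames
long_lowres_clips = torch.randn(2, 3, 128, 8, 8)    # placeholder real data
loss_lo = F.mse_loss(g_lo(z), long_lowres_clips)    # stand-in for the GAN loss
opt_lo.zero_grad(); loss_lo.backward(); opt_lo.step()

# Phase 2: short clips at high resolution, conditioned on frozen low-res output.
g_hi = SuperResNetwork()
opt_hi = torch.optim.Adam(g_hi.parameters(), lr=2e-4)
with torch.no_grad():
    lo_segment = g_lo(torch.randn(2, 64, 9, 8, 8))  # short 9-frame segment
short_highres_clips = torch.randn(2, 3, 9, 32, 32)  # placeholder real data
loss_hi = F.mse_loss(g_hi(lo_segment), short_highres_clips)
opt_hi.zero_grad(); loss_hi.backward(); opt_hi.step()
```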
Related papers
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
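As a loose illustration of the segment-retrieval idea summarized above (not SALOVA's actual routing architecture), the sketch below scores per-segment embeddings against a query embedding and returns the top-k segments. The embeddings and names are hypothetical stand-ins; a real system would embed segment captions or features and the query with a shared encoder.

```python
# Generic sketch of query-based segment retrieval (not SALOVA's architecture):
# score densely captioned segments against a text query and route only the
# top-k segments to the language model.
import numpy as np

def retrieve_segments(segment_embeddings: np.ndarray,
                      query_embedding: np.ndarray,
                      top_k: int = 3) -> np.ndarray:
    """Return indices of the top_k segments most similar to the query."""
    seg = segment_embeddings / np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = seg @ q                          # cosine similarity per segment
    return np.argsort(scores)[::-1][:top_k]

# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(0)
segments = rng.normal(size=(20, 512))         # 20 segments, 512-dim embeddings
query = rng.normal(size=512)
print(retrieve_segments(segments, query, top_k=3))
```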
- Anchored Diffusion for Video Face Reenactment [17.343307538702238]
We introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos.
We train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance.
During inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame.
arXiv Detail & Related papers (2024-07-21T13:14:17Z)
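The random non-uniform temporal spacing mentioned above can be made concrete with a simple frame sampler; the sketch below shows only that sampling step (the diffusion model and the anchoring mechanism are not shown), and the function name is a hypothetical illustration.

```python
# Illustration of sampling training frames with random non-uniform temporal
# spacing from a long video, keeping normalized timestamps as external
# temporal conditioning. Anchoring and diffusion components are omitted.
import numpy as np

def sample_nonuniform_frames(num_frames_in_video: int,
                             num_frames_per_sample: int,
                             rng: np.random.Generator):
    """Pick sorted, non-uniformly spaced frame indices plus normalized times."""
    idx = rng.choice(num_frames_in_video, size=num_frames_per_sample, replace=False)
    idx = np.sort(idx)
    times = idx / (num_frames_in_video - 1)   # timestamps in [0, 1]
    return idx, times

rng = np.random.default_rng(42)
indices, timestamps = sample_nonuniform_frames(600, 8, rng)
print(indices)       # irregularly spaced frame indices
print(timestamps)    # matching normalized timestamps
```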
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [117.13475564834458]
We propose a new way of calculating self-attention, termed Consistent Self-Attention.
To extend our method to long-range video generation, we introduce a novel semantic space temporal motion prediction module.
By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos.
arXiv Detail & Related papers (2024-05-02T16:25:16Z)
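A hedged sketch of the general idea behind consistency-oriented self-attention: keys and values are extended with tokens shared across the batch so that jointly generated images attend to common content. This is not StoryDiffusion's actual implementation; learned projection matrices and its token-sampling details are omitted.

```python
# Sketch: extend each image's key/value set with tokens sampled from the
# whole batch, so that images generated together share appearance information.
import torch

def consistent_self_attention(x, num_ref_tokens=64):
    """x: (batch, tokens, dim) token sequences for jointly generated images."""
    b, n, d = x.shape
    # Sample a subset of tokens from every image and share it across the batch.
    idx = torch.randint(0, n, (num_ref_tokens,))
    shared = x[:, idx, :].reshape(1, b * num_ref_tokens, d).expand(b, -1, -1)
    q = x                                      # queries: each image's own tokens
    kv = torch.cat([x, shared], dim=1)         # keys/values: own + shared tokens
    attn = torch.softmax(q @ kv.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ kv                           # (batch, tokens, dim)

# Toy usage: 4 images of a "story", 256 tokens each, 128-dim features.
out = consistent_self_attention(torch.randn(4, 256, 128))
print(out.shape)   # torch.Size([4, 256, 128])
```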
- Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
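The fixed-length baseline criticized in the entry above is easy to make concrete. The sketch below implements only that uniform chunking with a placeholder per-clip model; the adaptive, kernel-temporal-segmentation-based tokenizer proposed in the paper is not shown.

```python
# Baseline long-video processing: split the video into fixed-length clips at a
# uniform stride and run a short-form model on each clip. An adaptive tokenizer
# would instead place clip boundaries at detected change points.
import torch

def uniform_clips(video: torch.Tensor, clip_len: int, stride: int):
    """video: (T, C, H, W) -> list of (clip_len, C, H, W) clips."""
    t = video.shape[0]
    return [video[s:s + clip_len] for s in range(0, t - clip_len + 1, stride)]

def short_form_model(clip: torch.Tensor) -> torch.Tensor:
    """Placeholder per-clip feature extractor (mean-pooled pixels)."""
    return clip.mean(dim=(0, 2, 3))            # -> (C,)

video = torch.randn(300, 3, 64, 64)            # 300-frame toy video
features = [short_form_model(c) for c in uniform_clips(video, clip_len=16, stride=16)]
print(len(features), features[0].shape)        # pooled features, one per clip
```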
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
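A toy illustration of the consistency idea summarized above, assuming a hypothetical boundary predictor: if the input video is temporally shifted, the predicted (start, end) boundary should shift with it. This is a generic sketch, not the paper's Equivariant Consistency Regulation Learning module.

```python
# Toy boundary-consistency objective: a temporal crop of the input should move
# the predicted boundary by the same amount. Generic sketch, not the paper's loss.
import torch
import torch.nn.functional as F

def boundary_consistency_loss(predict, video, query, shift_frames, fps=1.0):
    """predict(video, query) -> (start_sec, end_sec) tensor of shape (2,)."""
    pred_orig = predict(video, query)
    pred_aug = predict(video[shift_frames:], query)    # temporally cropped view
    expected = pred_orig - shift_frames / fps          # boundary moves earlier
    return F.mse_loss(pred_aug, expected)

# Placeholder predictor and data, just to show the call pattern.
predict = lambda video, query: torch.tensor([4.0, 9.0])
video = torch.randn(30, 3, 32, 32)                     # 30 frames at 1 fps
loss = boundary_consistency_loss(predict, video, query="a person jumps", shift_frames=2)
print(loss)   # nonzero: the constant placeholder predictor is not shift-equivariant
```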
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.