Temporally Consistent Transformers for Video Generation
- URL: http://arxiv.org/abs/2210.02396v2
- Date: Wed, 31 May 2023 20:19:01 GMT
- Title: Temporally Consistent Transformers for Video Generation
- Authors: Wilson Yan, Danijar Hafner, Stephen James, Pieter Abbeel
- Abstract summary: To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world.
No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies.
We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
- Score: 80.45230642225913
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To generate accurate videos, algorithms have to understand the spatial and
temporal dependencies in the world. Current algorithms enable accurate
predictions over short horizons but tend to suffer from temporal
inconsistencies. When generated content goes out of view and is later
revisited, the model invents different content instead. Despite this severe
limitation, no established benchmarks on complex data exist for rigorously
evaluating video generation with long temporal dependencies. In this paper, we
curate 3 challenging video datasets with long-range dependencies by rendering
walks through 3D scenes of procedural mazes, Minecraft worlds, and indoor
scans. We perform a comprehensive evaluation of current models and observe
their limitations in temporal consistency. Moreover, we introduce the
Temporally Consistent Transformer (TECO), a generative model that substantially
improves long-term consistency while also reducing sampling time. By
compressing its input sequence into fewer embeddings, applying a temporal
transformer, and expanding back using a spatial MaskGit, TECO outperforms
existing models across many metrics. Videos are available on the website:
https://wilson1yan.github.io/teco
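As a rough illustration of the pipeline the abstract describes (compress the frame sequence into fewer embeddings, apply a temporal transformer, expand back with a spatial MaskGit), here is a minimal sketch in PyTorch. It is not TECO's implementation: the module names (CompressEmbed, TemporalTransformer, SpatialDecoder), the shapes, and the choice to compress each frame to a single embedding are illustrative assumptions, and a plain token-grid projection stands in for the spatial MaskGit.

```python
# Minimal TECO-style pipeline sketch, not the authors' implementation:
# per-frame latents are compressed into fewer/smaller embeddings, a causal
# transformer models dependencies over time, and a token-grid decoder
# (standing in for the spatial MaskGit) expands each embedding back out.
import torch
import torch.nn as nn


class CompressEmbed(nn.Module):
    """Downsamples a per-frame latent grid into one smaller embedding (illustrative)."""

    def __init__(self, in_dim=256, out_dim=512, grid=8):
        super().__init__()
        self.pool = nn.Conv2d(in_dim, out_dim, kernel_size=grid)  # grid x grid -> 1 x 1

    def forward(self, z):                       # z: (B, T, C, H, W)
        b, t, _, _, _ = z.shape
        e = self.pool(z.flatten(0, 1))          # (B*T, out_dim, 1, 1)
        return e.flatten(1).view(b, t, -1)      # (B, T, out_dim)


class TemporalTransformer(nn.Module):
    """Causal transformer over the compressed per-frame embeddings."""

    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, e):                       # e: (B, T, dim)
        t = e.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=e.device), diagonal=1)
        return self.encoder(e, mask=causal)


class SpatialDecoder(nn.Module):
    """Expands each temporal hidden state back to a grid of token logits,
    a stand-in for the spatial MaskGit stage."""

    def __init__(self, dim=512, vocab=1024, grid=8):
        super().__init__()
        self.grid, self.vocab = grid, vocab
        self.proj = nn.Linear(dim, grid * grid * vocab)

    def forward(self, h):                       # h: (B, T, dim)
        b, t, _ = h.shape
        return self.proj(h).view(b, t, self.grid, self.grid, self.vocab)


if __name__ == "__main__":
    frames = torch.randn(2, 16, 256, 8, 8)      # toy batch of per-frame latents
    hidden = TemporalTransformer()(CompressEmbed()(frames))
    print(SpatialDecoder()(hidden).shape)       # torch.Size([2, 16, 8, 8, 1024])
```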
Related papers
- MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models [14.024240637175216]
We propose a novel 4D point cloud video understanding backbone based on the recently advanced State Space Models (SSMs).
Specifically, our backbone begins by disentangling space and time in the raw 4D geometries and then establishing spatio-temporal correlations across the video.
Our method achieves an 87.5% memory reduction, a 5.36 times speedup, and much higher accuracy (up to +104%) compared with transformer-based counterparts on MS3D.
arXiv Detail & Related papers (2024-05-23T09:08:09Z)
- Spatial Decomposition and Temporal Fusion based Inter Prediction for Learned Video Compression [59.632286735304156]
We propose a spatial decomposition and temporal fusion based inter prediction method for learned video compression.
With the SDD-based motion model and long short-term temporal fusion, our proposed learned video codec can obtain more accurate inter prediction contexts.
arXiv Detail & Related papers (2024-01-29T03:30:21Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the fact that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- SViTT: Temporal Learning of Sparse Video-Text Transformers [65.93031164906812]
We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention.
SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens (a generic sketch of both ideas appears after this list).
arXiv Detail & Related papers (2023-04-18T08:17:58Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinitely many, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Real-time Online Video Detection with Temporal Smoothing Transformers [4.545986838009774]
A good streaming recognition model captures both long-term dynamics and short-term changes of video.
To address this issue, we reformulate the cross-attention in a video transformer through the lens of the kernel.
We build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead.
arXiv Detail & Related papers (2022-09-19T17:59:02Z)
- Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)
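The SViTT entry above names two concrete mechanisms, so here is a generic toy sketch of both: "edge sparsity" rendered as an attention mask that limits which query-key pairs may interact, and "node sparsity" rendered as dropping the lowest-scoring visual tokens before attention. The window size, the norm-based saliency score, and the keep ratio are illustrative assumptions, not the paper's design.

```python
# Toy illustration of edge sparsity (masked query-key pairs) and node sparsity
# (dropping uninformative tokens); not SViTT's implementation.
import torch
import torch.nn.functional as F


def edge_sparse_attention(q, k, v, window=4):
    """Self-attention where each query may only attend to keys within a local window."""
    n, d = q.shape
    scores = q @ k.T / d ** 0.5                        # (N, N) dense scores
    idx = torch.arange(n)
    allowed = (idx[None, :] - idx[:, None]).abs() <= window
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def node_sparse_tokens(tokens, keep_ratio=0.5):
    """Discards the least informative tokens, scored here by L2 norm
    (a stand-in for a learned saliency score)."""
    keep = max(1, int(tokens.size(0) * keep_ratio))
    scores = tokens.norm(dim=-1)
    kept = scores.topk(keep).indices.sort().values     # preserve original token order
    return tokens[kept]


if __name__ == "__main__":
    x = torch.randn(16, 64)                            # 16 visual tokens, dim 64
    x = node_sparse_tokens(x, keep_ratio=0.5)          # 8 tokens survive
    out = edge_sparse_attention(x, x, x, window=2)
    print(out.shape)                                   # torch.Size([8, 64])
```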
This list is automatically generated from the titles and abstracts of the papers on this site.