Towards End-to-End Generative Modeling of Long Videos with
Memory-Efficient Bidirectional Transformers
- URL: http://arxiv.org/abs/2303.11251v3
- Date: Wed, 31 May 2023 15:02:51 GMT
- Title: Towards End-to-End Generative Modeling of Long Videos with
Memory-Efficient Bidirectional Transformers
- Authors: Jaehoon Yoo, Semin Kim, Doyup Lee, Chiheon Kim, Seunghoon Hong
- Abstract summary: We propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependency in videos.
Our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches.
- Score: 13.355338760884583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive transformers have shown remarkable success in video
generation. However, transformers cannot directly learn long-term dependencies in
videos due to the quadratic complexity of self-attention, and they inherently
suffer from slow inference and error propagation caused by the autoregressive
process. In this paper, we propose
Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of
long-term dependency in videos and fast inference. Based on recent advances in
bidirectional transformers, our method learns to decode the entire
spatio-temporal volume of a video in parallel from partially observed patches.
The proposed transformer achieves linear time complexity in both encoding and
decoding by projecting observable context tokens into a fixed number of latent
tokens and conditioning on them to decode the masked tokens through
cross-attention. Empowered by linear complexity and bidirectional modeling, our
method demonstrates significant improvements over autoregressive transformers in
both quality and speed when generating moderately long videos.
Videos and code are available at https://sites.google.com/view/mebt-cvpr2023 .
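The latent-bottleneck idea described in the abstract can be sketched compactly: latent tokens first cross-attend to the observed context, then the masked tokens cross-attend to the latents, so no attention map ever scales quadratically with the number of video tokens. The code below is a minimal illustration only, not the authors' released implementation; the module name, tensor shapes, and hyperparameters are assumptions for the sake of the example.

```python
# Minimal sketch of a latent-bottleneck encoder/decoder (assumed names and sizes).
import torch
import torch.nn as nn

class LatentBottleneckDecoder(nn.Module):
    def __init__(self, dim=512, num_latents=64, num_heads=8):
        super().__init__()
        # Fixed-size set of learnable latent tokens, independent of video length.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # Encode: latents attend to the observable context tokens.
        self.encode_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Decode: masked-token queries attend to the latents.
        self.decode_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, context_tokens, masked_queries):
        # context_tokens: (B, N_ctx, dim) embeddings of observed patches
        # masked_queries: (B, N_mask, dim) positional queries for masked patches
        B = context_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)
        # Cost ~ num_latents * N_ctx: compress the context into the latents.
        latents, _ = self.encode_attn(latents, context_tokens, context_tokens)
        # Cost ~ N_mask * num_latents: decode all masked tokens in parallel.
        decoded, _ = self.decode_attn(masked_queries, latents, latents)
        return decoded  # (B, N_mask, dim), to be fed to a token-prediction head

# Example: 2 videos, 1024 observed tokens, 3072 masked tokens, 512-dim embeddings.
model = LatentBottleneckDecoder()
out = model(torch.randn(2, 1024, 512), torch.randn(2, 3072, 512))
print(out.shape)  # torch.Size([2, 3072, 512])
```

Because the number of latent tokens is fixed, both cross-attention steps grow linearly with the number of video tokens, which is the complexity property the abstract highlights.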
Related papers
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - SViTT: Temporal Learning of Sparse Video-Text Transformers [65.93031164906812]
We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention.
SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communications between tokens in self-attention, and node sparsity, which discards uninformative visual tokens.
arXiv Detail & Related papers (2023-04-18T08:17:58Z) - Two-Stream Transformer Architecture for Long Video Understanding [5.001789577362836]
This paper introduces an efficient Spatio-Temporal Attention Network (STAN) which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features.
Our proposed approach can classify videos up to two minutes in length on a single GPU, is data efficient, and achieves SOTA performance on several long video understanding tasks.
arXiv Detail & Related papers (2022-08-02T21:03:48Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - Error Correction Code Transformer [92.10654749898927]
We propose to extend for the first time the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths.
We embed each element of the channel output into a high-dimensional representation so that the information of each bit can be processed separately.
The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
arXiv Detail & Related papers (2022-03-27T15:25:58Z) - Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - Transformers are RNNs: Fast Autoregressive Transformers with Linear
Attention [22.228028613802174]
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, they are prohibitively slow for very long sequences.
We make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$, where $N$ is the sequence length.
Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
arXiv Detail & Related papers (2020-06-29T17:55:38Z)
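The associativity trick in the linear-attention entry above can be made concrete with a small sketch. This is a toy illustration under assumptions (an ELU-based feature map standing in for the softmax kernel; not the paper's implementation): regrouping $(\phi(Q)\phi(K)^\top)V$ as $\phi(Q)(\phi(K)^\top V)$ avoids forming the $N \times N$ attention matrix, dropping the cost from $\mathcal{O}(N^2 d)$ to $\mathcal{O}(N d^2)$.

```python
# Toy comparison of quadratic vs. linear grouping of kernelized attention.
import torch

def elu_feature_map(x):
    # Positive feature map commonly used in linear attention (an assumption here).
    return torch.nn.functional.elu(x) + 1.0

def linear_attention(Q, K, V, eps=1e-6):
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)           # (N, d) each
    KV = Kp.transpose(-2, -1) @ V                              # (d, d): O(N d^2)
    Z = Qp @ Kp.sum(dim=-2, keepdim=True).transpose(-2, -1)    # (N, 1) normalizer
    return (Qp @ KV) / (Z + eps)                               # (N, d): O(N d^2)

N, d = 1024, 64
Q, K, V = (torch.randn(N, d) for _ in range(3))
out = linear_attention(Q, K, V)

# Quadratic grouping: same result up to numerics, but builds an N x N matrix.
A = elu_feature_map(Q) @ elu_feature_map(K).transpose(-2, -1)  # (N, N)
ref = (A @ V) / A.sum(dim=-1, keepdim=True)
print(torch.allclose(out, ref, atol=1e-4, rtol=1e-3))  # True
```

The same regrouping also enables constant-memory autoregressive decoding, since the running sums $\phi(K)^\top V$ and $\sum_j \phi(K_j)$ can be updated token by token.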