Video Prediction by Efficient Transformers
- URL: http://arxiv.org/abs/2212.06026v1
- Date: Mon, 12 Dec 2022 16:46:48 GMT
- Title: Video Prediction by Efficient Transformers
- Authors: Xi Ye, Guillaume-Alexandre Bilodeau
- Abstract summary: We present a new family of Transformer-based models for video prediction.
Experiments show that the proposed video prediction models are competitive with more complex state-of-the-art convolutional-LSTM based models.
- Score: 14.685237010856953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video prediction is a challenging computer vision task that has a wide range
of applications. In this work, we present a new family of Transformer-based
models for video prediction. Firstly, an efficient local spatial-temporal
separation attention mechanism is proposed to reduce the complexity of standard
Transformers. Then, a full autoregressive model, a partial autoregressive model
and a non-autoregressive model are developed based on the new efficient
Transformer. The partial autoregressive model performs on par with the full
autoregressive model while offering faster inference. The non-autoregressive
model is faster still and mitigates the quality degradation of its
autoregressive counterparts, but it requires additional parameters and an
extra loss function for learning. Given the same attention mechanism, we
conducted a comprehensive study comparing the three proposed video prediction
variants. Experiments show that the proposed
video prediction models are competitive with more complex state-of-the-art
convolutional-LSTM based models. The source code is available at
https://github.com/XiYe20/VPTR.
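To make the efficiency claim concrete, below is a minimal PyTorch sketch of a spatial-temporal separation attention block, based only on the high-level description in the abstract. The class name, tensor shapes, and the omission of local windowing are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class SeparatedAttention(nn.Module):
    """Illustrative sketch: factor full spatio-temporal attention into a
    spatial pass (within each frame) and a temporal pass (across frames
    at each spatial location). Not the VPTR implementation."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, C), where N = H * W feature-map positions per frame
        B, T, N, C = x.shape
        # Spatial attention: each token attends only within its own frame.
        s = x.reshape(B * T, N, C)
        s, _ = self.spatial(s, s, s)
        x = s.reshape(B, T, N, C)
        # Temporal attention: each spatial location attends across frames.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        t, _ = self.temporal(t, t, t)
        return t.reshape(B, N, T, C).permute(0, 2, 1, 3)
```

Separating the two axes reduces the attention cost from O((T·N)²) for full spatio-temporal attention to roughly O(T·N² + N·T²); the paper's local variant presumably shrinks the spatial term further by restricting attention to windows.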
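The trade-offs among the three decoding variants can also be sketched. The `model` interfaces below are hypothetical stand-ins, not the VPTR API; they only illustrate why non-autoregressive decoding is faster and avoids error feedback.

```python
import torch

@torch.no_grad()
def predict_autoregressive(model, context, n_future):
    # Generate one frame per forward pass and feed it back as input.
    # context: (B, T_ctx, C, H, W); `model` is assumed to return a
    # next-frame estimate for every input position.
    frames = context
    for _ in range(n_future):
        next_frame = model(frames)[:, -1:]  # keep only the newest frame
        frames = torch.cat([frames, next_frame], dim=1)
    # Cost grows with n_future, and prediction errors can accumulate.
    return frames[:, context.shape[1]:]

@torch.no_grad()
def predict_non_autoregressive(model, context, n_future):
    # Predict all future frames in one forward pass, e.g. from learned
    # future-frame queries. Faster and free of error feedback, but, as
    # the abstract notes, training needs extra parameters and an extra
    # loss term. Hypothetical interface.
    return model(context, n_future)
```

A partial autoregressive variant sits between these two extremes; per the abstract it matches the full autoregressive model's quality with faster inference.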
Related papers
- A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR [0.31077024712075796]
Punctuation and word casing prediction are necessary for automatic speech recognition (ASR).
We propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time.
arXiv Detail & Related papers (2024-07-18T04:01:12Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- VPTR: Efficient Transformers for Video Prediction [14.685237010856953]
We propose a new Transformer block for video future frames prediction based on an efficient local spatial-temporal separation attention mechanism.
Based on this new Transformer block, a fully autoregressive video future frames prediction Transformer is proposed.
A non-autoregressive video prediction Transformer is also proposed to increase the inference speed and reduce the accumulated inference errors of its autoregressive counterpart.
arXiv Detail & Related papers (2022-03-29T18:09:09Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition [69.68154370877615]
The non-autoregressive (NAR) models remove the temporal dependency between output tokens and can predict all output tokens in as few as one step.
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that TSNAT achieves performance competitive with the AR model and outperforms many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
- Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
arXiv Detail & Related papers (2021-03-06T18:58:56Z)
- A Log-likelihood Regularized KL Divergence for Video Prediction with A 3D Convolutional Variational Recurrent Network [17.91970304953206]
We introduce a new variational model that extends the recurrent network in two ways for the task of frame prediction.
First, we introduce 3D convolutions inside all modules, including the recurrent model, for future frame prediction, inputting and outputting sequences of video frames at each timestep.
Second, we enhance the latent loss of the variational model by introducing a maximum likelihood estimate in addition to the KL divergence that is commonly used in variational models.
arXiv Detail & Related papers (2020-12-11T05:05:31Z)
- Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition [66.47000813920617]
We propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition.
The proposed model can accurately predict the length of the target sequence and achieve a competitive performance.
The model even achieves a real-time factor of 0.0056, faster than all mainstream speech recognition models.
arXiv Detail & Related papers (2020-05-16T08:27:20Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)