VPTR: Efficient Transformers for Video Prediction
- URL: http://arxiv.org/abs/2203.15836v1
- Date: Tue, 29 Mar 2022 18:09:09 GMT
- Title: VPTR: Efficient Transformers for Video Prediction
- Authors: Xi Ye, Guillaume-Alexandre Bilodeau
- Abstract summary: We propose a new Transformer block for video future frames prediction based on an efficient local spatial-temporal separation attention mechanism.
Based on this new Transformer block, a fully autoregressive video future frames prediction Transformer is proposed.
A non-autoregressive video prediction Transformer is also proposed to increase the inference speed and reduce the accumulated inference errors of its autoregressive counterpart.
- Score: 14.685237010856953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a new Transformer block for video future frames
prediction based on an efficient local spatial-temporal separation attention
mechanism. Based on this new Transformer block, a fully autoregressive video
future frames prediction Transformer is proposed. In addition, a
non-autoregressive video prediction Transformer is also proposed to increase
the inference speed and reduce the accumulated inference errors of its
autoregressive counterpart. In order to avoid the prediction of very similar
future frames, a contrastive feature loss is applied to maximize the mutual
information between predicted and ground-truth future frame features. This work
is the first that makes a formal comparison of the two types of attention-based
video future frames prediction models over different scenarios. The proposed
models reach a performance competitive with more complex state-of-the-art
models. The source code is available at https://github.com/XiYe20/VPTR.
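As a concrete illustration of the two ideas summarized above, the following is a minimal PyTorch sketch, assuming a windowed factorization of attention (local spatial attention inside fixed windows, followed by temporal attention across frames) and an InfoNCE-style formulation of the contrastive feature loss. The class name LocalSpatialTemporalBlock, the window size, the pooling of frames into feature vectors, and the temperature tau are hypothetical choices made for illustration, not the authors' implementation (see the repository above for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSpatialTemporalBlock(nn.Module):
    """Transformer block with factorized attention: local spatial attention
    inside non-overlapping windows of each frame, then temporal attention
    across frames at every spatial location (a sketch of the idea, with
    hypothetical names and hyper-parameters)."""

    def __init__(self, dim: int, heads: int = 4, window: int = 4):
        super().__init__()
        self.window = window
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) feature maps of T frames; H and W divisible by the window size
        B, T, H, W, C = x.shape
        w = self.window
        # Local spatial attention: tokens attend only within their w x w window.
        xs = x.reshape(B * T, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        xs = xs.reshape(-1, w * w, C)
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        xs = xs.reshape(B * T, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = xs.reshape(B, T, H, W, C)
        # Temporal attention: each spatial location attends across the T frames.
        xt = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)
        return x + self.mlp(self.norm3(x))


def contrastive_feature_loss(pred: torch.Tensor, target: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: each predicted frame feature (row of pred, shape (N, D))
    is pulled toward the ground-truth feature of the same frame and pushed away
    from the other frames in the batch, a standard lower bound on their mutual
    information."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / tau          # (N, N) cosine-similarity matrix
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    block = LocalSpatialTemporalBlock(dim=64)
    feats = torch.randn(2, 5, 8, 8, 64)       # 2 clips, 5 frames, 8x8 feature grid
    out = block(feats)                         # same shape as the input
    loss = contrastive_feature_loss(torch.randn(10, 64), torch.randn(10, 64))
    print(out.shape, float(loss))
```

In these terms, the fully autoregressive model would apply such blocks to decode one future frame at a time and feed each prediction back in, whereas the non-autoregressive variant presumably decodes all future-frame queries in parallel, which is consistent with the claimed speed-up and reduced error accumulation.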
Related papers
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
- State-space Decomposition Model for Video Prediction Considering Long-term Motion Trend [3.910356300831074]
We propose a state-space decomposition video prediction model that decomposes the overall video frame generation into deterministic appearance prediction and motion prediction.
We infer the long-term motion trend from conditional frames to guide the generation of future frames that exhibit high consistency with the conditional frames.
arXiv Detail & Related papers (2024-04-17T17:19:48Z)
- STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction [20.701792842768747]
We propose a novel video prediction model, which has infinite-dimensional latent variables over the temporal domain.
Our model is able to achieve temporally continuous prediction, i.e., predicting in an unsupervised way at an arbitrarily high frame rate.
arXiv Detail & Related papers (2023-12-11T16:12:43Z)
- Video Prediction by Efficient Transformers [14.685237010856953]
We present a new family of Transformer-based models for video prediction.
Experiments show that the proposed video prediction models are competitive with more complex state-of-the-art convolutional-LSTM based models.
arXiv Detail & Related papers (2022-12-12T16:46:48Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames.
We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI).
Our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z)
- Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation.
Our framework is based on optimization with a pretrained differentiable video frame interpolation module, without the need for a training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
arXiv Detail & Related papers (2022-06-27T17:03:46Z)
- Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction [55.4498466252522]
We set a new standard of video prediction with orders of magnitude longer prediction time than existing approaches.
Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation.
We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon.
arXiv Detail & Related papers (2021-04-14T08:39:38Z)
- M-LVC: Multiple Frames Prediction for Learned Video Compression [111.50760486258993]
We propose an end-to-end learned video compression scheme for low-latency scenarios.
In our scheme, the motion vector (MV) field is calculated between the current frame and the previous one.
Experimental results show that the proposed method outperforms the existing learned video compression methods for low-latency mode.
arXiv Detail & Related papers (2020-04-21T20:42:02Z)
- Photo-Realistic Video Prediction on Natural Videos of Largely Changing Frames [0.0]
We propose a deep residual network with a hierarchical architecture where each layer makes a prediction of the future state at a different spatial resolution.
We trained our model with adversarial and perceptual loss functions, and evaluated it on a natural video dataset captured by car-mounted cameras.
arXiv Detail & Related papers (2020-03-19T09:06:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.