Local Frequency Domain Transformer Networks for Video Prediction
- URL: http://arxiv.org/abs/2105.04637v1
- Date: Mon, 10 May 2021 19:48:42 GMT
- Title: Local Frequency Domain Transformer Networks for Video Prediction
- Authors: Hafez Farazi, Jan Nogga, Sven Behnke
- Abstract summary: Video prediction is not only of interest for anticipating visual changes in the real world but has, above all, emerged as an unsupervised learning rule.
This paper proposes a fully differentiable building block that can perform all of those tasks separately while maintaining interpretability.
- Score: 24.126513851779936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video prediction is commonly referred to as forecasting future frames of a
video sequence provided several past frames thereof. It remains a challenging
domain as visual scenes evolve according to complex underlying dynamics, such
as the camera's egocentric motion or the distinct motility per individual
object viewed. These are mostly hidden from the observer and manifest as often
highly non-linear transformations between consecutive video frames. Therefore,
video prediction is not only of interest for anticipating visual changes in the
real world but has, above all, emerged as an unsupervised learning rule
targeting the formation and dynamics of the observed environment. Many of the
deep learning-based state-of-the-art models for video prediction utilize some
form of recurrent layers, such as Long Short-Term Memory (LSTM) or Gated
Recurrent Unit (GRU) cells, at their core. Although these models can predict the
future frames, they rely entirely on these recurrent structures to
simultaneously perform three distinct tasks: extracting transformations,
projecting them into the future, and transforming the current frame. In order
to completely interpret the formed internal representations, it is crucial to
disentangle these tasks. This paper proposes a fully differentiable building
block that can perform all of those tasks separately while maintaining
interpretability. We derive the relevant theoretical foundations and showcase
results on synthetic as well as real data. We demonstrate that our method is
readily extended to perform motion segmentation and account for the scene's
composition, and learns to produce reliable predictions in an entirely
interpretable manner by only observing unlabeled video data.
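
The abstract's three-task split (extract the transformation, project it forward, apply it to the current frame) rests on the Fourier shift theorem: a spatial translation becomes a linear phase shift in the frequency domain. Below is a minimal, non-learned NumPy sketch of that idea on local patches; the function name, the non-overlapping 16-pixel tiling, and the constant-velocity extrapolation are illustrative assumptions of this sketch, not details of the authors' differentiable architecture.

```python
import numpy as np

def predict_next_frame(frame_prev, frame_cur, patch=16, eps=1e-8):
    """Phase-based one-step prediction on non-overlapping patches.

    For each patch, the phase ratio between consecutive frames encodes
    the local shift (Fourier shift theorem); reapplying it to the
    current spectrum extrapolates that shift one step forward.
    """
    H, W = frame_cur.shape
    pred = np.zeros_like(frame_cur, dtype=np.float64)
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            # Task 1: extract the transformation between consecutive patches.
            F_prev = np.fft.fft2(frame_prev[y:y+patch, x:x+patch])
            F_cur = np.fft.fft2(frame_cur[y:y+patch, x:x+patch])
            cross = F_cur * np.conj(F_prev)
            phase_diff = cross / (np.abs(cross) + eps)  # unit-magnitude phase ratio
            # Task 2: project it into the future (constant-velocity
            # assumption: the same local shift happens again).
            F_next = F_cur * phase_diff
            # Task 3: transform the current patch into the predicted one.
            pred[y:y+patch, x:x+patch] = np.real(np.fft.ifft2(F_next))
    return pred

# Toy usage: a bright square moving 2 px right per frame; the prediction
# should resemble np.roll(f0, 4, axis=1).
f0 = np.zeros((64, 64))
f0[20:28, 20:28] = 1.0
f1 = np.roll(f0, 2, axis=1)
f2_hat = predict_next_frame(f0, f1)
```

Unlike this fixed-arithmetic sketch, the paper's building block is fully differentiable and trained end to end, which is what allows it to cope with multiple motions, scene composition, and the motion segmentation extension mentioned above.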
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z) - Visual Representation Learning with Stochastic Frame Prediction [90.99577838303297]
This paper revisits the idea of stochastic video generation that learns to capture uncertainty in frame prediction.
We design a framework that trains a frame prediction model to learn temporal information between frames.
We find this architecture allows for combining both objectives in a synergistic and compute-efficient manner.
arXiv Detail & Related papers (2024-06-11T16:05:15Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - Stochastic Video Prediction with Structure and Motion [14.424465835834042]
We propose to factorize video observations into static and dynamic components.
By learning separate distributions of changes in foreground and background, we can decompose the scene into static and dynamic parts.
Our experiments demonstrate that disentangling structure and motion helps video prediction, leading to better future predictions in complex driving scenarios.
arXiv Detail & Related papers (2022-03-20T11:29:46Z) - Video Prediction at Multiple Scales with Hierarchical Recurrent Networks [24.536256844130996]
We propose MSPred, a novel video prediction model able to forecast possible future outcomes at different levels of granularity simultaneously.
By combining spatial and temporal downsampling, MSPred is able to efficiently predict abstract representations over long time horizons.
In our experiments, we demonstrate that our proposed model accurately predicts future video frames as well as other representations on various scenarios.
arXiv Detail & Related papers (2022-03-17T13:08:28Z) - Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework to integrate these complementary attributes to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
arXiv Detail & Related papers (2021-10-22T04:35:58Z) - Learning Semantic-Aware Dynamics for Video Prediction [68.04359321855702]
We propose an architecture and training scheme to predict video frames by explicitly modeling disocclusions.
The appearance of the scene is warped from past frames using the predicted motion in co-visible regions.
arXiv Detail & Related papers (2021-04-20T05:00:24Z) - Motion Segmentation using Frequency Domain Transformer Networks [29.998917158604694]
We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately.
On synthetic data, our approach can outperform widely used video prediction methods such as the Video Ladder Network and Predictive Gated Pyramids.
arXiv Detail & Related papers (2020-04-18T15:05:11Z) - Future Video Synthesis with Object Motion Prediction [54.31508711871764]
Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics.
The appearance of the scene components in the future is predicted by non-rigid deformation of the background and affine transformation of moving objects.
Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state-of-the-art in terms of visual quality and accuracy.
arXiv Detail & Related papers (2020-04-01T16:09:54Z) - Photo-Realistic Video Prediction on Natural Videos of Largely Changing
Frames [0.0]
We propose a deep residual network with a hierarchical architecture in which each layer predicts the future state at a different spatial resolution.
We trained our model with adversarial and perceptual loss functions, and evaluated it on a natural video dataset captured by car-mounted cameras.
arXiv Detail & Related papers (2020-03-19T09:06:06Z)