Stochastic Video Prediction with Structure and Motion
- URL: http://arxiv.org/abs/2203.10528v1
- Date: Sun, 20 Mar 2022 11:29:46 GMT
- Title: Stochastic Video Prediction with Structure and Motion
- Authors: Adil Kaan Akan, Sadra Safadoust, Erkut Erdem, Aykut Erdem, Fatma Güney
- Abstract summary: We propose to factorize video observations into static and dynamic components.
By learning separate distributions of changes in foreground and background, we can decompose the scene into static and dynamic parts.
Our experiments demonstrate that disentangling structure and motion helps video prediction, leading to better future predictions in complex driving scenarios.
- Score: 14.424465835834042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While stochastic video prediction models enable future prediction under
uncertainty, they mostly fail to model the complex dynamics of real-world
scenes. For example, they cannot provide reliable predictions for scenes with a
moving camera and independently moving foreground objects in driving scenarios.
Existing methods fail to fully capture the dynamics of the structured world
because they focus only on changes in pixels. In this paper, we assume that there is
an underlying process creating observations in a video and propose to factorize
it into static and dynamic components. We model the static part based on the
scene structure and the ego-motion of the vehicle, and the dynamic part based
on the remaining motion of the dynamic objects. By learning separate
distributions of changes in foreground and background, we can decompose the
scene into static and dynamic parts and separately model the change in each.
Our experiments demonstrate that disentangling structure and motion helps
stochastic video prediction, leading to better future predictions in complex
driving scenarios on two real-world driving datasets, KITTI and Cityscapes.
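The factorization described in the abstract can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch outline, not the authors' released implementation: the module names and wiring are assumptions. It shows the core idea of learning two separate latent distributions, one for the static change (scene structure and ego-motion) and one for the dynamic change (residual foreground motion), and sampling from both to predict the next step.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Predicts a diagonal Gaussian over a latent change variable."""
    def __init__(self, in_dim: int, z_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

class FactorizedPredictor(nn.Module):
    """Hypothetical sketch of static/dynamic factorization.

    The static head models background change (scene structure and
    ego-motion); the dynamic head models the residual motion of
    foreground objects. All sub-networks are placeholders."""
    def __init__(self, feat_dim: int = 256, z_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.static_head = GaussianHead(feat_dim, z_dim)   # background / ego-motion
        self.dynamic_head = GaussianHead(feat_dim, z_dim)  # foreground motion
        # A real model would decode back to pixels; here we stop at features.
        self.decoder = nn.Sequential(nn.Linear(feat_dim + 2 * z_dim, feat_dim), nn.ReLU())

    def forward(self, frame):
        h = self.encoder(frame)
        z_s, mu_s, lv_s = self.static_head(h)   # sample static latent
        z_d, mu_d, lv_d = self.dynamic_head(h)  # sample dynamic latent
        pred = self.decoder(torch.cat([h, z_s, z_d], dim=-1))
        return pred, (mu_s, lv_s), (mu_d, lv_d)

model = FactorizedPredictor()
pred, static_stats, dynamic_stats = model(torch.randn(4, 3, 64, 64))
```

At training time, each head's (mu, logvar) pair would be regularized with a KL term against its own learned prior, mirroring the separate foreground and background distributions the abstract describes.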
Related papers
- GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis [71.24791230358065]
We introduce a novel framework that empowers 3D Gaussian representations with dynamic scene modeling and future scenario synthesis.
GaussianPrediction can forecast future states from any viewpoint, using video observations of dynamic scenes.
Our framework shows outstanding performance on both synthetic and real-world datasets, demonstrating its efficacy in predicting and rendering future environments.
arXiv Detail & Related papers (2024-05-30T06:47:55Z) - DEMOS: Dynamic Environment Motion Synthesis in 3D Scenes via Local Spherical-BEV Perception [54.02566476357383]
We propose the first Dynamic Environment MOtion Synthesis framework (DEMOS) to predict future motion instantly according to the current scene.
A local spherical-BEV perception of the scene is then used to dynamically update the latent motion for final motion synthesis.
The results show that our method significantly outperforms previous works and handles dynamic environments well.
arXiv Detail & Related papers (2024-03-04T05:38:16Z) - Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z) - SLAMP: Stochastic Latent Appearance and Motion Prediction [14.257878210585014]
Motion is an important cue for video prediction and is often utilized by separating video content into static and dynamic components.
Most previous work utilizing motion is deterministic, but some methods can model the inherent uncertainty of the future.
In this paper, we reason about appearance and motion in the video stochastically by predicting the future based on the motion history.
arXiv Detail & Related papers (2021-08-05T17:52:18Z) - Understanding Object Dynamics for Interactive Image-to-Video Synthesis [8.17925295907622]
We present an approach that learns natural-looking global articulations caused by a local manipulation at a pixel level.
Our generative model learns to infer natural object dynamics as a response to user interaction.
In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos.
arXiv Detail & Related papers (2021-06-21T17:57:39Z) - Unsupervised Video Prediction from a Single Frame by Estimating 3D Dynamic Scene Structure [42.3091008598491]
We develop a model that first estimates the latent 3D structure of the scene, including the segmentation of any moving objects.
It then predicts future frames by simulating the object and camera dynamics, and rendering the resulting views.
Experiments on two challenging datasets of natural videos show that our model can estimate 3D structure and motion segmentation from a single frame.
arXiv Detail & Related papers (2021-06-16T18:00:12Z) - Local Frequency Domain Transformer Networks for Video Prediction [24.126513851779936]
Video prediction is of interest not only for anticipating visual changes in the real world but has, above all, emerged as an unsupervised learning task.
This paper proposes a fully differentiable building block that can perform all of those tasks separately while maintaining interpretability.
arXiv Detail & Related papers (2021-05-10T19:48:42Z) - Learning Semantic-Aware Dynamics for Video Prediction [68.04359321855702]
We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions.
The appearance of the scene is warped from past frames using the predicted motion in co-visible regions.
arXiv Detail & Related papers (2021-04-20T05:00:24Z) - Occlusion resistant learning of intuitive physics from videos [52.25308231683798]
A key ability for artificial systems is to understand physical interactions between objects and to predict future outcomes of a situation.
This ability, often referred to as intuitive physics, has recently received attention, and several methods have been proposed to learn these physical rules from video sequences.
arXiv Detail & Related papers (2020-04-30T19:35:54Z) - Future Video Synthesis with Object Motion Prediction [54.31508711871764]
Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics.
The appearance of the scene components in the future is predicted by non-rigid deformation of the background and affine transformation of moving objects.
Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state-of-the-art in terms of visual quality and accuracy.
arXiv Detail & Related papers (2020-04-01T16:09:54Z)
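The composition mechanism mentioned in the last entry above (non-rigid background deformation plus affine transformation of moving objects) can also be sketched compactly. The snippet below is an illustrative assumption, not the paper's actual pipeline: it moves a foreground layer with a 2x3 affine matrix via grid sampling and composites it over the background with a soft mask.

```python
import torch
import torch.nn.functional as F

def warp_affine(img, theta):
    """Warp a batch of images with per-image 2x3 affine matrices."""
    grid = F.affine_grid(theta, list(img.shape), align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

# Toy inputs; in a real model these would be predicted by networks.
bg = torch.randn(1, 3, 64, 64)            # (possibly deformed) background
fg = torch.randn(1, 3, 64, 64)            # foreground object layer
mask = torch.rand(1, 1, 64, 64)           # soft object mask
theta = torch.tensor([[[1.0, 0.0, 0.1],   # small horizontal translation
                       [0.0, 1.0, 0.0]]])

fg_next = warp_affine(fg, theta)
mask_next = warp_affine(mask, theta)
# Composite: the moved object layer over the background.
frame_next = mask_next * fg_next + (1 - mask_next) * bg
```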