Learning Semantic-Aware Dynamics for Video Prediction
- URL: http://arxiv.org/abs/2104.09762v1
- Date: Tue, 20 Apr 2021 05:00:24 GMT
- Title: Learning Semantic-Aware Dynamics for Video Prediction
- Authors: Xinzhu Bei, Yanchao Yang, Stefano Soatto
- Abstract summary: We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions.
The appearance of the scene is warped from past frames using the predicted motion in co-visible regions.
- Score: 68.04359321855702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an architecture and training scheme to predict video frames by
explicitly modeling dis-occlusions and capturing the evolution of semantically
consistent regions in the video. The scene layout (semantic map) and motion
(optical flow) are decomposed into layers, which are predicted and fused with
their context to generate future layouts and motions. The appearance of the
scene is warped from past frames using the predicted motion in co-visible
regions; dis-occluded regions are synthesized with content-aware inpainting
utilizing the predicted scene layout. The result is a predictive model that
explicitly represents objects and learns their class-specific motion, which we
evaluate on video prediction benchmarks.
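The warp-and-fuse step described above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: the flow convention (per-pixel sampling offsets from the future frame back into the past frame), the binary co-visibility mask, and the layout-conditioned `inpaint_net` module are all hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp a past frame toward the future using predicted optical flow.

    frame: (N, C, H, W) past RGB frame
    flow:  (N, 2, H, W) per-pixel (x, y) sampling offsets, in pixels,
           mapping each future pixel back into the past frame (assumed).
    """
    _, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device),
        torch.arange(w, device=frame.device),
        indexing="ij",
    )
    coords = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # (N, 2, H, W)
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def compose_prediction(past_frame, flow, covisible_mask, layout, inpaint_net):
    """Fuse warped appearance (co-visible) with inpainted content (dis-occluded).

    covisible_mask: (N, 1, H, W) in {0, 1}; 1 where the future pixel is
                    visible in the past frame.
    layout:         (N, K, H, W) predicted semantic layout of the future frame.
    inpaint_net:    hypothetical content-aware inpainting module conditioned
                    on the predicted layout.
    """
    warped = backward_warp(past_frame, flow)
    # Dis-occluded pixels are synthesized from the masked warp plus the layout.
    filled = inpaint_net(torch.cat((warped * covisible_mask,
                                    covisible_mask, layout), dim=1))
    return covisible_mask * warped + (1.0 - covisible_mask) * filled
```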
Related papers
- Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation [54.60804602905519]
Recent methods learn an entangled representation, aiming to model layered scene geometry, motion forecasting, and novel view synthesis together.
Our approach instead disentangles scene geometry from scene motion by lifting the 2D scene to 3D point clouds.
To model future 3D scene motion, we propose a disentangled two-stage approach that initially forecasts ego-motion and subsequently the residual motion of dynamic objects.
arXiv Detail & Related papers (2024-07-31T08:54:50Z)
- Motion and Context-Aware Audio-Visual Conditioned Video Prediction [58.9467115916639]
We decouple the audio-visual conditioned video prediction into motion and appearance modeling.
The multimodal motion estimation predicts future optical flow based on the audio-motion correlation.
We propose context-aware refinement to counter the loss of global appearance context.
arXiv Detail & Related papers (2022-12-09T05:57:46Z)
- Stochastic Video Prediction with Structure and Motion [14.424465835834042]
We propose to factorize video observations into static and dynamic components.
By learning separate distributions of changes in foreground and background, we can decompose the scene into static and dynamic parts.
Our experiments demonstrate that disentangling structure and motion helps video prediction, leading to better future predictions in complex driving scenarios.
arXiv Detail & Related papers (2022-03-20T11:29:46Z)
- Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework that integrates these complementary attributes, global context and local motion, to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
- Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the visual regions corresponding to text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z)
- Local Frequency Domain Transformer Networks for Video Prediction [24.126513851779936]
Video prediction is of interest not only for anticipating visual changes in the real world but has, above all, emerged as an unsupervised learning objective.
This paper proposes a fully differentiable building block that can perform all of those tasks separately while maintaining interpretability.
arXiv Detail & Related papers (2021-05-10T19:48:42Z)
- Future Video Synthesis with Object Motion Prediction [54.31508711871764]
Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics.
The appearance of the scene components in the future is predicted by non-rigid deformation of the background and affine transformation of moving objects.
Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state-of-the-art in terms of visual quality and accuracy.
arXiv Detail & Related papers (2020-04-01T16:09:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and accepts no responsibility for any consequences of its use.