Motion and Context-Aware Audio-Visual Conditioned Video Prediction
- URL: http://arxiv.org/abs/2212.04679v3
- Date: Wed, 20 Sep 2023 11:58:10 GMT
- Title: Motion and Context-Aware Audio-Visual Conditioned Video Prediction
- Authors: Yating Xu, Conghui Hu, Gim Hee Lee
- Abstract summary: We decouple the audio-visual conditioned video prediction into motion and appearance modeling.
The multimodal motion estimation predicts future optical flow based on the audio-motion correlation.
We propose context-aware refinement to address the diminishing of the global appearance context.
- Score: 58.9467115916639
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The existing state-of-the-art method for audio-visual conditioned video
prediction uses the latent codes of the audio-visual frames from a multimodal
stochastic network and a frame encoder to predict the next visual frame.
However, a direct inference of per-pixel intensity for the next visual frame is
extremely challenging because of the high-dimensional image space. To this end,
we decouple the audio-visual conditioned video prediction into motion and
appearance modeling. The multimodal motion estimation predicts future optical
flow based on the audio-motion correlation. The visual branch recalls from the
motion memory built from the audio features to enable better long term
prediction. We further propose context-aware refinement to address the
diminishing of the global appearance context in the long-term continuous
warping. The global appearance context is extracted by the context encoder and
manipulated by motion-conditioned affine transformation before fusion with
features of warped frames. Experimental results show that our method achieves
competitive results on existing benchmarks.
Related papers
- Relevance-guided Audio Visual Fusion for Video Saliency Prediction [23.873134951154704]
We propose a novel relevance-guided audio-visual saliency prediction network dubbedSP.
The Fusion module dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements.
The Multi-scale feature Synergy (MS) module integrates visual features from different encoding stages, enhancing the network's ability to represent objects at various scales.
arXiv Detail & Related papers (2024-11-18T10:42:27Z) - Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework to integrate these complementary attributes to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
arXiv Detail & Related papers (2021-10-22T04:35:58Z) - CCVS: Context-aware Controllable Video Synthesis [95.22008742695772]
presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones.
It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
arXiv Detail & Related papers (2021-07-16T17:57:44Z) - Learning Semantic-Aware Dynamics for Video Prediction [68.04359321855702]
We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions.
The appearance of the scene is warped from past frames using the predicted motion in co-visible regions.
arXiv Detail & Related papers (2021-04-20T05:00:24Z) - Motion-blurred Video Interpolation and Extrapolation [72.3254384191509]
We present a novel framework for deblurring, interpolating and extrapolating sharp frames from a motion-blurred video in an end-to-end manner.
To ensure temporal coherence across predicted frames and address potential temporal ambiguity, we propose a simple, yet effective flow-based rule.
arXiv Detail & Related papers (2021-03-04T12:18:25Z) - Sound2Sight: Generating Visual Dynamics from Sound and Context [36.38300120482868]
We present Sound2Sight, a deep variational framework, that is trained to learn a per frame prior conditioned on a joint embedding of audio and past frames.
To improve the quality and coherence of the generated frames, we propose a multimodal discriminator.
Our experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality.
arXiv Detail & Related papers (2020-07-23T16:57:44Z) - Future Video Synthesis with Object Motion Prediction [54.31508711871764]
Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics.
The appearance of the scene components in the future is predicted by non-rigid deformation of the background and affine transformation of moving objects.
Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state-of-the-art in terms of visual quality and accuracy.
arXiv Detail & Related papers (2020-04-01T16:09:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.