Wide and Narrow: Video Prediction from Context and Motion
- URL: http://arxiv.org/abs/2110.11586v1
- Date: Fri, 22 Oct 2021 04:35:58 GMT
- Authors: Jaehoon Cho, Jiyoung Lee, Changjae Oh, Wonil Song, Kwanghoon Sohn
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video prediction, forecasting the future frames from a sequence of input
frames, is a challenging task since the view changes are influenced by various
factors, such as the global context surrounding the scene and local motion
dynamics. In this paper, we propose a new framework to integrate these
complementary attributes to predict complex pixel dynamics through deep
networks. We present global context propagation networks that iteratively
aggregate the non-local neighboring representations to preserve the contextual
information over the past frames. To capture the local motion pattern of
objects, we also devise local filter memory networks that generate adaptive
filter kernels by storing the prototypical motion of moving objects in the
memory. The proposed framework, utilizing the outputs from both networks, can
address blurry predictions and color distortion. We conduct experiments on the
Caltech Pedestrian and UCF101 datasets and demonstrate state-of-the-art
results. For multi-step prediction in particular, we obtain outstanding
performance in both quantitative and qualitative evaluations.
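To make the global context propagation idea concrete, below is a minimal PyTorch sketch of a single non-local aggregation step over past-frame features (the paper applies such aggregation iteratively). The module name, projections, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# A hedged sketch of non-local context aggregation over past frames.
# All names (GlobalContextPropagation, query/key/value) are assumptions.
import torch
import torch.nn as nn

class GlobalContextPropagation(nn.Module):
    """One step of aggregating non-local neighboring representations
    from past frames into the current frame's features."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, current: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
        # current: (B, C, H, W) features of the latest frame
        # past:    (B, T, C, H, W) features of the T preceding frames
        b, t, c, h, w = past.shape
        q = self.query(current).flatten(2)                        # (B, C/2, HW)
        past_flat = past.reshape(b * t, c, h, w)
        k = self.key(past_flat).reshape(b, t, c // 2, h * w)
        v = self.value(past_flat).reshape(b, t, c, h * w)
        k = k.permute(0, 2, 1, 3).reshape(b, c // 2, t * h * w)   # (B, C/2, THW)
        v = v.permute(0, 2, 1, 3).reshape(b, c, t * h * w)        # (B, C, THW)
        # Every position in the current frame attends to every past position.
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)       # (B, HW, THW)
        ctx = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)
        return current + self.out(ctx)  # residual update of current features
```

Repeating this step across successive time steps would propagate context through the whole input sequence, which is how we read the abstract's "iteratively aggregate" phrasing.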
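The local filter memory can likewise be sketched as an external memory read by attention and decoded into per-pixel filter kernels, applied here with an unfold-based dynamic convolution. Again, the memory size, kernel size, and every name below are assumptions for illustration only.

```python
# A hedged sketch of a memory that stores prototypical motion and emits
# adaptive per-pixel filter kernels; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalFilterMemory(nn.Module):
    def __init__(self, channels: int, slots: int = 64, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # Learnable memory slots holding prototypical motion representations.
        self.memory = nn.Parameter(torch.randn(slots, channels))
        self.to_kernel = nn.Conv2d(channels, kernel_size ** 2, kernel_size=1)

    def forward(self, motion_feat: torch.Tensor, frame_feat: torch.Tensor) -> torch.Tensor:
        # motion_feat: (B, C, H, W) features encoding local motion
        # frame_feat:  (B, C, H, W) features to be filtered adaptively
        b, c, h, w = motion_feat.shape
        q = motion_feat.permute(0, 2, 3, 1).reshape(b, h * w, c)
        attn = torch.softmax(q @ self.memory.t(), dim=-1)          # (B, HW, slots)
        read = (attn @ self.memory).reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Decode the memory read-out into a normalized kernel per pixel.
        kernels = torch.softmax(self.to_kernel(read), dim=1)       # (B, k*k, H, W)
        patches = F.unfold(frame_feat, self.k, padding=self.k // 2)
        patches = patches.reshape(b, c, self.k ** 2, h * w)
        out = (patches * kernels.reshape(b, 1, self.k ** 2, h * w)).sum(dim=2)
        return out.reshape(b, c, h, w)
```

Fusing the outputs of the two modules, e.g., by concatenation followed by a decoder, would then yield the predicted frame; the abstract only states that the framework utilizes both outputs, so this fusion detail is our guess.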
Related papers
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet), equipped with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z)
- Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions [27.112210225969733]
We propose a novel framework for the task of object-centric video prediction, i.e., extracting the structure of a video sequence and modeling object dynamics and interactions from visual observations.
With the goal of learning meaningful object representations, we propose two object-centric video predictor (OCVP) transformer modules, which decouple the processing of temporal dynamics and object interactions.
In our experiments, we show that our object-centric prediction framework with the OCVP predictors outperforms object-agnostic video prediction models on two different datasets.
arXiv Detail & Related papers (2023-02-23T08:29:26Z)
- Motion and Context-Aware Audio-Visual Conditioned Video Prediction [58.9467115916639]
We decouple the audio-visual conditioned video prediction into motion and appearance modeling.
The multimodal motion estimation predicts future optical flow based on the audio-motion correlation.
We propose a context-aware refinement to address the gradual loss of global appearance context.
arXiv Detail & Related papers (2022-12-09T05:57:46Z)
- Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP).
Taking the first frame and a text caption as inputs, this task aims to synthesize the following frames.
To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM).
arXiv Detail & Related papers (2022-10-06T12:43:07Z)
- Video Frame Interpolation with Transformer [55.12620857638253]
We introduce a novel framework that takes advantage of the Transformer to model long-range pixel correlation among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
arXiv Detail & Related papers (2022-05-15T09:30:28Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies via self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- Local Frequency Domain Transformer Networks for Video Prediction [24.126513851779936]
Video prediction is of interest not only for anticipating visual changes in the real world but, above all, as an unsupervised learning rule.
This paper proposes a fully differentiable building block that can perform all of these tasks separately while maintaining interpretability (a toy illustration of the frequency-domain idea appears after this list).
arXiv Detail & Related papers (2021-05-10T19:48:42Z)
- Motion Segmentation using Frequency Domain Transformer Networks [29.998917158604694]
We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately.
Our approach can outperform some widely used video prediction methods, such as the Video Ladder Network and Predictive Gated Pyramids, on synthetic data.
arXiv Detail & Related papers (2020-04-18T15:05:11Z)
- Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)
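The two frequency-domain entries above share one underlying idea: motion between frames shows up as phase shifts in the Fourier domain, and extrapolating the phase predicts the next frame. Below is a deliberately toy, global version of that idea; the actual Local Frequency Domain Transformer Networks operate on local windows with learned, differentiable components, so everything here is a simplifying assumption.

```python
# Toy phase-extrapolation predictor: estimate the per-frequency phase shift
# between two frames and apply it once more. Conveys the intuition only.
import torch

def predict_next_frame(prev: torch.Tensor, curr: torch.Tensor) -> torch.Tensor:
    # prev, curr: (H, W) grayscale frames as float tensors
    f_prev = torch.fft.fft2(prev)
    f_curr = torch.fft.fft2(curr)
    eps = 1e-8
    shift = f_curr / (f_prev + eps)      # ratio encodes the inter-frame change
    shift = shift / (shift.abs() + eps)  # keep only the phase component
    f_next = f_curr * shift              # extrapolate the motion one step
    return torch.fft.ifft2(f_next).real

# Example: for a pure translation, this reproduces the same shift again.
# pred = predict_next_frame(frames[-2], frames[-1])
```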