Related papers: Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions

Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions

URL: http://arxiv.org/abs/2302.11850v2
Date: Mon, 31 Jul 2023 09:35:08 GMT
Title: Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions
Authors: Angel Villar-Corrales, Ismail Wahdan and Sven Behnke
Abstract summary: We propose a novel framework for the task of object-centric video prediction, i.e., extracting the structure of a video sequence, as well as modeling objects dynamics and interactions from visual observations. With the goal of learning meaningful object representations, we propose two object-centric video predictor (OCVP) transformer modules, which de-couple processing of temporal dynamics and object interactions. In our experiments, we show how our object-centric prediction framework utilizing our OCVP predictors outperforms object-agnostic video prediction models on two different datasets.
Score: 27.112210225969733
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose a novel framework for the task of object-centric video prediction, i.e., extracting the compositional structure of a video sequence, as well as modeling objects dynamics and interactions from visual observations in order to predict the future object states, from which we can then generate subsequent video frames. With the goal of learning meaningful spatio-temporal object representations and accurately forecasting object states, we propose two novel object-centric video predictor (OCVP) transformer modules, which decouple the processing of temporal dynamics and object interactions, thus presenting an improved prediction performance. In our experiments, we show how our object-centric prediction framework utilizing our OCVP predictors outperforms object-agnostic video prediction models on two different datasets, while maintaining consistent and accurate object representations.

Related papers

VideoPCDNet: Video Parsing and Prediction with Phase Correlation Networks [14.933024847952618]
We present VideoPCDNet, an unsupervised framework for object-centric video decomposition and prediction.<n>Our model uses frequency-domain phase correlation techniques to parse videos into object components, which are represented as transformed versions of learned object prototypes.
arXiv Detail & Related papers (2025-06-24T13:39:47Z)
Object-Centric Image to Video Generation with Language Guidance [17.50161162624179]
TextOCVP is an object-centric model for image-to-video generation guided by textual descriptions. Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions.
arXiv Detail & Related papers (2025-02-17T10:46:47Z)
Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks. We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
Collaboratively Self-supervised Video Representation Learning for Action Recognition [58.195372471117615]
We design a Collaboratively Self-supervised Video Representation learning framework specific to action recognition. Our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2024-01-15T10:42:04Z)
Object-centric Video Representation for Long-term Action Anticipation [33.115854386196126]
Key motivation is that objects provide important cues to recognize and predict human-object interactions. We propose to build object-centric video representations by leveraging visual-language pretrained models. To recognize and predict human-object interactions, we use a Transformer-based neural architecture.
arXiv Detail & Related papers (2023-10-31T22:54:31Z)
Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of Short-Term Object interaction anticipation (STA) We propose NAOGAT, a multi-modal end-to-end transformer network, to guide the model to predict context-aware future actions. Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z)
STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model [0.0]
Self-supervised model simultaneously predicts a sequence of future frames from video-input with a spatial-temporal attention network is proposed. The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods. It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z)
Learn to Predict How Humans Manipulate Large-sized Objects from Interactive Motions [82.90906153293585]
We propose a graph neural network, HO-GCN, to fuse motion data and dynamic descriptors for the prediction task. We show the proposed network that consumes dynamic descriptors can achieve state-of-the-art prediction results and help the network better generalize to unseen objects.
arXiv Detail & Related papers (2022-06-25T09:55:39Z)
Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework to integrate these complementary attributes to predict complex pixel dynamics through deep networks. We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames. We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
Fourier-based Video Prediction through Relational Object Motion [28.502280038100167]
Deep recurrent architectures have been applied to the task of video prediction. Here, we explore a different approach by using frequency-domain approaches for video prediction. The resulting predictions are consistent with the observed dynamics in a scene and do not suffer from blur.
arXiv Detail & Related papers (2021-10-12T10:43:05Z)
Visual Relationship Forecasting in Videos [56.122037294234865]
We present a new task named Visual Relationship Forecasting (VRF) in videos to explore the prediction of visual relationships in a manner of reasoning. Given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence. To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series oftemporally localized visual relation annotations in a video.
arXiv Detail & Related papers (2021-07-02T16:43:19Z)
Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation. Our framework can be trained without the help of any manual annotation or pretrained network. Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
Motion Segmentation using Frequency Domain Transformer Networks [29.998917158604694]
We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately. Our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.
arXiv Detail & Related papers (2020-04-18T15:05:11Z)
Future Video Synthesis with Object Motion Prediction [54.31508711871764]
Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics. The appearance of the scene components in the future is predicted by non-rigid deformation of the background and affine transformation of moving objects. Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state-of-the-art in terms of visual quality and accuracy.
arXiv Detail & Related papers (2020-04-01T16:09:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.