Motion Segmentation using Frequency Domain Transformer Networks
- URL: http://arxiv.org/abs/2004.08638v1
- Date: Sat, 18 Apr 2020 15:05:11 GMT
- Title: Motion Segmentation using Frequency Domain Transformer Networks
- Authors: Hafez Farazi and Sven Behnke
- Abstract summary: We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately.
Our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.
- Score: 29.998917158604694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised prediction is a powerful mechanism to learn representations
that capture the underlying structure of the data. Despite recent progress, the
self-supervised video prediction task is still challenging. One of the critical
factors that make the task hard is motion segmentation, which is segmenting
individual objects and the background and estimating their motion separately.
In video prediction, the shape, appearance, and transformation of each object
should be understood only by predicting the next frame in pixel space. To
address this task, we propose a novel end-to-end learnable architecture that
predicts the next frame by modeling foreground and background separately while
simultaneously estimating and predicting the foreground motion using Frequency
Domain Transformer Networks. Experimental evaluations show that this yields
interpretable representations and that our approach can outperform some widely
used video prediction methods like Video Ladder Network and Predictive Gated
Pyramids on synthetic data.
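The abstract describes predicting the next frame by estimating foreground motion in the frequency domain and compositing the moved foreground over a separately modeled background. The snippet below is a minimal NumPy sketch of that idea, not the authors' network: it assumes the foreground image, its soft mask, and the background are already given (in the paper they are learned end to end), it assumes pure translation with constant velocity, and it substitutes classical phase correlation for the learned transformer. The function names `phase_shift`, `estimate_shift`, and `predict_next_frame` are hypothetical.

```python
import numpy as np

def phase_shift(image, dy, dx):
    """Circularly shift a 2D image by (dy, dx) via the Fourier shift theorem."""
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies in cycles per pixel
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies in cycles per pixel
    phase = np.exp(-2j * np.pi * (fy * dy + fx * dx))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * phase))

def estimate_shift(prev, curr):
    """Estimate the integer translation from prev to curr by phase correlation."""
    cross = np.fft.fft2(curr) * np.conj(np.fft.fft2(prev))
    cross /= np.abs(cross) + 1e-8     # keep only the phase difference
    corr = np.real(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    # Map peak indices in the upper half of the range to negative shifts.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

def predict_next_frame(fg_prev, fg_curr, mask_curr, background):
    """Assumed constant-velocity prediction: move foreground again, then composite."""
    dy, dx = estimate_shift(fg_prev, fg_curr)
    fg_next = phase_shift(fg_curr, dy, dx)
    mask_next = np.clip(phase_shift(mask_curr, dy, dx), 0.0, 1.0)
    return mask_next * fg_next + (1.0 - mask_next) * background
```

Because the shift is applied as a phase multiplication in the Fourier domain, every step is differentiable with respect to the image, which is what makes this kind of operation usable inside a learnable architecture.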
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z)
- Semi-Weakly Supervised Object Kinematic Motion Prediction [56.282759127180306]
Given a 3D object, kinematic motion prediction aims to identify the mobile parts as well as the corresponding motion parameters.
We propose a graph neural network to learn the map between hierarchical part-level segmentation and mobile parts parameters.
The network predictions yield a large scale of 3D objects with pseudo labeled mobility information.
arXiv Detail & Related papers (2023-03-31T02:37:36Z)
- STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model [0.0]
A self-supervised model is proposed that simultaneously predicts a sequence of future frames from video input using a spatial-temporal attention network.
The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z)
- Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions [27.112210225969733]
We propose a novel framework for the task of object-centric video prediction, i.e., extracting the structure of a video sequence, as well as modeling object dynamics and interactions from visual observations.
With the goal of learning meaningful object representations, we propose two object-centric video predictor (OCVP) transformer modules, which de-couple processing of temporal dynamics and object interactions.
In our experiments, we show how our object-centric prediction framework utilizing our OCVP predictors outperforms object-agnostic video prediction models on two different datasets.
arXiv Detail & Related papers (2023-02-23T08:29:26Z)
- Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework to integrate these complementary attributes to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
- Semantic Prediction: Which One Should Come First, Recognition or Prediction? [21.466783934830925]
One of the primary downstream tasks is interpreting the scene's semantic composition and using it for decision-making.
There are two main ways to achieve the same outcome, given a pre-trained video prediction and pre-trained semantic extraction model.
We investigate these configurations using the Local Frequency Domain Transformer Network (LFDTN) as the video prediction model and U-Net as the semantic extraction model on synthetic and real datasets.
arXiv Detail & Related papers (2021-10-06T15:01:05Z)
- Local Frequency Domain Transformer Networks for Video Prediction [24.126513851779936]
Video prediction is of interest not only in anticipating visual changes in the real world but has, above all, emerged as an unsupervised learning rule.
This paper proposes a fully differentiable building block that can perform all of those tasks separately while maintaining interpretability.
arXiv Detail & Related papers (2021-05-10T19:48:42Z)
- Panoptic Segmentation Forecasting [71.75275164959953]
Our goal is to forecast the near future given a set of recent observations.
We think this ability to forecast, i.e., to anticipate, is integral for the success of autonomous agents.
We develop a two-component model: one component learns the dynamics of the background stuff by anticipating odometry, the other one anticipates the dynamics of detected things.
arXiv Detail & Related papers (2021-04-08T17:59:16Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)