STDepthFormer: Predicting Spatio-temporal Depth from Video with a
Self-supervised Transformer Model
- URL: http://arxiv.org/abs/2303.01196v1
- Date: Thu, 2 Mar 2023 12:22:51 GMT
- Title: STDepthFormer: Predicting Spatio-temporal Depth from Video with a
Self-supervised Transformer Model
- Authors: Houssem Boulahbal, Adrian Voicila, Andrew Comport
- Abstract summary: A self-supervised model that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network is proposed.
The proposed model leverages prior scene knowledge such as object shape and texture, similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, a self-supervised model that simultaneously predicts a
sequence of future frames from video input with a novel spatio-temporal
attention (ST) network is proposed. The ST transformer network constrains both
temporal consistency across future frames and spatial consistency across
objects in the image at different scales. This was
not the case in prior works for depth prediction, which focused on predicting a
single frame as output. The proposed model leverages prior scene knowledge such
as object shape and texture similar to single-image depth inference methods,
whilst also constraining the motion and geometry from a sequence of input
images. Apart from the transformer architecture, one of the main contributions
with respect to prior works lies in the objective function that enforces
spatio-temporal consistency across a sequence of output frames rather than a
single output frame. As will be shown, this results in more accurate and robust
depth sequence forecasting. The model achieves highly accurate depth
forecasting results that outperform existing baselines on the KITTI benchmark.
Extensive ablation studies were performed to assess the effectiveness of the
proposed techniques. One remarkable result of the proposed model is that it is
implicitly capable of forecasting the motion of objects in the scene, rather
than requiring complex models involving multi-object detection, segmentation
and tracking.
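To make the architecture concrete, below is a minimal sketch of a spatio-temporal attention block of the kind the abstract describes: spatial self-attention within each frame, followed by temporal self-attention across the frame sequence at each spatial location. All class names, tensor shapes and hyper-parameters here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical spatio-temporal (ST) attention block for depth-sequence
# forecasting. Shapes and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    """Spatial self-attention within each frame, then temporal
    self-attention across frames at each spatial location."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, dim); tokens are patch embeddings of a frame.
        b, t, n, d = x.shape

        # Spatial attention: attend over the tokens of each frame independently.
        s = x.reshape(b * t, n, d)
        q = self.norm1(s)
        s = s + self.spatial_attn(q, q, q)[0]

        # Temporal attention: attend over time at each spatial location.
        v = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        q = self.norm2(v)
        v = v + self.temporal_attn(q, q, q)[0]

        # Position-wise feed-forward with residual connection.
        v = v + self.mlp(self.norm3(v))
        return v.reshape(b, n, t, d).permute(0, 2, 1, 3)
```

A stack of such blocks over patch embeddings of the input clip, with a depth decoder per future time step, would realise the kind of sequence-to-sequence depth forecasting described above.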
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
FoundationPose is a unified foundation model for 6D object pose estimation and tracking.
Our approach can be instantly applied at test-time to a novel object without fine-tuning.
arXiv Detail & Related papers (2023-12-13T18:28:09Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes
We propose a novel Dyna-DepthFormer framework, which predicts scene depth and the 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- Forecasting of depth and ego-motion with transformers and self-supervision
This paper addresses the problem of end-to-end self-supervised forecasting of depth and ego-motion.
Given a sequence of raw images, the aim is to forecast both the geometry and ego-motion using a self-supervised photometric loss (see the sketch after this list).
The architecture is designed using both convolution and transformer modules.
arXiv Detail & Related papers (2022-06-15T10:14:11Z)
- Keypoint-Based Category-Level Object Pose Tracking from an RGB Sequence with Uncertainty Estimation
We propose a category-level 6-DoF pose estimation algorithm that simultaneously detects and tracks instances of objects within a known category.
Our method takes as input the previous and current frames from a monocular RGB video, as well as predictions from the previous frame, to predict the bounding cuboid and pose.
Our framework allows the system to take previous uncertainties into consideration when predicting the current frame, resulting in predictions that are more accurate and stable than those of single-frame methods.
arXiv Detail & Related papers (2022-05-23T05:20:22Z)
- Instance-aware multi-object self-supervision for monocular depth prediction
This paper proposes a self-supervised monocular image-to-depth prediction framework that is trained with an end-to-end photometric loss.
Self-supervision is performed by warping the images across a video sequence using depth and scene motion, including that of individual object instances.
arXiv Detail & Related papers (2022-03-02T00:59:25Z)
- Panoptic Segmentation Forecasting
Our goal is to forecast the near future given a set of recent observations.
We think this ability to forecast, i.e., to anticipate, is integral for the success of autonomous agents.
We develop a two-component model: one component learns the dynamics of the background stuff by anticipating odometry, while the other anticipates the dynamics of detected things.
arXiv Detail & Related papers (2021-04-08T17:59:16Z)
- Self-Supervision by Prediction for Object Discovery in Videos
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
- Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
- Motion Segmentation using Frequency Domain Transformer Networks
We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately.
Our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.
arXiv Detail & Related papers (2020-04-18T15:05:11Z)
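Several of the papers above, and STDepthFormer itself, build their self-supervision on a photometric loss obtained by warping a source frame into the target view using predicted depth and relative pose. The sketch below is a generic version of that machinery; the function names, the plain L1 error, and the unweighted sum over the forecast horizon are illustrative assumptions rather than any single paper's implementation.

```python
# Generic self-supervised photometric warping loss (illustrative sketch).
import torch
import torch.nn.functional as F


def warp_to_target(src, depth, K, K_inv, T):
    """Inverse-warp a source frame into the target view.

    src:      (B, 3, H, W) source image
    depth:    (B, 1, H, W) depth predicted for the target view
    K, K_inv: (B, 3, 3) camera intrinsics and their inverse
    T:        (B, 4, 4) relative pose taking target-frame points to the source camera
    """
    b, _, h, w = src.shape
    dtype, device = src.dtype, src.device

    # Homogeneous pixel grid of the target view, shape (B, 3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=dtype, device=device),
        torch.arange(w, dtype=dtype, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)])
    pix = pix.reshape(1, 3, -1).expand(b, -1, -1)

    # Back-project to 3D, move into the source camera, re-project to pixels.
    cam = (K_inv @ pix) * depth.reshape(b, 1, -1)
    cam = torch.cat([cam, torch.ones(b, 1, h * w, dtype=dtype, device=device)], dim=1)
    src_pix = K @ (T @ cam)[:, :3]
    src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)

    # Normalise to [-1, 1] and bilinearly sample the source image.
    x_n = 2 * src_pix[:, 0] / (w - 1) - 1
    y_n = 2 * src_pix[:, 1] / (h - 1) - 1
    grid = torch.stack([x_n, y_n], dim=-1).reshape(b, h, w, 2)
    return F.grid_sample(src, grid, padding_mode="border", align_corners=True)


def photometric_loss(target, warped):
    # Plain L1 error; full implementations usually add an SSIM term and a
    # per-pixel minimum over several source frames to handle occlusion.
    return (target - warped).abs().mean()


def sequence_loss(targets, warped_seq):
    # STDepthFormer-style idea: accumulate the photometric term over the
    # whole forecast horizon instead of a single output frame
    # (assumption: a simple unweighted sum).
    return sum(photometric_loss(t, w) for t, w in zip(targets, warped_seq))
```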