STDepthFormer: Predicting Spatio-temporal Depth from Video with a
Self-supervised Transformer Model
- URL: http://arxiv.org/abs/2303.01196v1
- Date: Thu, 2 Mar 2023 12:22:51 GMT
- Title: STDepthFormer: Predicting Spatio-temporal Depth from Video with a
Self-supervised Transformer Model
- Authors: Houssem Boulahbal, Adrian Voicila, Andrew Comport
- Abstract summary: A self-supervised model is proposed that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network.
The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, a self-supervised model that simultaneously predicts a
sequence of future frames from video input with a novel spatio-temporal
attention (ST) network is proposed. The ST transformer network constrains
temporal consistency across future frames whilst also constraining
consistency across spatial objects in the image at different scales. This was
not the case in prior works for depth prediction, which focused on predicting a
single frame as output. The proposed model leverages prior scene knowledge such
as object shape and texture similar to single-image depth inference methods,
whilst also constraining the motion and geometry from a sequence of input
images. Apart from the transformer architecture, one of the main contributions
with respect to prior works lies in the objective function that enforces
spatio-temporal consistency across a sequence of output frames rather than a
single output frame. As will be shown, this results in more accurate and robust
depth sequence forecasting. The model achieves highly accurate depth
forecasting results that outperform existing baselines on the KITTI benchmark.
Extensive ablation studies were performed to assess the effectiveness of the
proposed techniques. One remarkable result of the proposed model is that it is
implicitly capable of forecasting the motion of objects in the scene, rather
than requiring complex models involving multi-object detection, segmentation
and tracking.
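The abstract's main stated contribution is an objective that enforces spatio-temporal consistency across the whole sequence of output frames rather than a single frame. As a minimal illustrative sketch only (the function names, the temporal-difference term, and the weighting are assumptions, not the paper's actual loss), such a sequence-level objective might look like:

```python
def photometric_l1(pred, target):
    """Mean absolute difference between two frames (flat lists of pixel values)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def sequence_forecast_loss(pred_frames, gt_frames, temporal_weight=0.5):
    """Hypothetical sequence-level objective: a photometric term summed over
    every forecast frame, plus a temporal term penalising frame-to-frame
    changes in the prediction that disagree with those in the ground truth."""
    # Spatial/photometric consistency, applied to the whole output sequence.
    photo = sum(photometric_l1(p, g) for p, g in zip(pred_frames, gt_frames))
    # Temporal consistency: compare successive-frame differences.
    temporal = sum(
        photometric_l1(
            [b - a for a, b in zip(pred_frames[i], pred_frames[i + 1])],
            [b - a for a, b in zip(gt_frames[i], gt_frames[i + 1])],
        )
        for i in range(len(pred_frames) - 1)
    )
    return photo + temporal_weight * temporal
```

The key design point, per the abstract, is that the loss ranges over all forecast frames jointly, so errors that accumulate over the horizon are penalised rather than only the final frame.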
Related papers
- Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation [50.05866669110754]
Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules. We show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation.
arXiv Detail & Related papers (2025-12-19T15:15:58Z) - Flow and Depth Assisted Video Prediction with Latent Transformer [6.973908410173025]
We present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
arXiv Detail & Related papers (2025-11-20T15:54:33Z) - PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control [67.17998939712326]
We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states, such as 3D keypoints or joint angles, from a single RGB image. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics.
arXiv Detail & Related papers (2025-09-29T10:55:48Z) - From Editor to Dense Geometry Estimator [77.21804448599009]
We introduce FE2E, a framework that adapts an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. FE2E achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× the data.
arXiv Detail & Related papers (2025-09-04T15:58:50Z) - Overcoming Semantic Dilution in Transformer-Based Next Frame Prediction [0.9776703963093367]
Next-frame prediction in videos is crucial for applications such as autonomous driving, object tracking, and motion prediction.
Transformer-based next-frame prediction models, however, face notable issues such as semantic dilution.
We propose a Semantic Concentration Multi-Head Self-Attention architecture, which effectively mitigates semantic dilution.
arXiv Detail & Related papers (2025-01-28T07:12:29Z) - MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [55.77542145604758]
FoundationPose is a unified foundation model for 6D object pose estimation and tracking.
Our approach can be instantly applied at test-time to a novel object without fine-tuning.
arXiv Detail & Related papers (2023-12-13T18:28:09Z) - Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth
Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z) - Forecasting of depth and ego-motion with transformers and
self-supervision [0.0]
This paper addresses the problem of end-to-end self-supervised forecasting of depth and ego motion.
Given a sequence of raw images, the aim is to forecast both the geometry and ego-motion using a self-supervised photometric loss.
The architecture is designed using both convolution and transformer modules.
arXiv Detail & Related papers (2022-06-15T10:14:11Z) - Keypoint-Based Category-Level Object Pose Tracking from an RGB Sequence
with Uncertainty Estimation [29.06824085794294]
We propose a category-level 6-DoF pose estimation algorithm that simultaneously detects and tracks instances of objects within a known category.
Our method takes as input the previous and current frame from a monocular RGB video, as well as predictions from the previous frame, to predict the bounding cuboid and pose.
Our framework allows the system to take previous uncertainties into consideration when predicting the current frame, resulting in predictions that are more accurate and stable than single-frame methods.
arXiv Detail & Related papers (2022-05-23T05:20:22Z) - Instance-aware multi-object self-supervision for monocular depth
prediction [0.0]
This paper proposes a self-supervised monocular image-to-depth prediction framework that is trained with an end-to-end photometric loss.
Self-supervision is performed by warping the images across a video sequence using depth and scene motion including object instances.
arXiv Detail & Related papers (2022-03-02T00:59:25Z) - Panoptic Segmentation Forecasting [71.75275164959953]
Our goal is to forecast the near future given a set of recent observations.
We think this ability to forecast, i.e., to anticipate, is integral for the success of autonomous agents.
We develop a two-component model: one component learns the dynamics of the background stuff by anticipating odometry, the other one anticipates the dynamics of detected things.
arXiv Detail & Related papers (2021-04-08T17:59:16Z) - Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z) - Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection
Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z) - Motion Segmentation using Frequency Domain Transformer Networks [29.998917158604694]
We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately.
Our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.
arXiv Detail & Related papers (2020-04-18T15:05:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.