Forecasting of depth and ego-motion with transformers and
self-supervision
- URL: http://arxiv.org/abs/2206.07435v1
- Date: Wed, 15 Jun 2022 10:14:11 GMT
- Title: Forecasting of depth and ego-motion with transformers and
self-supervision
- Authors: Houssem Boulahbal, Adrian Voicila and Andrew Comport
- Abstract summary: This paper addresses the problem of end-to-end self-supervised forecasting of depth and ego-motion.
Given a sequence of raw images, the aim is to forecast both the geometry and ego-motion using a self-supervised photometric loss.
The architecture is designed using both convolution and transformer modules.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of end-to-end self-supervised forecasting of
depth and ego-motion. Given a sequence of raw images, the aim is to forecast
both the geometry and the ego-motion using a self-supervised photometric loss.
The architecture is designed using both convolution and transformer modules,
leveraging the benefits of both: the inductive bias of CNNs and the multi-head
attention of transformers, yielding a rich spatio-temporal representation that
enables accurate depth forecasting. Prior work attempts to solve this problem
using multi-modal input/output with supervised ground-truth data, which is
impractical because a large annotated dataset would be required. In contrast,
this paper forecasts depth and ego-motion using only raw images as input and
self-supervision for training. The approach performs well on the KITTI
benchmark, with several performance criteria even comparable to prior
non-forecasting self-supervised monocular depth inference methods.
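For concreteness, here is a minimal sketch of the self-supervised photometric
objective described above: the forecast depth and ego-motion synthesize
(warp) a view of the future frame, and the loss compares the synthesized
image against the real one. This is an assumed sketch, not the authors' code;
the 3x3 SSIM window and the alpha = 0.85 weighting follow common practice in
self-supervised depth (e.g. Monodepth2), not details taken from this paper.

```python
# Hedged sketch of a weighted SSIM + L1 photometric loss (common practice,
# assumed here; not this paper's exact formulation).
import torch
import torch.nn.functional as F

def ssim(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Simplified per-pixel SSIM distance over 3x3 windows, (B, C, H, W)."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(warped: torch.Tensor, target: torch.Tensor,
                     alpha: float = 0.85) -> torch.Tensor:
    """Weighted SSIM + L1 error between the synthesized and real frame."""
    l1 = (warped - target).abs()
    return (alpha * ssim(warped, target) + (1 - alpha) * l1).mean()
```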
Related papers
- PPEA-Depth: Progressive Parameter-Efficient Adaptation for
Self-Supervised Monocular Depth Estimation [24.68378829544394]
We propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to transfer a pre-trained image model for self-supervised depth estimation.
The training comprises two sequential stages: an initial phase trained on a dataset composed primarily of static scenes, followed by expansion to more intricate datasets.
Experiments demonstrate that PPEA-Depth achieves state-of-the-art performance on KITTI, CityScapes and DDAD datasets.
arXiv Detail & Related papers (2023-12-20T14:45:57Z) - STDepthFormer: Predicting Spatio-temporal Depth from Video with a
Self-supervised Transformer Model [0.0]
A self-supervised model that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network is proposed.
The proposed model leverages prior scene knowledge such as object shape and texture, similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z) - Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth
Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which jointly predicts scene depth and the 3D motion field.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic priors.
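As an illustration of the self-/cross-attention pattern this summary
describes, the sketch below has reference-view tokens first attend to
themselves and then query a source view for correspondences. The module name,
dimensions, and layer layout are hypothetical, not the Dyna-Depthformer
implementation.

```python
# Hedged sketch of cross-view attention for multi-view depth features.
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, ref: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
        # ref, src: (B, N, dim) flattened feature maps of two views.
        ref = self.norm1(ref + self.self_attn(ref, ref, ref)[0])
        # Reference tokens query the source view for multi-view correlation.
        return self.norm2(ref + self.cross_attn(ref, src, src)[0])
```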
arXiv Detail & Related papers (2023-01-14T09:43:23Z) - Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular
Depth Estimation [33.018300966769516]
Most state-of-the-art (SOTA) works in the self-supervised and unsupervised domains predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
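A rough sketch of the mask-guided fusion idea follows: a learned per-pixel
gate blends local CNN features with global transformer features. The gate
architecture and channel counts are assumptions for illustration, not the
paper's actual multi-stream design.

```python
# Hedged sketch: per-pixel gating between local and global feature streams.
import torch
import torch.nn as nn

class MaskGuidedFusion(nn.Module):
    def __init__(self, channels: int = 128):
        super().__init__()
        # Predict a per-pixel gate from the concatenated streams.
        self.mask_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, local_feat: torch.Tensor,
                global_feat: torch.Tensor) -> torch.Tensor:
        # local_feat: CNN encoder features; global_feat: transformer
        # features resampled to the same (B, C, H, W) resolution.
        m = self.mask_head(torch.cat([local_feat, global_feat], dim=1))
        return m * local_feat + (1 - m) * global_feat
```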
arXiv Detail & Related papers (2022-11-20T20:00:21Z) - DeepRM: Deep Recurrent Matching for 6D Pose Refinement [77.34726150561087]
DeepRM is a novel recurrent network architecture for 6D pose refinement.
The architecture incorporates LSTM units to propagate information through each refinement step.
DeepRM achieves state-of-the-art performance on two widely accepted challenging datasets.
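The recurrent-refinement idea reads roughly as sketched below: an LSTM cell
carries state across refinement steps, and each step emits a small 6-DoF pose
update that is accumulated. This is hypothetical, not DeepRM itself; in a full
system the matching features would be recomputed from a re-rendered estimate
at every step, which is omitted here for brevity.

```python
# Hedged sketch of LSTM-based iterative pose refinement.
import torch
import torch.nn as nn

class RecurrentPoseRefiner(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.delta_head = nn.Linear(hidden, 6)  # 6-DoF pose update

    def forward(self, feats: torch.Tensor, pose: torch.Tensor,
                steps: int = 3) -> torch.Tensor:
        # feats: (B, feat_dim) matching features; pose: (B, 6) initial pose.
        h = feats.new_zeros(feats.size(0), self.cell.hidden_size)
        c = torch.zeros_like(h)
        for _ in range(steps):
            h, c = self.cell(feats, (h, c))
            pose = pose + self.delta_head(h)  # accumulate the update
        return pose
```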
arXiv Detail & Related papers (2022-05-28T16:18:08Z) - SurroundDepth: Entangling Surrounding Views for Self-Supervised
Multi-Camera Depth Estimation [101.55622133406446]
We propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets.
arXiv Detail & Related papers (2022-04-07T17:58:47Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for
Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z) - Learning Monocular Dense Depth from Events [53.078665310545745]
Event cameras report brightness changes as a stream of asynchronous events instead of intensity frames.
Recent learning-based approaches have been applied to event-based data for tasks such as monocular depth prediction.
We propose a recurrent architecture to solve this task and show significant improvement over standard feed-forward methods.
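A minimal sketch of the recurrent idea: events are binned into voxel grids per
time window, and a convolutional gated state (a ConvGRU-style update standing
in for the paper's recurrent blocks) carries scene information across windows
so each prediction benefits from past events. Shapes and layer choices are
assumptions, not the paper's architecture.

```python
# Hedged sketch of recurrent depth prediction from event voxel grids.
from typing import Optional
import torch
import torch.nn as nn

class RecurrentEventDepth(nn.Module):
    def __init__(self, bins: int = 5, hidden: int = 32):
        super().__init__()
        self.encode = nn.Conv2d(bins, hidden, 3, padding=1)
        # Simple convolutional gated update in place of a full ConvLSTM.
        self.gate = nn.Conv2d(2 * hidden, hidden, 3, padding=1)
        self.update = nn.Conv2d(2 * hidden, hidden, 3, padding=1)
        self.depth_head = nn.Conv2d(hidden, 1, 3, padding=1)

    def forward(self, voxels: torch.Tensor,
                state: Optional[torch.Tensor] = None):
        # voxels: (B, bins, H, W) event voxel grid for one time window.
        x = torch.relu(self.encode(voxels))
        if state is None:
            state = torch.zeros_like(x)
        z = torch.sigmoid(self.gate(torch.cat([x, state], dim=1)))
        cand = torch.tanh(self.update(torch.cat([x, state], dim=1)))
        state = (1 - z) * state + z * cand  # ConvGRU-style state update
        depth = torch.sigmoid(self.depth_head(state))  # normalized depth
        return depth, state
```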
arXiv Detail & Related papers (2020-10-16T12:36:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.