Multimodal Scale Consistency and Awareness for Monocular Self-Supervised Depth Estimation
- URL: http://arxiv.org/abs/2103.02451v1
- Date: Wed, 3 Mar 2021 15:39:41 GMT
- Title: Multimodal Scale Consistency and Awareness for Monocular Self-Supervised Depth Estimation
- Authors: Hemang Chawla, Arnav Varma, Elahe Arani, Bahram Zonooz
- Abstract summary: Self-supervised approaches on monocular videos suffer from scale-inconsistency across long sequences.
We propose a dynamically-weighted GPS-to-Scale (g2s) loss to complement the appearance-based losses.
We demonstrate scale-consistent and -aware depth estimation during inference, improving the performance even when training with low-frequency GPS data.
- Score: 1.1470070927586016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense depth estimation is essential to scene-understanding for autonomous
driving. However, recent self-supervised approaches on monocular videos suffer
from scale-inconsistency across long sequences. Utilizing data from the
ubiquitously copresent global positioning systems (GPS), we tackle this
challenge by proposing a dynamically-weighted GPS-to-Scale (g2s) loss to
complement the appearance-based losses. We emphasize that the GPS is needed
only during the multimodal training, and not at inference. The relative
distance between frames captured through the GPS provides a scale signal that
is independent of the camera setup and scene distribution, resulting in richer
learned feature representations. Through extensive evaluation on multiple
datasets, we demonstrate scale-consistent and -aware depth estimation during
inference, improving the performance even when training with low-frequency GPS
data.
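For illustration, here is a minimal sketch of how such a GPS-to-scale term could be implemented during training, assuming a pose network that predicts inter-frame translations and GPS distances precomputed between the same frame pairs. The linear epoch ramp used as the dynamic weight is an assumption for illustration, not necessarily the paper's exact weighting scheme:

```python
import torch

def g2s_loss(pred_translations: torch.Tensor,
             gps_distances: torch.Tensor,
             epoch: int, num_epochs: int) -> torch.Tensor:
    # pred_translations: (B, 3) inter-frame translations from the pose network.
    # gps_distances:     (B,)   metric distances between the same frame pairs,
    #                           precomputed from GPS fixes; training-time only.
    pred_dist = torch.linalg.norm(pred_translations, dim=-1)
    # Dynamic weight (assumed linear ramp): let appearance-based losses
    # dominate early training, then tighten the metric-scale constraint.
    weight = (epoch + 1) / num_epochs
    return weight * (pred_dist - gps_distances).abs().mean()
```

Because the GPS distances enter only through this loss, the trained depth and pose networks need no GPS at inference, matching the abstract's claim.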
Related papers
- Self-supervised Monocular Depth Estimation with Large Kernel Attention [30.44895226042849]
We propose a self-supervised monocular depth estimation network that recovers finer detail.
Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies.
Our method achieves competitive results on the KITTI dataset.
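For context, below is a compact sketch of a large-kernel-attention block, following the widely used decomposition of a large (~21x21) receptive field into depthwise, depthwise-dilated, and pointwise convolutions; the cited paper's decoder may differ in its exact design:

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Sketch of a large-kernel-attention block (illustrative, not the
    cited paper's exact decoder)."""
    def __init__(self, channels: int):
        super().__init__()
        # A large receptive field decomposed into three cheap convolutions:
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.dw_dilated = nn.Conv2d(channels, channels, 7, padding=9,
                                    dilation=3, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dw_dilated(self.dw(x)))  # long-range attention map
        return attn * x                              # gate features by attention
```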
arXiv Detail & Related papers (2024-09-26T14:44:41Z)
G-MEMP: Gaze-Enhanced Multimodal Ego-Motion Prediction in Driving [71.9040410238973]
We focus on inferring the ego trajectory of a driver's vehicle using their gaze data.
We develop G-MEMP, a novel multimodal ego-trajectory prediction network that combines GPS and video input with gaze data.
The results show that G-MEMP significantly outperforms state-of-the-art methods on both benchmarks.
arXiv Detail & Related papers (2023-12-13T23:06:30Z)
Compression of GPS Trajectories using Autoencoders [6.044912425856236]
We present an LSTM-autoencoder-based approach to compress and reconstruct GPS trajectories.
Its performance is compared to classical trajectory compression algorithms such as Douglas-Peucker.
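As a rough illustration, a minimal LSTM autoencoder of this kind, assuming 2-D (lat, lon) inputs compressed into a fixed-length code; the cited paper's architecture details are not reproduced here:

```python
import torch
import torch.nn as nn

class TrajectoryAutoencoder(nn.Module):
    """Illustrative LSTM autoencoder for GPS trajectories."""
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.encoder = nn.LSTM(input_size=2, hidden_size=hidden_size,
                               batch_first=True)
        self.decoder = nn.LSTM(input_size=hidden_size, hidden_size=hidden_size,
                               batch_first=True)
        self.out = nn.Linear(hidden_size, 2)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, seq_len, 2) sequence of (lat, lon) points.
        _, (h, _) = self.encoder(traj)
        code = h[-1]                    # fixed-size compressed representation
        seq_len = traj.size(1)
        dec_in = code.unsqueeze(1).expand(-1, seq_len, -1)  # repeat per step
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)        # reconstructed trajectory
```

Compression comes from storing only the fixed-size code per trajectory; reconstruction quality can then be compared against line-simplification baselines such as Douglas-Peucker.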
arXiv Detail & Related papers (2023-01-18T10:32:53Z)
Unsupervised Visual Odometry and Action Integration for PointGoal Navigation in Indoor Environment [14.363948775085534]
PointGoal navigation in indoor environments is a fundamental task that requires a personal robot to navigate to a specified point.
To improve PointGoal navigation accuracy without a GPS signal, we use visual odometry (VO) and propose a novel action integration module (AIM) trained in an unsupervised manner.
Experiments show that the proposed system achieves satisfactory results and outperforms the partially supervised learning algorithms on the popular Gibson dataset.
arXiv Detail & Related papers (2022-10-02T03:12:03Z)
Scalable and Real-time Multi-Camera Vehicle Detection, Re-Identification, and Tracking [58.95210121654722]
We propose a real-time city-scale multi-camera vehicle tracking system that handles real-world, low-resolution CCTV instead of idealized and curated video streams.
Our method is ranked among the top five performers on the public leaderboard.
arXiv Detail & Related papers (2022-04-15T12:47:01Z)
SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation [101.55622133406446]
We propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves state-of-the-art performance on challenging multi-camera depth estimation datasets.
arXiv Detail & Related papers (2022-04-07T17:58:47Z)
SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning [53.78813049373321]
We propose a self-supervised learning method that enables pre-trained supervised monocular depth networks to produce metrically scaled depth estimates.
Our approach is useful for various applications such as mobile robot navigation and is applicable to diverse environments.
arXiv Detail & Related papers (2022-03-10T12:28:42Z)
DeepScale: An Online Frame Size Adaptation Framework to Accelerate Visual Multi-object Tracking [8.878656943106934]
DeepScale is a model-agnostic frame-size selection approach for increasing tracking throughput.
It can find a suitable trade-off between tracking accuracy and speed by adapting frame sizes at run time.
Compared to a state-of-the-art tracker, DeepScale++, a variant of DeepScale, achieves a 1.57x speedup with only moderate degradation in tracking accuracy.
arXiv Detail & Related papers (2021-07-22T00:12:58Z)
Unsupervised Scale-consistent Depth Learning from Video [131.3074342883371]
We propose SC-Depth, a monocular depth estimator that requires only unlabelled videos for training.
Thanks to the capability of scale-consistent prediction, we show that our monocular-trained deep networks are readily integrated into the ORB-SLAM2 system.
The proposed hybrid Pseudo-RGBD SLAM shows compelling results on KITTI and generalizes well to the KAIST dataset without additional training.
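The scale-consistency mentioned here is typically enforced with a geometry-consistency penalty between depth predictions of adjacent frames; a hedged sketch of such a term follows (tensor names are illustrative, and details such as validity masking are omitted):

```python
import torch

def geometry_consistency_loss(depth_a_warped: torch.Tensor,
                              depth_b_interp: torch.Tensor) -> torch.Tensor:
    # depth_a_warped:  frame A's depth projected into frame B's view
    #                  using the predicted pose, shape (B, 1, H, W).
    # depth_b_interp:  frame B's predicted depth sampled at the same
    #                  projected pixel locations, shape (B, 1, H, W).
    # Normalising by the sum keeps the penalty bounded while still
    # forcing both predictions toward a common scale.
    diff = (depth_a_warped - depth_b_interp).abs()
    return (diff / (depth_a_warped + depth_b_interp).clamp(min=1e-7)).mean()
```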
arXiv Detail & Related papers (2021-05-25T02:17:56Z)
Unsupervised Monocular Depth Learning with Integrated Intrinsics and Spatio-Temporal Constraints [61.46323213702369]
This work presents an unsupervised learning framework that is able to predict at-scale depth maps and egomotion.
Our results demonstrate strong performance when compared to the current state-of-the-art on multiple sequences of the KITTI driving dataset.
arXiv Detail & Related papers (2020-11-02T22:26:58Z)