SurroundDepth: Entangling Surrounding Views for Self-Supervised
Multi-Camera Depth Estimation
- URL: http://arxiv.org/abs/2204.03636v1
- Date: Thu, 7 Apr 2022 17:58:47 GMT
- Title: SurroundDepth: Entangling Surrounding Views for Self-Supervised
Multi-Camera Depth Estimation
- Authors: Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Yongming Rao, Guan
Huang, Jiwen Lu, Jie Zhou
- Abstract summary: We propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse information from multiple views.
In experiments, our method achieves state-of-the-art performance on challenging multi-camera depth estimation datasets.
- Score: 101.55622133406446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Depth estimation from images serves as the fundamental step of 3D perception
for autonomous driving and is an economical alternative to expensive depth
sensors like LiDAR. The temporal photometric consistency enables
self-supervised depth estimation without labels, further facilitating its
application. However, most existing methods predict the depth solely based on
each monocular image and ignore the correlations among multiple surrounding
cameras, which are typically available for modern self-driving vehicles. In
this paper, we propose SurroundDepth, a method that incorporates information
from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views
and propose a cross-view transformer to effectively fuse the information from
multiple views. We apply cross-view self-attention to efficiently enable the
global interactions between multi-camera feature maps. Different from
self-supervised monocular depth estimation, we are able to predict real-world
scales given multi-camera extrinsic matrices. To achieve this goal, we adopt
structure-from-motion to extract scale-aware pseudo depths to pretrain the
models. Further, instead of predicting the ego-motion of each individual
camera, we estimate a universal ego-motion of the vehicle and transfer it to
each view to achieve multi-view consistency. In experiments, our method
achieves state-of-the-art performance on the challenging multi-camera depth
estimation datasets DDAD and nuScenes.
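The abstract's two key mechanisms can be made concrete with a small sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' released implementation: (a) cross-view self-attention, where tokens from all surrounding cameras are concatenated into one sequence so every feature-map location can attend to every location in every other camera, and (b) transferring a single vehicle ego-motion to each camera frame through its extrinsic matrix. The module name CrossViewAttention, the helper transfer_ego_motion, and all tensor shapes are assumptions for illustration.

```python
# A minimal sketch (PyTorch), not the authors' released code. It illustrates the two
# ideas named in the abstract: cross-view self-attention over per-camera feature maps,
# and transferring one vehicle ego-motion to each camera via its extrinsics.
# CrossViewAttention, transfer_ego_motion, and all shapes are hypothetical.
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    """Fuse features from N surrounding cameras with multi-head self-attention.

    Tokens from all views are concatenated into a single sequence so that every
    spatial location can attend to every location in every camera.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, C, H, W) -- batch, cameras, channels, height, width
        b, n, c, h, w = feats.shape
        tokens = feats.flatten(3).permute(0, 1, 3, 2).reshape(b, n * h * w, c)
        fused, _ = self.attn(tokens, tokens, tokens)   # global cross-view interaction
        tokens = self.norm(tokens + fused)             # residual + layer norm
        return tokens.reshape(b, n, h * w, c).permute(0, 1, 3, 2).reshape(b, n, c, h, w)


def transfer_ego_motion(T_vehicle: torch.Tensor, extrinsics: torch.Tensor) -> torch.Tensor:
    """Map one 4x4 vehicle ego-motion to every camera frame.

    extrinsics: (N, 4, 4) camera-to-vehicle transforms E_i.
    The per-camera motion is T_i = E_i^{-1} @ T_vehicle @ E_i.
    """
    return torch.inverse(extrinsics) @ T_vehicle.unsqueeze(0) @ extrinsics


# Usage with dummy inputs: six surround cameras, low-resolution feature maps.
feats = torch.randn(1, 6, 64, 12, 20)
fused = CrossViewAttention(dim=64)(feats)                                 # same shape as input
T_cams = transfer_ego_motion(torch.eye(4), torch.eye(4).repeat(6, 1, 1))  # (6, 4, 4)
```

Concatenating all views into one attention sequence is what realizes the global cross-camera interaction described above, and the conjugation E_i^{-1} T E_i is the standard change of frame that keeps the per-camera poses consistent with a single vehicle motion.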
Related papers
- SDGE: Stereo Guided Depth Estimation for 360$^\circ$ Camera Sets [65.64958606221069]
Multi-camera systems are often used in autonomous driving to achieve a 360$^\circ$ perception.
These 360$^\circ$ camera sets often have limited or low-quality overlap regions, making multi-view stereo methods infeasible for the entire image.
We propose the Stereo Guided Depth Estimation (SGDE) method, which enhances depth estimation of the full image by explicitly utilizing multi-view stereo results on the overlap.
arXiv Detail & Related papers (2024-02-19T02:41:37Z)
- Robust Self-Supervised Extrinsic Self-Calibration [25.727912226753247]
Multi-camera self-supervised monocular depth estimation from videos is a promising way to reason about the environment.
We introduce a novel method for extrinsic calibration that builds upon the principles of self-supervised monocular depth and ego-motion learning.
arXiv Detail & Related papers (2023-08-04T06:20:20Z)
- A Simple Baseline for Supervised Surround-view Depth Estimation [25.81521612343612]
We propose S3Depth, a Simple Baseline for Supervised Surround-view Depth Estimation.
We employ a global-to-local feature extraction module that combines CNNs with transformer layers for enriched representations.
Our method achieves superior performance over existing state-of-the-art methods on both DDAD and nuScenes datasets.
arXiv Detail & Related papers (2023-03-14T10:06:19Z)
- Multi-Camera Collaborative Depth Prediction via Consistent Structure Estimation [75.99435808648784]
We propose a novel multi-camera collaborative depth prediction method.
It does not require large overlapping areas while maintaining structure consistency between cameras.
Experimental results on the DDAD and nuScenes datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2022-10-05T03:44:34Z)
- CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection [10.696619570924778]
We propose Cross-view and Depth-guided Transformers for 3D Object Detection, CrossDTR.
Our method surpasses existing multi-camera methods by about 10 percent in pedestrian detection and about 3 percent in the overall mAP and NDS metrics.
arXiv Detail & Related papers (2022-09-27T16:23:12Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- SVDistNet: Self-Supervised Near-Field Distance Estimation on Surround View Fisheye Cameras [30.480562747903186]
A 360° perception of scene geometry is essential for automated driving, notably for parking and urban driving scenarios.
We present novel camera-geometry adaptive multi-scale convolutions which utilize the camera parameters as a conditional input.
We evaluate our approach on the Fisheye WoodScape surround-view dataset, significantly improving over previous approaches.
arXiv Detail & Related papers (2021-04-09T15:20:20Z)
- Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z)
- Full Surround Monodepth from Multiple Cameras [31.145598985137468]
We extend self-supervised monocular depth and ego-motion estimation to large-baseline multi-camera rigs.
We learn a single network generating dense, consistent, and scale-aware point clouds that cover the same full surround 360 degree field of view as a typical LiDAR scanner.
arXiv Detail & Related papers (2021-03-31T22:52:04Z)
- Video Depth Estimation by Fusing Flow-to-Depth Proposals [65.24533384679657]
We present an approach with a differentiable flow-to-depth layer for video depth estimation.
The model consists of a flow-to-depth layer, a camera pose refinement module, and a depth fusion network.
Our approach outperforms state-of-the-art depth estimation methods and has reasonable cross-dataset generalization capability.
arXiv Detail & Related papers (2019-12-30T10:45:57Z)
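Zooming in on the flow-to-depth entry above: the underlying idea can be stated as plain two-view geometry, where an optical-flow correspondence plus a known relative pose gives per-pixel depth by triangulation. The sketch below is a hypothetical illustration of that textbook relation, not the paper's differentiable layer; the function name flow_to_depth and all shapes are assumptions.

```python
# Hypothetical sketch of the general flow-to-depth idea: triangulate per-pixel depth
# from an optical-flow correspondence and a known relative pose (R, t).
# Not the paper's layer; names and shapes are illustrative only.
import torch


def flow_to_depth(flow: torch.Tensor, K: torch.Tensor,
                  R: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """flow: (H, W, 2) flow from frame 1 to frame 2, K: (3, 3) intrinsics,
    R: (3, 3), t: (3,) pose of frame 1 expressed in frame 2.
    Returns depth in frame 1, shape (H, W)."""
    H, W, _ = flow.shape
    K_inv = torch.inverse(K)

    # Pixel grid for frame 1 and its flow-displaced matches in frame 2.
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    p1 = torch.stack([u, v, torch.ones_like(u)], dim=-1)           # (H, W, 3)
    p2 = p1.clone()
    p2[..., :2] += flow

    # Back-project to rays and solve, per pixel, the 3x2 least-squares system
    #   d2 * x2 = R @ (d1 * x1) + t   =>   [R x1 | -x2] [d1, d2]^T = -t
    x1 = (K_inv @ p1.reshape(-1, 3).T).T                            # (N, 3)
    x2 = (K_inv @ p2.reshape(-1, 3).T).T                            # (N, 3)
    A = torch.stack([(R @ x1.T).T, -x2], dim=-1)                    # (N, 3, 2)
    b = -t.expand(A.shape[0], 3).unsqueeze(-1)                      # (N, 3, 1)
    d = torch.linalg.lstsq(A, b).solution[..., 0, 0]                # depth along ray 1
    return d.reshape(H, W)


# Usage with dummy inputs: identity rotation, 0.5 m sideways translation.
K = torch.tensor([[500., 0., 64.], [0., 500., 48.], [0., 0., 1.]])
depth = flow_to_depth(torch.full((96, 128, 2), 2.0), K,
                      torch.eye(3), torch.tensor([0.5, 0., 0.]))
```

Each pixel yields a 3x2 linear system in the two unknown depths; the least-squares depth along the first ray is the raw proposal, which pose-refinement and depth-fusion stages (as named in the paper) would then improve.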
This list is automatically generated from the titles and abstracts of the papers on this site.