Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using transformers
- URL: http://arxiv.org/abs/2312.14919v3
- Date: Tue, 21 May 2024 16:47:13 GMT
- Title: Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using transformers
- Authors: James Gunn, Zygmunt Lenyk, Anuj Sharma, Andrea Donati, Alexandru Buburuzan, John Redford, Romain Mueller
- Abstract summary: We introduce a novel fusion method that bypasses monocular depth estimation altogether.
We show that our model can modulate its use of camera features based on the availability of lidar features.
- Score: 39.14931758754381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Combining complementary sensor modalities is crucial to providing robust perception for safety-critical robotics applications such as autonomous driving (AD). Recent state-of-the-art camera-lidar fusion methods for AD rely on monocular depth estimation which is a notoriously difficult task compared to using depth information from the lidar directly. Here, we find that this approach does not leverage depth as expected and show that naively improving depth estimation does not lead to improvements in object detection performance. Strikingly, we also find that removing depth estimation altogether does not degrade object detection performance substantially, suggesting that relying on monocular depth could be an unnecessary architectural bottleneck during camera-lidar fusion. In this work, we introduce a novel fusion method that bypasses monocular depth estimation altogether and instead selects and fuses camera and lidar features in a bird's-eye-view grid using a simple attention mechanism. We show that our model can modulate its use of camera features based on the availability of lidar features and that it yields better 3D object detection on the nuScenes dataset than baselines relying on monocular depth estimation.
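The core idea described above, selecting camera features for each bird's-eye-view cell with a simple attention mechanism instead of lifting them via predicted monocular depth, can be illustrated with a minimal sketch. The single-head attention, flattened BEV grid, and residual connection below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bev_attention_fusion(lidar_bev, cam_feats):
    """Fuse camera features into a lidar BEV grid with single-head
    scaled dot-product attention.

    lidar_bev: (H*W, C) lidar features, one row per BEV cell (queries)
    cam_feats: (N, C)   camera features (keys and values)
    """
    C = lidar_bev.shape[-1]
    scores = lidar_bev @ cam_feats.T / np.sqrt(C)  # (H*W, N) similarity
    weights = softmax(scores, axis=-1)             # each cell attends over camera features
    attended = weights @ cam_feats                 # (H*W, C) camera features per cell
    return lidar_bev + attended                    # residual fusion with the lidar stream

# Toy example: 4 BEV cells, 6 camera features, 8 channels.
rng = np.random.default_rng(0)
fused = bev_attention_fusion(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
print(fused.shape)  # (4, 8)
```

Because no depth is predicted, the weighting between modalities is learned directly; when the camera contributes nothing (e.g. all-zero features), the residual path leaves the lidar features unchanged, which mirrors the paper's observation that the model can modulate its use of camera features.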
Related papers
- Instance-aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning [93.71280187657831]
The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
We propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector.
arXiv Detail & Related papers (2023-12-13T09:24:42Z)
- DeepFusion: A Robust and Modular 3D Object Detector for Lidars, Cameras and Radars [2.2166853714891057]
We propose a modular multi-modal architecture to fuse lidars, cameras and radars in different combinations for 3D object detection.
Specialized feature extractors take advantage of each modality and can be exchanged easily, making the approach simple and flexible.
Experimental results for lidar-camera, lidar-camera-radar and camera-radar fusion show the flexibility and effectiveness of our fusion approach.
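The exchangeable-extractor idea can be sketched as a registry mapping each modality to its own feature extractor, with the resulting features concatenated for fusion. The placeholder pooling functions below stand in for the real per-modality networks; all names are illustrative assumptions, not DeepFusion's API.

```python
import numpy as np

# Registry of per-modality feature extractors. Each maps a
# (N_points, C) array to a (1, C) feature vector; the poolings
# here are placeholders for the real specialized networks.
EXTRACTORS = {
    "lidar": lambda x: x.mean(axis=0, keepdims=True),
    "camera": lambda x: x.max(axis=0, keepdims=True),
    "radar": lambda x: x.sum(axis=0, keepdims=True),
}

def fuse(inputs):
    """inputs: dict mapping modality name -> (N_points, C) array.
    Extracts features per modality and concatenates along channels,
    so any subset of modalities can be fused without changing the
    rest of the pipeline."""
    feats = [EXTRACTORS[m](x) for m, x in sorted(inputs.items())]
    return np.concatenate(feats, axis=-1)

x = np.ones((5, 4))
print(fuse({"lidar": x, "camera": x}).shape)  # (1, 8)
```

Swapping an extractor is a one-line registry change, which is the flexibility the abstract highlights for lidar-camera, lidar-camera-radar, and camera-radar combinations.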
arXiv Detail & Related papers (2022-09-26T14:33:30Z)
- Uncertainty Guided Depth Fusion for Spike Camera [49.41822923588663]
We propose a novel Uncertainty-Guided Depth Fusion (UGDF) framework to fuse predictions of monocular and stereo depth estimation networks for spike camera.
Our framework is motivated by the fact that stereo spike depth estimation achieves better results at close range.
In order to demonstrate the advantage of spike depth estimation over traditional camera depth estimation, we contribute a spike-depth dataset named CitySpike20K.
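UGDF's fusion is learned end-to-end; as a rough intuition for uncertainty-guided fusion of two depth estimates, here is a classical inverse-variance weighting sketch (our own illustration, not the paper's mechanism):

```python
import numpy as np

def fuse_depth(mono_depth, mono_var, stereo_depth, stereo_var):
    """Fuse two per-pixel depth estimates by inverse-variance
    weighting: the estimate with lower predicted uncertainty
    (variance) dominates at each pixel. Accepts scalars or arrays."""
    w_mono = 1.0 / mono_var
    w_stereo = 1.0 / stereo_var
    return (w_mono * mono_depth + w_stereo * stereo_depth) / (w_mono + w_stereo)

# Stereo is more certain (variance 1 vs 4), so the fused value
# lands close to the stereo estimate.
print(fuse_depth(10.0, 4.0, 8.0, 1.0))  # 8.4
```

This captures the motivation stated above: where stereo is reliable (close range, low variance) it dominates, and monocular predictions take over where stereo uncertainty grows.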
arXiv Detail & Related papers (2022-08-26T13:04:01Z)
- Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking [47.59619420444781]
Approaches to monocular 3D perception, including detection and tracking, often yield inferior performance compared to LiDAR-based techniques.
We propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames for objects (tracklets) to enhance per-object depth estimation.
arXiv Detail & Related papers (2022-06-08T03:37:59Z)
- SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation [101.55622133406446]
We propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets.
arXiv Detail & Related papers (2022-04-07T17:58:47Z)
- CamLessMonoDepth: Monocular Depth Estimation with Unknown Camera Parameters [1.7499351967216341]
Recent advances in monocular depth estimation have shown that such knowledge can be gained from a single camera input by training deep neural networks to predict inverse depth and pose, without requiring ground-truth data.
In this work, we propose a method for implicit estimation of pinhole camera intrinsics along with depth and pose, by learning from monocular image sequences alone.
arXiv Detail & Related papers (2021-10-27T10:54:15Z)
- Geometry-aware data augmentation for monocular 3D object detection [18.67567745336633]
This paper focuses on monocular 3D object detection, one of the essential modules in autonomous driving systems.
A key challenge is that the depth recovery problem is ill-posed in monocular data.
We conduct a thorough analysis to reveal how existing methods fail to robustly estimate depth when different geometry shifts occur.
We convert the aforementioned manipulations into four corresponding 3D-aware data augmentation techniques.
arXiv Detail & Related papers (2021-04-12T23:12:48Z)
- SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving [37.50089104051591]
State-of-the-art self-supervised learning approaches for monocular depth estimation usually suffer from scale ambiguity.
This paper introduces a novel multi-task learning strategy to improve self-supervised monocular distance estimation on fisheye and pinhole camera images.
arXiv Detail & Related papers (2020-08-10T10:52:47Z)
- Video Depth Estimation by Fusing Flow-to-Depth Proposals [65.24533384679657]
We present an approach with a differentiable flow-to-depth layer for video depth estimation.
The model consists of a flow-to-depth layer, a camera pose refinement module, and a depth fusion network.
Our approach outperforms state-of-the-art depth estimation methods and has reasonable cross-dataset generalization capability.
arXiv Detail & Related papers (2019-12-30T10:45:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.