VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view
Attention for Multi-view 3D Object Detection
- URL: http://arxiv.org/abs/2304.01054v1
- Date: Mon, 3 Apr 2023 15:00:36 GMT
- Title: VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view
Attention for Multi-view 3D Object Detection
- Authors: Zhuoling Li, Chuanrui Zhang, Wei-Chiu Ma, Yipin Zhou, Linyan Huang,
Haoqian Wang, SerNam Lim, Hengshuang Zhao
- Abstract summary: transformer-based detectors have demonstrated remarkable performance in 2D visual perception tasks.
However, their performance in multi-view 3D object detection remains inferior to the state-of-the-art (SOTA) of convolutional neural network based detectors.
We propose a novel BEV feature generation method, dual-view attention, which generates attention weights from both the BEV and camera view.
- Score: 47.926010021559314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, transformer-based detectors have demonstrated remarkable
performance in 2D visual perception tasks. However, their performance in
multi-view 3D object detection remains inferior to the state-of-the-art (SOTA)
of convolutional neural network based detectors. In this work, we investigate
this issue from the perspective of bird's-eye-view (BEV) feature generation.
Specifically, we examine the BEV feature generation method employed by the
transformer-based SOTA, BEVFormer, and identify its two limitations: (i) it
only generates attention weights from BEV, which precludes the use of lidar
points for supervision, and (ii) it aggregates camera view features to the BEV
through deformable sampling, which only selects a small subset of features and
fails to exploit all information. To overcome these limitations, we propose a
novel BEV feature generation method, dual-view attention, which generates
attention weights from both the BEV and camera view. This method encodes all
camera features into the BEV feature. By combining dual-view attention with the
BEVFormer architecture, we build a new detector named VoxelFormer. Extensive
experiments are conducted on the nuScenes benchmark to verify the superiority
of dual-view attention and VoxelForer. We observe that even only adopting 3
encoders and 1 historical frame during training, VoxelFormer still outperforms
BEVFormer significantly. When trained in the same setting, VoxelFormer can
surpass BEVFormer by 4.9% NDS point. Code is available at:
https://github.com/Lizhuoling/VoxelFormer-public.git.
Related papers
- DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception [104.87876441265593]
Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space.
Unsupervised domain adaptive BEV, which effective learning from various unlabelled target data, is far under-explored.
We design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features.
arXiv Detail & Related papers (2024-01-13T04:21:24Z) - SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view
3D Object Detection [46.92706423094971]
We propose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out background information according to the semantic segmentation of image features.
We also propose BEV-Paste, an effective data augmentation strategy that closely matches with semantic-aware BEV feature.
Experiments on nuScenes show that SA-BEV achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-21T10:28:19Z) - OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for
Multi-Camera 3D Object Detection [78.38062015443195]
OA-BEV is a network that can be plugged into the BEV-based 3D object detection framework.
Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score.
arXiv Detail & Related papers (2023-01-13T06:02:31Z) - BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks [28.024042528077125]
Bird's-Eye-View (BEV) 3D Object Detection is a crucial multi-view technique for autonomous driving systems.
We propose a novel method named BEV Slice Attention Network (BEV-SAN) for exploiting the intrinsic characteristics of different heights.
arXiv Detail & Related papers (2022-12-02T15:14:48Z) - BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View
Recognition via Perspective Supervision [101.36648828734646]
We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones.
The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset.
arXiv Detail & Related papers (2022-11-18T18:59:48Z) - Delving into the Devils of Bird's-eye-view Perception: A Review,
Evaluation and Recipe [115.31507979199564]
Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention both from industry and academia.
As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view come of vital importance.
The core problems for BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground truth annotations in BEV grid; and (d) how to adapt and generalize algorithms as sensor configurations vary across different scenarios.
arXiv Detail & Related papers (2022-09-12T15:29:13Z) - PersDet: Monocular 3D Detection in Perspective Bird's-Eye-View [26.264139933212892]
Bird's-Eye-View (BEV) is superior to other 3D detectors for autonomous driving and robotics.
transforming image features into BEV necessitates special operators to conduct feature sampling.
We propose detecting objects in perspective BEV -- a new BEV representation that does not require feature sampling.
arXiv Detail & Related papers (2022-08-19T15:19:20Z) - M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified
Birds-Eye View Representation [145.6041893646006]
M$2$BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M$2$BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.