Unifying Voxel-based Representation with Transformer for 3D Object
Detection
- URL: http://arxiv.org/abs/2206.00630v1
- Date: Wed, 1 Jun 2022 17:02:40 GMT
- Title: Unifying Voxel-based Representation with Transformer for 3D Object
Detection
- Authors: Yanwei Li, Yilun Chen, Xiaojuan Qi, Zeming Li, Jian Sun, Jiaya Jia
- Abstract summary: We present a unified framework for multi-modality 3D object detection, named UVTR.
The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection.
UVTR achieves leading performance in the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
- Score: 143.91910747605107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present a unified framework for multi-modality 3D object
detection, named UVTR. The proposed method aims to unify multi-modality
representations in the voxel space for accurate and robust single- or
cross-modality 3D detection. To this end, the modality-specific space is first
designed to represent different inputs in the voxel feature space. Different
from previous work, our approach preserves the voxel space without height
compression to alleviate semantic ambiguity and enable spatial interactions.
Benefiting from the unified manner, cross-modality interaction is then proposed to
make full use of inherent properties from different sensors, including
knowledge transfer and modality fusion. In this way, geometry-aware expressions
in point clouds and context-rich features in images are well utilized for
better performance and robustness. The transformer decoder is applied to
efficiently sample features from the unified space with learnable positions,
which facilitates object-level interactions. In general, UVTR presents an early
attempt to represent different modalities in a unified framework. It surpasses
previous work in single- and multi-modality entries and achieves leading
performance in the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for
LiDAR, camera, and multi-modality inputs, respectively. Code is made available
at https://github.com/dvlab-research/UVTR.
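The abstract outlines the pipeline at a high level: modality-specific voxel features, fusion in a shared voxel space that keeps the height dimension, and a transformer decoder that samples the unified space at learnable positions. Below is a minimal, hedged sketch of that idea in PyTorch; the module names, shapes, additive fusion rule, and single-attention decoder are illustrative assumptions, not the released UVTR code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedVoxelSpace(nn.Module):
    """Fuses per-modality voxel features (B, C, Z, Y, X) by element-wise addition."""
    def forward(self, lidar_vox, cam_vox):
        # Cross-modality fusion; the paper's knowledge-transfer component is omitted here.
        return lidar_vox + cam_vox

class QueryDecoder(nn.Module):
    """Decoder that samples the unified voxel space at learnable 3D reference positions."""
    def __init__(self, num_queries=64, dim=32):
        super().__init__()
        self.ref_points = nn.Parameter(torch.rand(num_queries, 3))   # normalized (x, y, z)
        self.query_embed = nn.Embedding(num_queries, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, voxel_feat):                                   # (B, C, Z, Y, X)
        B = voxel_feat.shape[0]
        # grid_sample expects coordinates in [-1, 1], ordered (x, y, z)
        grid = self.ref_points[None, :, None, None, :].expand(B, -1, -1, -1, -1) * 2 - 1
        sampled = F.grid_sample(voxel_feat, grid, align_corners=False)  # (B, C, Q, 1, 1)
        sampled = sampled.flatten(2).transpose(1, 2)                    # (B, Q, C)
        queries = self.query_embed.weight[None].expand(B, -1, -1)
        out, _ = self.attn(queries, sampled, sampled)                   # object-level interaction
        return out                                                      # (B, Q, C)

if __name__ == "__main__":
    B, C, Z, Y, X = 2, 32, 8, 64, 64
    lidar_vox = torch.randn(B, C, Z, Y, X)   # assumed output of a LiDAR voxel encoder
    cam_vox = torch.randn(B, C, Z, Y, X)     # assumed image features lifted into the same voxel space
    fused = UnifiedVoxelSpace()(lidar_vox, cam_vox)
    print(QueryDecoder(dim=C)(fused).shape)  # torch.Size([2, 64, 32])
```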
Related papers
- FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection [14.457844173630667] (arXiv 2023-09-11)
We propose a novel end-to-end multi-modal fusion transformer-based framework, dubbed FusionFormer.
By developing a uniform sampling strategy, our method can easily sample from 2D image and 3D voxel features simultaneously.
Our method achieves state-of-the-art single model performance of 72.6% mAP and 75.1% NDS in the 3D object detection task without test time augmentation.
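A hedged sketch of the uniform-sampling idea in the entry above: the same 3D reference point gathers features both from a voxel volume (direct 3D lookup) and from a 2D image feature map (via camera projection). The projection convention, names, and shapes are assumptions, not the FusionFormer implementation.

```python
import torch
import torch.nn.functional as F

def sample_voxel(voxel_feat, ref_xyz):
    """voxel_feat: (B, C, Z, Y, X); ref_xyz: (B, N, 3) normalized to [0, 1]."""
    grid = ref_xyz[:, :, None, None, :] * 2 - 1                        # (B, N, 1, 1, 3) in [-1, 1]
    out = F.grid_sample(voxel_feat, grid, align_corners=False)         # (B, C, N, 1, 1)
    return out.flatten(2).transpose(1, 2)                              # (B, N, C)

def sample_image(img_feat, ref_xyz, proj, img_hw):
    """img_feat: (B, C, H, W); proj: (B, 3, 4) assumed camera projection; img_hw: (H, W)."""
    pts = torch.cat([ref_xyz, torch.ones_like(ref_xyz[..., :1])], dim=-1)  # homogeneous (B, N, 4)
    uvd = torch.einsum('bij,bnj->bni', proj, pts)                          # (B, N, 3)
    uv = uvd[..., :2] / uvd[..., 2:3].clamp(min=1e-5)                      # pixel coordinates (u, v)
    H, W = img_hw
    uv = uv / uv.new_tensor([W, H]) * 2 - 1                                # normalize to [-1, 1]
    out = F.grid_sample(img_feat, uv[:, :, None, :], align_corners=False)  # (B, C, N, 1)
    return out.flatten(2).transpose(1, 2)                                  # (B, N, C)

# A query feature could then fuse the two samples, e.g. by addition:
# fused = sample_voxel(vox, ref) + sample_image(img, ref, proj, (H, W))
```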
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752] (arXiv 2023-08-15)
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
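As a rough illustration of the shared-parameter idea in the entry above, the snippet below runs tokens from two modalities through a single transformer layer with one set of weights; the token construction and dimensions are assumptions, not the UniTR architecture.

```python
import torch
import torch.nn as nn

dim = 64
shared_block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

img_tokens = torch.randn(2, 100, dim)    # assumed: flattened image-patch tokens
lidar_tokens = torch.randn(2, 200, dim)  # assumed: pillar/voxel tokens from the point cloud

# One joint sequence, one set of transformer weights for both modalities.
tokens = torch.cat([img_tokens, lidar_tokens], dim=1)
out = shared_block(tokens)
print(out.shape)  # torch.Size([2, 300, 64])
```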
- Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944] (arXiv 2023-04-06)
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
- DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention [50.11672196146829] (arXiv 2022-12-15)
3D object detection with surround-view images is an essential task for autonomous driving.
We propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images.
- Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection [16.198358858773258] (arXiv 2022-10-18)
Multi-modal 3D object detection has been an active research topic in autonomous driving.
It is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels.
Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels.
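The projection-based fusion pattern mentioned above (decorating each LiDAR point with the image feature at its projected pixel) can be sketched as follows; the camera matrix, shapes, and concatenation rule are illustrative assumptions rather than any specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def paint_points(points, img_feat, proj, img_hw):
    """points: (N, 3) LiDAR xyz; img_feat: (C, H, W); proj: (3, 4) assumed camera matrix."""
    pts_h = torch.cat([points, torch.ones_like(points[:, :1])], dim=1)  # homogeneous (N, 4)
    uvd = pts_h @ proj.T                                                # project to image (N, 3)
    uv = uvd[:, :2] / uvd[:, 2:3].clamp(min=1e-5)                       # pixel coordinates (u, v)
    H, W = img_hw
    grid = uv / uv.new_tensor([W, H]) * 2 - 1                           # normalize to [-1, 1]
    sampled = F.grid_sample(img_feat[None], grid[None, :, None, :],
                            align_corners=False)                        # (1, C, N, 1)
    point_img_feat = sampled[0, :, :, 0].T                              # (N, C)
    # Decorate each point with its sampled image feature.
    return torch.cat([points, point_img_feat], dim=1)                   # (N, 3 + C)
```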
- Voxel Field Fusion for 3D Object Detection [140.6941303279114] (arXiv 2022-05-31)
We present a conceptually simple framework for cross-modality 3D object detection, named voxel field fusion.
The proposed approach aims to maintain cross-modality consistency by representing and fusing augmented image features as a ray in the voxel field.
The framework is demonstrated to achieve consistent gains in various benchmarks and outperforms previous fusion-based methods on KITTI and nuScenes datasets.
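A hedged sketch of the "image feature as a ray in the voxel field" idea from the entry above: an image feature is scattered along the camera ray through the voxel grid rather than placed at a single depth. The ray stepping and additive fusion below are simplifications and assumptions, not the released voxel field fusion code.

```python
import torch

def splat_feature_along_ray(voxel_feat, origin, direction, feat, num_steps=32):
    """voxel_feat: (C, Z, Y, X); origin, direction: (3,) in voxel coordinates; feat: (C,)."""
    C, Z, Y, X = voxel_feat.shape
    t = torch.linspace(0, 1, num_steps)                        # samples along the ray segment
    pts = origin[None] + t[:, None] * direction[None]          # (num_steps, 3) as (z, y, x)
    idx = pts.round().long()
    inside = ((idx >= 0) & (idx < idx.new_tensor([Z, Y, X]))).all(dim=1)
    for z, y, x in idx[inside].tolist():
        voxel_feat[:, z, y, x] = voxel_feat[:, z, y, x] + feat  # additive fusion (assumed)
    return voxel_feat

# e.g. a roughly horizontal ray across the grid:
# splat_feature_along_ray(torch.zeros(8, 16, 64, 64),
#                         torch.tensor([0., 32., 0.]), torch.tensor([15., 0., 63.]),
#                         torch.randn(8))
```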
- Weakly Aligned Feature Fusion for Multimodal Object Detection [52.15436349488198] (arXiv 2022-04-21)
Multimodal data often suffer from the position shift problem, i.e., the image pair is not strictly aligned.
This misalignment makes it difficult to fuse multimodal features and complicates convolutional neural network (CNN) training.
In this article, we propose a general multimodal detector named aligned region CNN (AR-CNN) to tackle the position shift problem.
- Stereo RGB and Deeper LIDAR Based Network for 3D Object Detection [40.34710686994996] (arXiv 2020-06-09)
3D object detection is an emerging task in autonomous driving scenarios.
Previous works process 3D point clouds using either projection-based or voxel-based models.
We propose the Stereo RGB and Deeper LIDAR framework which can utilize semantic and spatial information simultaneously.
This list is automatically generated from the titles and abstracts of the papers on this site.