WidthFormer: Toward Efficient Transformer-based BEV View Transformation
- URL: http://arxiv.org/abs/2401.03836v4
- Date: Mon, 15 Jan 2024 15:54:56 GMT
- Title: WidthFormer: Toward Efficient Transformer-based BEV View Transformation
- Authors: Chenhongyi Yang, Tianwei Lin, Lichao Huang and Elliot J. Crowley
- Abstract summary: WidthFormer is computationally efficient, robust and does not require any special engineering effort to deploy.
We propose a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information.
Our model is highly efficient. For example, when using $256\times 704$ input images, it achieves 1.5 ms and 2.8 ms latency on NVIDIA 3090 GPU and Horizon Journey-5 computation solutions, respectively.
- Score: 23.055953867959744
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this work, we present WidthFormer, a novel transformer-based
Bird's-Eye-View (BEV) 3D detection method tailored for real-time
autonomous-driving applications. WidthFormer is computationally efficient,
robust and does not require any special engineering effort to deploy. In this
work, we propose a novel 3D positional encoding mechanism capable of accurately
encapsulating 3D geometric information, which enables our model to generate
high-quality BEV representations with only a single transformer decoder layer.
This mechanism is also beneficial for existing sparse 3D object detectors.
Inspired by the recently-proposed works, we further improve our model's
efficiency by vertically compressing the image features when serving as
attention keys and values. We also introduce two modules to compensate for
potential information loss due to feature compression. Experimental evaluation
on the widely-used nuScenes 3D object detection benchmark demonstrates that our
method outperforms previous approaches across different 3D detection
architectures. More importantly, our model is highly efficient. For example,
when using $256\times 704$ input images, it achieves 1.5 ms and 2.8 ms latency
on NVIDIA 3090 GPU and Horizon Journey-5 computation solutions, respectively.
Furthermore, WidthFormer also exhibits strong robustness to different degrees
of camera perturbations. Our study offers valuable insights into the deployment
of BEV transformation methods in real-world, complex road environments. Code is
available at https://github.com/ChenhongyiYang/WidthFormer .
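The abstract describes BEV queries attending to image features that have been compressed vertically before serving as attention keys and values. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes max-pooling over the height axis as the compression operator, uses single-head scaled dot-product attention without learned projections, and omits the 3D positional encoding and the two compensation modules entirely. All function names (`compress_height`, `cross_attention`) and the toy sizes are hypothetical.

```python
import math

def compress_height(feat):
    """Collapse an H x W x C feature map (nested lists) to W x C.

    Assumption: max-pooling over the height axis stands in for the
    vertical compression mentioned in the abstract.
    """
    H, W, C = len(feat), len(feat[0]), len(feat[0][0])
    return [[max(feat[h][w][c] for h in range(H)) for c in range(C)]
            for w in range(W)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention, no learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

# Toy sizes (hypothetical): a 4 x 6 feature map with 8 channels, 5 BEV queries.
H, W, C, NUM_QUERIES = 4, 6, 8, 5
feat = [[[math.sin(h + 2 * w + 3 * c) for c in range(C)]
         for w in range(W)] for h in range(H)]
bev_queries = [[math.cos(i + c) for c in range(C)] for i in range(NUM_QUERIES)]

width_feat = compress_height(feat)  # W x C: keys and values
bev = cross_attention(bev_queries, width_feat, width_feat)  # NUM_QUERIES x C
```

In this sketch each query attends to W keys instead of H * W, which is where the saving from vertical compression would come from under these assumptions.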
Related papers
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- DualBEV: CNN is All You Need in View Transformation [0.032771631221674334]
Camera-based Bird's-Eye-View (BEV) perception often struggles between adopting 3D-to-2D or 2D-to-3D view transformation (VT).
We propose DualBEV, a unified framework that utilizes a shared CNN-based feature transformation with three probabilistic measurements for both strategies.
Our method achieves state-of-the-art performance without Transformer, delivering comparable efficiency to the LSS approach, with 55.2% mAP and 63.4% NDS on the nuScenes test set.
arXiv Detail & Related papers (2024-03-08T15:58:00Z)
- BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy [58.92659367605442]
We present BEV-IO, a new 3D detection paradigm to enhance BEV representation with instance occupancy information.
We show that BEV-IO can outperform state-of-the-art methods while only adding a negligible increase in parameters and computational overhead.
arXiv Detail & Related papers (2023-05-26T11:16:12Z)
- Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
- Generative Multiplane Neural Radiance for 3D-Aware Image Generation [102.15322193381617]
We present a method to efficiently generate 3D-aware high-resolution images that are view-consistent across multiple target views.
Our GMNR model generates 3D-aware images of 1024 X 1024 pixels with 17.6 FPS on a single V100.
arXiv Detail & Related papers (2023-04-03T17:41:20Z)
- AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection [17.526914782562528]
We propose AutoAlignV2, a faster and stronger multi-modal 3D detection framework, built on top of AutoAlign.
Our best model reaches 72.4 NDS on nuScenes test leaderboard, achieving new state-of-the-art results.
arXiv Detail & Related papers (2022-07-21T06:17:23Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M$^2$BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M$^2$BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)
- RangeRCNN: Towards Fast and Accurate 3D Object Detection with Range Image Representation [35.6155506566957]
RangeRCNN is a novel and effective 3D object detection framework based on the range image representation.
In this paper, we utilize the dilated residual block (DRB) to better adapt to different object scales and obtain a more flexible receptive field.
Experiments show that RangeRCNN achieves state-of-the-art performance on the KITTI dataset and the Waymo Open dataset.
arXiv Detail & Related papers (2020-09-01T03:28:13Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.