Related papers: WidthFormer: Toward Efficient Transformer-based BEV View Transformation

WidthFormer: Toward Efficient Transformer-based BEV View Transformation

URL: http://arxiv.org/abs/2401.03836v5
Date: Tue, 30 Jul 2024 10:36:51 GMT
Title: WidthFormer: Toward Efficient Transformer-based BEV View Transformation
Authors: Chenhongyi Yang, Tianwei Lin, Lichao Huang, Elliot J. Crowley,
Abstract summary: WidthFormer is a transformer-based module to compute Bird's-Eye-View (BEV) representations from multi-view cameras for real-time autonomous-driving applications. We first introduce a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information. We then develop two modules to compensate for potential information loss due to feature compression.
Score: 21.10523575080856
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We present WidthFormer, a novel transformer-based module to compute Bird's-Eye-View (BEV) representations from multi-view cameras for real-time autonomous-driving applications. WidthFormer is computationally efficient, robust and does not require any special engineering effort to deploy. We first introduce a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information, which enables our model to compute high-quality BEV representations with only a single transformer decoder layer. This mechanism is also beneficial for existing sparse 3D object detectors. Inspired by the recently proposed works, we further improve our model's efficiency by vertically compressing the image features when serving as attention keys and values, and then we develop two modules to compensate for potential information loss due to feature compression. Experimental evaluation on the widely-used nuScenes 3D object detection benchmark demonstrates that our method outperforms previous approaches across different 3D detection architectures. More importantly, our model is highly efficient. For example, when using $256\times 704$ input images, it achieves 1.5 ms and 2.8 ms latency on NVIDIA 3090 GPU and Horizon Journey-5 computation solutions. Furthermore, WidthFormer also exhibits strong robustness to different degrees of camera perturbations. Our study offers valuable insights into the deployment of BEV transformation methods in real-world, complex road environments. Code is available at https://github.com/ChenhongyiYang/WidthFormer .

Related papers

EVT: Efficient View Transformation for Multi-Modal 3D Object Detection [2.9848894641223302]
We propose a novel 3D object detector via efficient view transformation (EVT) EVT uses Adaptive Sampling and Adaptive Projection (ASAP) to generate 3D sampling points and adaptive kernels. It is designed to effectively utilize the obtained multi-modal BEV features within the transformer decoder.
arXiv Detail & Related papers (2024-11-16T06:11:10Z)
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation [51.19871052619077]
We introduce Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images. We maintain the fast speed to generate 3D objects within 5 seconds while boosting the training resolution to 512, thereby achieving high-resolution 3D content generation.
arXiv Detail & Related papers (2024-02-07T17:57:03Z)
BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy [58.92659367605442]
We present BEV-IO, a new 3D detection paradigm to enhance BEV representation with instance occupancy information. We show that BEV-IO can outperform state-of-the-art methods while only adding a negligible increase in parameters and computational overhead.
arXiv Detail & Related papers (2023-05-26T11:16:12Z)
Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain. GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors. We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
Generative Multiplane Neural Radiance for 3D-Aware Image Generation [102.15322193381617]
We present a method to efficiently generate 3D-aware high-resolution images that are view-consistent across multiple target views. Our GMNR model generates 3D-aware images of 1024 X 1024 pixels with 17.6 FPS on a single V100.
arXiv Detail & Related papers (2023-04-03T17:41:20Z)
AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection [17.526914782562528]
We propose AutoAlignV2, a faster and stronger multi-modal 3D detection framework, built on top of AutoAlign. Our best model reaches 72.4 NDS on nuScenes test leaderboard, achieving new state-of-the-art results.
arXiv Detail & Related papers (2022-07-21T06:17:23Z)
M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M$2$BEV is a unified framework that jointly performs 3D object detection and map segmentation. M$2$BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)
RangeRCNN: Towards Fast and Accurate 3D Object Detection with Range Image Representation [35.6155506566957]
RangeRCNN is a novel and effective 3D object detection framework based on the range image representation. In this paper, we utilize the dilated residual block (DRB) to better adapt different object scales and obtain a more flexible receptive field. Experiments show that RangeRCNN achieves state-of-the-art performance on the KITTI dataset and the Open dataset.
arXiv Detail & Related papers (2020-09-01T03:28:13Z)
Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras. We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.