DualBEV: CNN is All You Need in View Transformation
- URL: http://arxiv.org/abs/2403.05402v1
- Date: Fri, 8 Mar 2024 15:58:00 GMT
- Title: DualBEV: CNN is All You Need in View Transformation
- Authors: Peidong Li, Wancheng Shen, Qihao Huang and Dixiao Cui
- Abstract summary: Camera-based Bird's-Eye-View (BEV) perception often struggles with the choice between 3D-to-2D and 2D-to-3D view transformation (VT).
We propose DualBEV, a unified framework that utilizes a shared CNN-based feature transformation incorporating three probabilistic measurements for both strategies.
Our method achieves state-of-the-art performance without a Transformer, delivering efficiency comparable to the LSS approach, with 55.2% mAP and 63.4% NDS on the nuScenes test set.
- Score: 0.032771631221674334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Camera-based Bird's-Eye-View (BEV) perception often struggles between
adopting 3D-to-2D or 2D-to-3D view transformation (VT). The 3D-to-2D VT
typically employs a resource-intensive Transformer to establish robust
correspondences between 3D and 2D features, while the 2D-to-3D VT utilizes the
Lift-Splat-Shoot (LSS) pipeline for real-time application, potentially missing
distant information. To address these limitations, we propose DualBEV, a
unified framework that utilizes a shared CNN-based feature transformation
incorporating three probabilistic measurements for both strategies. By
considering dual-view correspondences in one stage, DualBEV effectively bridges
the gap between these strategies, harnessing their individual strengths. Our
method achieves state-of-the-art performance without a Transformer, delivering
efficiency comparable to the LSS approach, with 55.2% mAP and 63.4% NDS on the
nuScenes test set. Code will be released at
https://github.com/PeidongLi/DualBEV.
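
To make the two VT strategies concrete, here is a minimal PyTorch sketch contrasting an LSS-style 2D-to-3D lift with a 3D-to-2D projection lookup. This is not the DualBEV implementation: all module names, shapes, and sampling details are illustrative assumptions.

```python
# Illustrative sketch only -- NOT the DualBEV code. Shapes, names, and the
# sampling scheme are assumptions made for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiftSplat2Dto3D(nn.Module):
    """2D-to-3D VT (LSS-style): predict a per-pixel depth distribution,
    then lift image features along camera rays into a frustum of 3D features."""
    def __init__(self, feat_ch=64, depth_bins=59):
        super().__init__()
        self.depth_head = nn.Conv2d(feat_ch, depth_bins, kernel_size=1)

    def forward(self, img_feat):                          # (B, C, H, W)
        depth = self.depth_head(img_feat).softmax(dim=1)  # (B, D, H, W)
        # Outer product: weight each pixel's feature by its probability
        # at every depth bin -> (B, C, D, H, W); splat this into BEV next.
        return depth.unsqueeze(1) * img_feat.unsqueeze(2)

def project_3d_to_2d(points_cam, intrinsics, img_feat):
    """3D-to-2D VT: project predefined 3D points (assumed already in the
    camera frame) into the image and bilinearly sample features there."""
    uvw = points_cam @ intrinsics.T                       # (N, 3) pinhole
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-5)         # pixel coords
    H, W = img_feat.shape[-2:]                            # img_feat: (1, C, H, W)
    grid = torch.stack([uv[:, 0] / (W - 1),               # normalize to [-1, 1]
                        uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    return F.grid_sample(img_feat, grid.view(1, 1, -1, 2),
                         align_corners=True)              # (1, C, 1, N)
```

The 2D-to-3D branch is dense and depth-driven, which keeps it fast but can dilute distant information across depth bins; the 3D-to-2D branch gathers image features at predefined 3D locations, which is where Transformer-based methods typically spend their compute on attention.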
Related papers
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- WidthFormer: Toward Efficient Transformer-based BEV View Transformation [23.055953867959744]
WidthFormer is computationally efficient, robust and does not require any special engineering effort to deploy.
We propose a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information (a generic sketch of 3D positional encoding appears after this list).
Our model is highly efficient: when using $256\times 704$ input images, it achieves 1.5 ms and 2.8 ms latency on an NVIDIA 3090 GPU and the Horizon Journey-5 computing solution, respectively.
arXiv Detail & Related papers (2024-01-08T11:50:23Z)
- BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy [58.92659367605442]
We present BEV-IO, a new 3D detection paradigm to enhance BEV representation with instance occupancy information.
We show that BEV-IO can outperform state-of-the-art methods while only adding a negligible increase in parameters and computational overhead.
arXiv Detail & Related papers (2023-05-26T11:16:12Z)
- DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets [95.84755169585492]
We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance with a broad range of 3D perception tasks.
arXiv Detail & Related papers (2023-01-15T09:31:58Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M$^2$BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M$^2$BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)
- MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition [1.7448845398590227]
Vision transformer (ViT) has been widely applied in many areas due to its self-attention mechanism.
We propose a robust lightweight pure transformer-based network for multimodal 2D+3D FER, namely MFEViT.
Our MFEViT outperforms state-of-the-art approaches with an accuracy of 90.83% on BU-3DFE and 90.28% on Bosphorus.
arXiv Detail & Related papers (2021-09-20T17:19:39Z)
- Towards Fast, Accurate and Stable 3D Dense Face Alignment [73.01620081047336]
We propose a novel regression framework named 3DDFA-V2 which strikes a balance among speed, accuracy and stability.
We present a virtual synthesis method to transform a still image into a short video that incorporates in-plane and out-of-plane face movement.
arXiv Detail & Related papers (2020-09-21T15:37:37Z)
- RangeRCNN: Towards Fast and Accurate 3D Object Detection with Range Image Representation [35.6155506566957]
RangeRCNN is a novel and effective 3D object detection framework based on the range image representation.
In this paper, we utilize the dilated residual block (DRB) to better adapt to different object scales and obtain a more flexible receptive field.
Experiments show that RangeRCNN achieves state-of-the-art performance on the KITTI dataset and the Waymo Open dataset.
arXiv Detail & Related papers (2020-09-01T03:28:13Z)
- Explainable 3D Convolutional Neural Networks by Learning Temporal Transformations [6.477885112149906]
We introduce the temporally factorized 3D convolution (3TConv) as an interpretable alternative to the regular 3D convolution (3DConv).
In a 3TConv the 3D convolutional filter is obtained by learning a 2D filter and a set of temporal transformation parameters.
We demonstrate that 3TConv learns temporal transformations that afford a direct interpretation (see the factorization sketch after this list).
arXiv Detail & Related papers (2020-06-29T12:29:30Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
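
As referenced in the WidthFormer entry above, one common way to encapsulate 3D geometric information is a sinusoidal positional encoding over 3D coordinates. The sketch below shows only that generic idea; the frequency scheme is an assumption and this is not WidthFormer's specific mechanism.

```python
import torch

def sinusoidal_3d_pe(xyz, num_feats=64, temperature=10000.0):
    """Generic sinusoidal positional encoding for 3D points.
    xyz: (N, 3) coordinates; returns (N, 3 * num_feats) embeddings.
    Illustrative only -- not WidthFormer's actual encoding."""
    dim_t = torch.arange(num_feats // 2, dtype=torch.float32)
    freqs = temperature ** (2 * dim_t / num_feats)      # (num_feats/2,)
    # Encode each axis independently, then concatenate.
    embeds = []
    for axis in range(3):
        pos = xyz[:, axis:axis + 1] / freqs             # (N, num_feats/2)
        embeds.append(torch.cat([pos.sin(), pos.cos()], dim=-1))
    return torch.cat(embeds, dim=-1)                    # (N, 3 * num_feats)

# Example: encode 5 random 3D query positions.
pe = sinusoidal_3d_pe(torch.randn(5, 3) * 50.0)
print(pe.shape)  # torch.Size([5, 192])
```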
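And the factorization sketch referenced from the 3TConv entry: a single learned 2D filter plus per-frame transformation parameters generate the full 3D kernel. The scale-and-shift transform here is a stand-in assumption; the paper learns its own set of temporal transformation parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Factorized3DConv(nn.Module):
    """Temporally factorized 3D convolution in the spirit of 3TConv:
    one learned 2D filter plus per-frame transform parameters build the
    3D kernel. The scale-and-shift transform is an illustrative stand-in."""
    def __init__(self, in_ch, out_ch, k=3, t=3):
        super().__init__()
        self.filter2d = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.scale = nn.Parameter(torch.ones(t))   # temporal transform params
        self.shift = nn.Parameter(torch.zeros(t))

    def forward(self, x):                          # (B, C_in, T, H, W)
        # One transformed copy of the 2D filter per temporal offset:
        # kernel3d has shape (out_ch, in_ch, t, k, k).
        kernel3d = torch.stack([self.filter2d * s + b
                                for s, b in zip(self.scale, self.shift)], dim=2)
        t, k = kernel3d.shape[2], kernel3d.shape[3]
        return F.conv3d(x, kernel3d, padding=(t // 2, k // 2, k // 2))
```

For example, `Factorized3DConv(3, 8)(torch.randn(1, 3, 4, 16, 16))` returns a `(1, 8, 4, 16, 16)` tensor while learning roughly `t` times fewer kernel weights than a full 3D convolution.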
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.