DualBEV: CNN is All You Need in View Transformation
- URL: http://arxiv.org/abs/2403.05402v1
- Date: Fri, 8 Mar 2024 15:58:00 GMT
- Title: DualBEV: CNN is All You Need in View Transformation
- Authors: Peidong Li, Wancheng Shen, Qihao Huang and Dixiao Cui
- Abstract summary: Camera-based Bird's-Eye-View (BEV) perception often struggles between adopting 3D-to-2D or 2D-to-3D view transformation (VT)
We propose DualBEV, a unified framework that utilizes a shared CNN-based feature transformation incorporating three probabilistic measurements for both strategies.
Our method achieves state-of-the-art performance without Transformer, delivering comparable efficiency to the LSS approach, with 55.2% mAP and 63.4% NDS on the nuScenes test set.
- Score: 0.032771631221674334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Camera-based Bird's-Eye-View (BEV) perception often struggles between
adopting 3D-to-2D or 2D-to-3D view transformation (VT). The 3D-to-2D VT
typically employs a resource-intensive Transformer to establish robust
correspondences between 3D and 2D features, while the 2D-to-3D VT utilizes the
Lift-Splat-Shoot (LSS) pipeline for real-time application, potentially missing
distant information. To address these limitations, we propose DualBEV, a
unified framework that utilizes a shared CNN-based feature transformation
incorporating three probabilistic measurements for both strategies. By
considering dual-view correspondences in one stage, DualBEV effectively bridges
the gap between these strategies, harnessing their individual strengths. Our
method achieves state-of-the-art performance without Transformer, delivering
comparable efficiency to the LSS approach, with 55.2% mAP and 63.4% NDS on the
nuScenes test set. Code will be released at
https://github.com/PeidongLi/DualBEV.
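For readers unfamiliar with the two VT families the abstract contrasts, the sketch below illustrates them in isolation: a 3D-to-2D step that projects BEV cell centers into the image and samples features, and a 2D-to-3D, LSS-style step that lifts image features with a predicted depth distribution. This is a minimal illustration only, not DualBEV's implementation; the grid sizes, toy intrinsics, and the height-collapse used in place of real voxel pooling are assumptions made for brevity.

```python
# Minimal sketch of the two view-transformation (VT) strategies contrasted in the
# abstract. Not DualBEV's implementation; all shapes and camera values are toy choices.
import torch
import torch.nn.functional as F

C, H, W = 64, 32, 88                       # image-feature channels / height / width
D, BEV = 48, 100                           # depth bins (LSS-style) and BEV grid size

feat = torch.randn(1, C, H, W)             # per-camera image features (e.g. from a CNN)
K = torch.tensor([[500., 0., W / 2],       # toy pinhole intrinsics in feature-map pixels
                  [0., 500., H / 2],
                  [0., 0., 1.]])

# --- 3D-to-2D VT: project BEV cell centers into the image, then sample features ---
xs = torch.linspace(-50., 50., BEV)                        # lateral positions (m)
zs = torch.linspace(0.1, 100., BEV)                        # forward positions (m)
gx, gz = torch.meshgrid(xs, zs, indexing="ij")
pts = torch.stack([gx, torch.zeros_like(gx), gz], dim=-1)  # BEV centers at ground height
uvw = pts @ K.T                                            # perspective projection
uv = uvw[..., :2] / uvw[..., 2:].clamp(min=1e-5)           # pixel coordinates
uv = torch.stack([uv[..., 0] / (W - 1), uv[..., 1] / (H - 1)], dim=-1) * 2 - 1
bev_from_projection = F.grid_sample(feat, uv.unsqueeze(0), align_corners=True)

# --- 2D-to-3D VT (LSS-style): lift features with a per-pixel depth distribution ---
depth_logits = torch.randn(1, D, H, W)                     # would come from a depth head
depth_prob = depth_logits.softmax(dim=1)
frustum = depth_prob.unsqueeze(1) * feat.unsqueeze(2)      # (1, C, D, H, W) lifted features
# Collapse the image-height axis as a crude stand-in for the LSS "splat" voxel pooling,
# leaving a depth-by-width BEV-like map.
bev_from_lift = frustum.mean(dim=3)                        # (1, C, D, W)

print(bev_from_projection.shape, bev_from_lift.shape)
```

Real LSS implementations scatter each lifted feature into its corresponding BEV voxel using the full camera geometry; the height-collapse above only stands in for that pooling step.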
Related papers
- EVT: Efficient View Transformation for Multi-Modal 3D Object Detection [2.9848894641223302]
We propose a novel 3D object detector via efficient view transformation (EVT).
EVT uses Adaptive Sampling and Adaptive Projection (ASAP) to generate 3D sampling points and adaptive kernels.
It is designed to effectively utilize the obtained multi-modal BEV features within the transformer decoder.
arXiv Detail & Related papers (2024-11-16T06:11:10Z) - Cross-D Conv: Cross-Dimensional Transferable Knowledge Base via Fourier Shifting Operation [3.69758875412828]
Cross-D Conv operation bridges the dimensional gap by learning the phase shifting in the Fourier domain.
Our method enables seamless weight transfer between 2D and 3D convolution operations, effectively facilitating cross-dimensional learning.
arXiv Detail & Related papers (2024-11-02T13:03:44Z) - GS-VTON: Controllable 3D Virtual Try-on with Gaussian Splatting [0.0]
arXiv Detail & Related papers (2024-10-07T17:58:20Z) - Towards Human-Level 3D Relative Pose Estimation: Generalizable, Training-Free, with Single Reference [62.99706119370521]
Humans can easily deduce the relative pose of an unseen object, without label/training, given only a single query-reference image pair.
We propose a novel 3D generalizable relative pose estimation method by elaborating (i) with a 2.5D shape from an RGB-D reference, (ii) with an off-the-shelf differentiable renderer, and (iii) with semantic cues from a pretrained model like DINOv2.
arXiv Detail & Related papers (2024-06-26T16:01:10Z) - Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z) - WidthFormer: Toward Efficient Transformer-based BEV View Transformation [21.10523575080856]
WidthFormer is a transformer-based module to compute Bird's-Eye-View (BEV) representations from multi-view cameras for real-time autonomous-driving applications.
We first introduce a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information.
We then develop two modules to compensate for potential information loss due to feature compression.
arXiv Detail & Related papers (2024-01-08T11:50:23Z) - BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy [58.92659367605442]
We present BEV-IO, a new 3D detection paradigm to enhance BEV representation with instance occupancy information.
We show that BEV-IO can outperform state-of-the-art methods while only adding a negligible increase in parameters and computational overhead.
arXiv Detail & Related papers (2023-05-26T11:16:12Z) - DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets [95.84755169585492]
We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance with a broad range of 3D perception tasks.
arXiv Detail & Related papers (2023-01-15T09:31:58Z) - MVSTER: Epipolar Transformer for Efficient Multi-View Stereo [26.640495084316925]
Learning-based Multi-View Stereo methods warp source images into 3D volumes, which are fused as a cost volume to be regularized by subsequent networks.
Previous methods utilize extra networks to learn 2D information as fusing cues, underusing 3D spatial correlations and bringing additional computation costs.
We present MVSTER, which leverages the proposed epipolar Transformer to learn both 2D semantics and 3D spatial associations efficiently.
arXiv Detail & Related papers (2022-04-15T06:47:57Z) - M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M^2BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M^2BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z) - Epipolar Transformers [39.98487207625999]
A common approach to localize 3D human joints in a synchronized and calibrated multi-view setup consists of two steps: detect joints in 2D per view, then triangulate.
The 2D detector is limited to solving, purely in 2D, challenging cases that could potentially be better resolved in 3D.
We propose the differentiable "epipolar transformer", which enables the 2D detector to leverage 3D-aware features to improve 2D pose estimation.
arXiv Detail & Related papers (2020-05-10T02:22:54Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
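Several of the related papers above (MVSTER, Epipolar Transformers, and the multi-view pose estimation works) build on the same calibrated-camera geometry: pinhole projection of 3D points into each view and, in the other direction, triangulation of 2D detections back to 3D. A minimal, paper-agnostic sketch follows; the camera matrices and the 3D point are made-up toy values.

```python
# Minimal sketch of calibrated pinhole projection and linear (DLT) triangulation,
# the geometric core shared by several multi-view papers listed above.
# All camera matrices and points are toy values, not taken from any of the papers.
import numpy as np

def project(P, X):
    """Project homogeneous 3D points X (N, 4) with a 3x4 camera matrix P."""
    x = X @ P.T                           # (N, 3) homogeneous image points
    return x[:, :2] / x[:, 2:3]           # (N, 2) pixel coordinates

def triangulate(P1, P2, uv1, uv2):
    """DLT triangulation of one point observed at uv1 / uv2 in two calibrated views."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)           # null vector of A is the homogeneous 3D point
    X = Vt[-1]
    return X[:3] / X[3]

# Toy setup: reference camera at the origin, second camera shifted along x.
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.], [0.]])])

X = np.array([[0.2, -0.1, 4.0, 1.0]])     # one 3D point in homogeneous coordinates
uv1, uv2 = project(P1, X)[0], project(P2, X)[0]
print(triangulate(P1, P2, uv1, uv2))      # recovers approximately [0.2, -0.1, 4.0]
```

The papers above differ mainly in how they aggregate image features using this geometry (e.g. along epipolar lines or within cost volumes), not in the projection model itself.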
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.