Multi-Camera Calibration Free BEV Representation for 3D Object Detection
- URL: http://arxiv.org/abs/2210.17252v1
- Date: Mon, 31 Oct 2022 12:18:08 GMT
- Title: Multi-Camera Calibration Free BEV Representation for 3D Object Detection
- Authors: Hongxiang Jiang, Wenming Meng, Hongmei Zhu, Qian Zhang, Jihao Yin
- Abstract summary: We present a completely Multi-Camera Calibration Free Transformer (CFT) for robust Bird's Eye View (BEV) representation.
CFT mines potential 3D information in BEV via our designed position-aware enhancement (PA).
CFT achieves 49.7% NDS on the nuScenes detection task leaderboard and is the first work to remove camera parameters.
- Score: 8.085831393926561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In advanced paradigms of autonomous driving, learning a Bird's Eye View (BEV)
representation from surrounding views is crucial for multi-task frameworks.
However, existing methods based on depth estimation or camera-driven attention
cannot obtain a stable transformation under noisy camera parameters, facing two
main challenges: accurate depth prediction and accurate calibration. In this work,
we present a completely Multi-Camera Calibration Free Transformer (CFT) for
robust BEV representation, which explores an implicit mapping that relies on
neither camera intrinsics nor extrinsics. To guide better feature learning
from image views to BEV, CFT mines potential 3D information in BEV via our
designed position-aware enhancement (PA). Instead of camera-driven point-wise
or global transformation, we propose a view-aware attention that restricts
interaction to more effective regions, reducing redundant computation, lowering
cost, and promoting convergence. CFT achieves 49.7% NDS on the nuScenes
detection task leaderboard; it is the first work to remove camera parameters,
yet remains comparable to geometry-guided methods. Without temporal input or
other modalities, CFT achieves the second-highest performance with a smaller
1600 × 640 image input. Thanks to the view-aware attention variant, CFT reduces
memory and transformer FLOPs relative to vanilla attention by about 12% and 60%,
respectively, while improving NDS by 1.0%. Moreover, its natural robustness to
noisy camera parameters makes CFT all the more competitive.
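The view-aware attention is the abstract's key efficiency claim. Below is a minimal PyTorch sketch of the general idea, restricting each BEV query's cross-attention to a learned subset of camera views rather than all image tokens; the `view_gate` module and the hard 0.5 threshold are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ViewAwareAttention(nn.Module):
    """Each BEV query attends only to image tokens from a learned subset of
    camera views, instead of all V * N tokens as in vanilla cross-attention."""

    def __init__(self, dim: int, num_views: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Hypothetical gate scoring, per BEV query, how relevant each view is.
        self.view_gate = nn.Linear(dim, num_views)

    def forward(self, bev_q: torch.Tensor, view_feats: torch.Tensor):
        # bev_q: (B, Q, C) BEV queries; view_feats: (B, V, N, C) image tokens.
        B, V, N, C = view_feats.shape
        logits = self.view_gate(bev_q)                        # (B, Q, V)
        keep = logits.sigmoid() > 0.5
        # Keep at least the best-scoring view so no query masks everything.
        keep = keep.scatter(-1, logits.argmax(-1, keepdim=True), True)
        # For nn.MultiheadAttention, True in attn_mask means "do not attend".
        mask = ~keep.unsqueeze(-1).expand(-1, -1, -1, N).reshape(B, -1, V * N)
        mask = mask.repeat_interleave(self.num_heads, dim=0)
        kv = view_feats.reshape(B, V * N, C)
        out, _ = self.attn(bev_q, kv, kv, attn_mask=mask)
        return out
```

Masking out most per-view tokens for each query is what would shrink attention computation relative to an all-to-all cross-attention, consistent with the reported FLOPs reduction.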
Related papers
- WidthFormer: Toward Efficient Transformer-based BEV View Transformation [21.10523575080856]
WidthFormer is a transformer-based module to compute Bird's-Eye-View (BEV) representations from multi-view cameras for real-time autonomous-driving applications.
We first introduce a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information.
We then develop two modules to compensate for potential information loss due to feature compression.
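A generic sketch of a sinusoidal 3D positional encoding in that spirit; the frequency schedule and output layout are assumptions for illustration, not WidthFormer's exact mechanism:

```python
import math
import torch

def sinusoidal_3d_pe(xyz: torch.Tensor, num_freqs: int = 32) -> torch.Tensor:
    """Encode 3D points (..., 3) into (..., 3 * 2 * num_freqs) features by
    applying sin/cos at geometrically spaced frequencies per coordinate."""
    freqs = (2.0 ** torch.arange(num_freqs, device=xyz.device)) * math.pi
    angles = xyz.unsqueeze(-1) * freqs                 # (..., 3, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)
```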
arXiv Detail & Related papers (2024-01-08T11:50:23Z)
- CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity [34.025530326420146]
We develop Complementary-BEV, a novel end-to-end monocular 3D object detection framework.
We conduct extensive experiments on the public 3D detection benchmarks of roadside camera-based DAIR-V2X-I and Rope3D.
For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in terms of easy mode.
arXiv Detail & Related papers (2023-10-04T13:38:53Z)
- Multi-camera Bird's Eye View Perception for Autonomous Driving [17.834495597639805]
It is essential to produce perception outputs in 3D to enable the spatial reasoning of other agents and structures.
The most basic approach to achieving the desired BEV representation from a camera image is Inverse Perspective Mapping (IPM), which assumes a flat ground surface, as sketched below.
More recent approaches use deep neural networks to output directly in BEV space.
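For reference, flat-ground IPM reduces to a single homography on the z = 0 plane. A minimal NumPy sketch, assuming known intrinsics K and world-to-camera extrinsics R, t:

```python
import numpy as np

def ipm_homography(K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Homography taking ground-plane coords (X, Y, 1) on z = 0 to image
    pixels: for P = (X, Y, 0), K (R P + t) = K [r1 r2 t] (X, Y, 1)^T."""
    return K @ np.column_stack([R[:, 0], R[:, 1], t])

def pixels_to_ground(px: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Back-project homogeneous pixels (N, 3) to ground (X, Y); this is the
    IPM step, valid only where the flat-ground assumption holds."""
    g = px @ np.linalg.inv(H).T
    return g[:, :2] / g[:, 2:3]
```

Anything above the ground plane violates the assumption and gets smeared in BEV, which is what motivates the learned approaches mentioned above.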
arXiv Detail & Related papers (2023-09-16T19:12:05Z)
- An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection [55.281369497158515]
We propose an efficient transformer for 3D lane detection.
Different from the vanilla transformer, our model contains a cross-attention mechanism to simultaneously learn lane and BEV representations.
Our method obtains 2D and 3D lane predictions by applying the lane features to the image-view and BEV features, respectively.
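One way to read "simultaneously learn lane and BEV representations" is a decoder whose cross-attention serves both query sets in a single pass; the structure below is an assumption for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class JointBEVLaneCrossAttention(nn.Module):
    """BEV queries and lane queries cross-attend to image features in one
    shared attention pass, then are split back into their two streams."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, bev_q, lane_q, img_feats):
        q = torch.cat([bev_q, lane_q], dim=1)     # (B, Qb + Ql, C)
        out, _ = self.cross(q, img_feats, img_feats)
        return out.split([bev_q.shape[1], lane_q.shape[1]], dim=1)
```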
arXiv Detail & Related papers (2023-06-08T04:18:31Z)
- Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline [76.48192454417138]
Bird's-Eye View (BEV) representation is promising as the foundation for next-generation Autonomous Vehicle (AV) perception.
This paper proposes a framework, termed Fast-BEV, which is capable of performing faster BEV perception on the on-vehicle chips.
arXiv Detail & Related papers (2023-01-29T18:43:31Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
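A quadratic-cost illustration of a locality bias on attention logits; the paper's contribution is folding such a prior into linear-complexity attention, and the decay rate gamma here is an arbitrary choice:

```python
import torch

def locality_bias(h: int, w: int, gamma: float = 0.1) -> torch.Tensor:
    """Additive bias for (h*w, h*w) attention logits that decays with the 2D
    distance between patch positions, favoring each token's vicinity."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (HW, 2)
    return -gamma * torch.cdist(pos, pos)   # add to logits before softmax
```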
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [105.96557764248846]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
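Once both modalities live on the same BEV grid, the fusion step itself can be as simple as channel concatenation plus a small convolutional fuser; channel sizes below are illustrative, not BEVFusion's exact configuration:

```python
import torch
import torch.nn as nn

class ConvBEVFuser(nn.Module):
    """Fuse camera-BEV and LiDAR-BEV feature maps that share one grid."""

    def __init__(self, cam_ch: int = 80, lidar_ch: int = 256, out_ch: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor):
        # Both inputs: (B, C_modality, H, W) on the same BEV grid.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))
```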
arXiv Detail & Related papers (2022-05-26T17:59:35Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M^2BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M^2BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)
- BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View [15.560366079077449]
We contribute the BEVDet paradigm for pushing the performance boundary in the multi-camera 3D object detection task.
BEVDet is developed by following the principle of detecting 3D objects in Bird-Eye-View (BEV), where route planning can be handily performed.
The proposed paradigm works well in multi-camera 3D object detection and offers a good trade-off between computing budget and performance.
arXiv Detail & Related papers (2021-12-22T10:48:06Z)
- Robust 2D/3D Vehicle Parsing in CVIS [54.825777404511605]
We present a novel approach to robustly detect and perceive vehicles in different camera views as part of a cooperative vehicle-infrastructure system (CVIS).
Our formulation is designed for arbitrary camera views and makes no assumptions about intrinsic or extrinsic parameters.
In practice, our approach outperforms SOTA methods on 2D detection, instance segmentation, and 6-DoF pose estimation.
arXiv Detail & Related papers (2021-03-11T03:35:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.