BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera
Images via Spatiotemporal Transformers
- URL: http://arxiv.org/abs/2203.17270v1
- Date: Thu, 31 Mar 2022 17:59:01 GMT
- Title: BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera
Images via Spatiotemporal Transformers
- Authors: Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu,
Yu Qiao, Jifeng Dai
- Abstract summary: 3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems.
We present a new framework termed BEVFormer, which learns unified BEV representations with transformers to support multiple autonomous driving perception tasks.
We show that BEVFormer remarkably improves the accuracy of velocity estimation and recall of objects under low visibility conditions.
- Score: 39.253627257740085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D visual perception tasks, including 3D detection and map segmentation based
on multi-camera images, are essential for autonomous driving systems. In this
work, we present a new framework termed BEVFormer, which learns unified BEV
representations with spatiotemporal transformers to support multiple autonomous
driving perception tasks. In a nutshell, BEVFormer exploits both spatial and
temporal information by interacting with spatial and temporal space through
predefined grid-shaped BEV queries. To aggregate spatial information, we design
spatial cross-attention, in which each BEV query extracts spatial features
from regions of interest across camera views. For temporal information, we
propose temporal self-attention, which recurrently fuses the historical BEV
information. Our approach achieves a new state-of-the-art 56.9% NDS on the
nuScenes test set, 9.0 points higher than the previous best methods and on par
with the performance of LiDAR-based baselines. We further show that BEVFormer
remarkably improves velocity estimation accuracy and the recall of objects
under low-visibility conditions. The code will be released at
https://github.com/zhiqi-li/BEVFormer.
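
The mechanism described above lends itself to a compact illustration. Below is a minimal, hedged PyTorch sketch of the three ingredients the abstract names: grid-shaped BEV queries, spatial cross-attention that gathers image features at the queries' projected locations, and temporal self-attention that fuses the previous frame's BEV. The class names, the single-point `grid_sample` lookup, the plain camera averaging, and the vanilla multi-head attention are simplifying assumptions made here; the actual BEVFormer uses multi-point deformable attention and motion-aligned history (see the released code at the repository above).

```python
# Minimal, simplified sketch of a BEVFormer-style encoder layer. Shapes, layer
# sizes, and the sampling scheme are illustrative assumptions, not the paper's
# exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialCrossAttention(nn.Module):
    """Each BEV query gathers image features at its projected 2D location."""

    def __init__(self, dim: int):
        super().__init__()
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, bev_queries, cam_feats, ref_points_2d, valid_mask):
        # bev_queries:   (N_bev, C)
        # cam_feats:     (N_cam, C, H, W) per-camera image feature maps
        # ref_points_2d: (N_cam, N_bev, 2) normalized [-1, 1] projections of BEV cells
        # valid_mask:    (N_cam, N_bev) True where the projection falls inside the image
        sampled = F.grid_sample(
            cam_feats,                      # (N_cam, C, H, W)
            ref_points_2d.unsqueeze(2),     # (N_cam, N_bev, 1, 2)
            align_corners=False,
        ).squeeze(-1).permute(0, 2, 1)      # -> (N_cam, N_bev, C)
        mask = valid_mask.float().unsqueeze(-1)
        # Average over the cameras that actually see each BEV cell.
        hits = valid_mask.float().sum(0).clamp(min=1.0).unsqueeze(-1)
        fused = (sampled * mask).sum(0) / hits          # (N_bev, C)
        return bev_queries + self.out_proj(fused)


class TemporalSelfAttention(nn.Module):
    """Current BEV queries attend to the (aligned) BEV of the previous frame."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, bev_queries, prev_bev):
        # bev_queries, prev_bev: (N_bev, C); history is fused recurrently.
        q = bev_queries.unsqueeze(0)
        kv = torch.cat([prev_bev, bev_queries], dim=0).unsqueeze(0)
        out, _ = self.attn(q, kv, kv)
        return bev_queries + out.squeeze(0)


class BEVEncoderLayer(nn.Module):
    def __init__(self, dim: int = 256, bev_h: int = 50, bev_w: int = 50):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        self.temporal = TemporalSelfAttention(dim)
        self.spatial = SpatialCrossAttention(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))

    def forward(self, cam_feats, ref_points_2d, valid_mask, prev_bev=None):
        q = self.bev_queries
        if prev_bev is not None:
            q = self.temporal(q, prev_bev)
        q = self.spatial(q, cam_feats, ref_points_2d, valid_mask)
        return q + self.ffn(q)   # new BEV features, reused as prev_bev next frame
```

Feeding each frame's output BEV back in as `prev_bev` for the next frame yields the recurrent temporal fusion the abstract describes.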
Related papers
- TempBEV: Improving Learned BEV Encoders with Combined Image and BEV Space Temporal Aggregation [9.723276622743473]
We develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both the image and BEV latent spaces.
Empirical evaluation on the nuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation.
arXiv Detail & Related papers (2024-04-17T23:49:00Z)
- DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception [104.87876441265593]
Camera-only Bird's Eye View (BEV) perception has demonstrated great potential for environment perception in 3D space.
Unsupervised domain-adaptive BEV, which learns effectively from various unlabelled target data, remains far under-explored.
We design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features.
arXiv Detail & Related papers (2024-01-13T04:21:24Z)
- Multi-camera Bird's Eye View Perception for Autonomous Driving [17.834495597639805]
It is essential to produce perception outputs in 3D to enable spatial reasoning about other agents and structures.
The most basic approach to obtaining the desired BEV representation from a camera image is inverse perspective mapping (IPM), which assumes a flat ground surface (a small sketch of IPM follows this entry).
More recent approaches use deep neural networks to output directly in BEV space.
arXiv Detail & Related papers (2023-09-16T19:12:05Z)
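
Since the entry above singles out inverse perspective mapping (IPM) as the most basic camera-to-BEV transform, here is a minimal NumPy sketch of it under the stated flat-ground assumption. The camera intrinsics, the 10-degree downward pitch, and the 1.5 m mounting height are toy placeholder values, not parameters from any of the listed papers.

```python
# Minimal inverse perspective mapping (IPM) sketch: back-project image pixels
# onto the flat ground plane Z = 0. The camera geometry below is a toy example.
import numpy as np

fx = fy = 1000.0
cx, cy = 640.0, 360.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

pitch = np.deg2rad(10.0)   # camera tilted 10 degrees toward the ground
height = 1.5               # camera height above the ground plane, in metres

# World frame: X right, Y forward, Z up; camera frame: x right, y down, z forward.
s, c = np.sin(pitch), np.cos(pitch)
R = np.array([[1.0, 0.0, 0.0],
              [0.0,  -s,  -c],
              [0.0,   c,  -s]])            # world -> camera rotation
t = -R @ np.array([0.0, 0.0, height])      # camera centre sits at (0, 0, height)

# For ground-plane points (Z = 0) the projection collapses to a homography:
#   w * [u, v, 1]^T = K [r1 r2 t] [X, Y, 1]^T
H = K @ np.column_stack((R[:, 0], R[:, 1], t))
H_inv = np.linalg.inv(H)

def pixel_to_ground(u, v):
    """Return the (X, Y) ground coordinates hit by the ray through pixel (u, v)."""
    X, Y, w = H_inv @ np.array([u, v, 1.0])
    return X / w, Y / w

# A pixel below the horizon maps to a point a few metres in front of the camera.
print(pixel_to_ground(640.0, 600.0))
```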
- SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection [46.92706423094971]
We propose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out background information according to the semantic segmentation of image features (a rough sketch of this masking idea follows this entry).
We also propose BEV-Paste, an effective data augmentation strategy that closely matches the semantic-aware BEV features.
Experiments on nuScenes show that SA-BEV achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-21T10:28:19Z)
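
The SA-BEV entry above describes filtering background out of image features with semantic segmentation before they are pooled into BEV. The snippet below is a rough sketch of that masking idea only; the weighting scheme, the background-class index, and the threshold are assumptions made here and are not the actual SA-BEVPool operator.

```python
# Rough sketch of semantic-aware feature masking before BEV pooling: down-weight
# pixels that a segmentation head considers background so they contribute little
# to the pooled BEV feature. Threshold and weighting are illustrative choices.
import torch

def semantic_aware_mask(image_feats: torch.Tensor,
                        seg_logits: torch.Tensor,
                        bg_index: int = 0,
                        keep_threshold: float = 0.3) -> torch.Tensor:
    """
    image_feats: (B, C, H, W) per-camera image features.
    seg_logits:  (B, num_classes, H, W) segmentation logits (class 0 = background).
    Returns features with background pixels suppressed.
    """
    probs = seg_logits.softmax(dim=1)
    foreground = 1.0 - probs[:, bg_index]              # (B, H, W) foreground probability
    weight = (foreground > keep_threshold).float() * foreground
    return image_feats * weight.unsqueeze(1)            # broadcast over channels

# Toy usage: 2 cameras, 64 channels, 32x88 feature map, 10 semantic classes.
feats = torch.randn(2, 64, 32, 88)
logits = torch.randn(2, 10, 32, 88)
masked = semantic_aware_mask(feats, logits)
print(masked.shape)   # masked features would then be pooled into the BEV grid
```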
- OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection [29.530177591608297]
Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost.
Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm.
We propose an Object-Centric query-BEV detector OCBEV, which can carve the temporal and spatial cues of moving targets more effectively.
arXiv Detail & Related papers (2023-06-02T17:59:48Z)
- BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation [14.606324706328106]
We propose a dual-branch framework that generates LiDAR and camera BEV features and then performs adaptive modality fusion.
A LiDAR-Guided View Transformer (LGVT) is designed to effectively obtain the camera representation in BEV space.
Our framework dubbed BEVFusion4D achieves state-of-the-art results in 3D object detection.
arXiv Detail & Related papers (2023-03-30T02:18:07Z)
- BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios [51.285561119993105]
We present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder in learning feature representations (a rough illustration of this idea follows this entry).
We introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder.
arXiv Detail & Related papers (2022-12-12T08:15:03Z)
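
The BEV-MAE entry above mentions a BEV-guided masking strategy for masked-autoencoder pre-training on LiDAR point clouds. One plausible reading is to mask whole occupied BEV grid cells rather than individual points; the sketch below illustrates that reading only, with cell size and mask ratio chosen arbitrarily rather than taken from the paper.

```python
# Hedged illustration of BEV-guided masking for point-cloud MAE pre-training:
# drop entire bird's-eye-view grid cells of points instead of random points.
# Cell size and mask ratio are illustrative assumptions, not the paper's settings.
import numpy as np

def bev_guided_mask(points, cell_size=1.0, mask_ratio=0.7, rng=None):
    """
    points: (N, 3+) LiDAR points with x, y in the first two columns.
    Returns (visible_points, masked_points) after hiding a fraction of occupied BEV cells.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Assign each point to a BEV grid cell by quantizing its (x, y) coordinates.
    cells = np.floor(points[:, :2] / cell_size).astype(np.int64)
    _, cell_ids = np.unique(cells, axis=0, return_inverse=True)
    n_cells = cell_ids.max() + 1
    # Randomly mask a fixed fraction of the occupied cells.
    n_masked = int(mask_ratio * n_cells)
    masked_cells = rng.choice(n_cells, size=n_masked, replace=False)
    is_masked = np.isin(cell_ids, masked_cells)
    return points[~is_masked], points[is_masked]

# Toy usage: 10k random points in a 100 m x 100 m area.
pts = np.random.default_rng(1).uniform(-50, 50, size=(10_000, 4))
visible, masked = bev_guided_mask(pts)
print(visible.shape, masked.shape)
```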
- Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe [115.31507979199564]
Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention both from industry and academia.
As sensor configurations become more complex, integrating multi-source information from different sensors and representing features in a unified view are of vital importance.
The core problems for BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground-truth annotations in the BEV grid; and (c) how to adapt and generalize algorithms as sensor configurations vary across different scenarios.
arXiv Detail & Related papers (2022-09-12T15:29:13Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.