BEVerse: Unified Perception and Prediction in Birds-Eye-View for
Vision-Centric Autonomous Driving
- URL: http://arxiv.org/abs/2205.09743v1
- Date: Thu, 19 May 2022 17:55:35 GMT
- Authors: Yunpeng Zhang, Zheng Zhu, Wenzhao Zheng, Junjie Huang, Guan Huang, Jie
Zhou, Jiwen Lu
- Abstract summary: We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
- Score: 92.05963633802979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present BEVerse, a unified framework for 3D perception and
prediction based on multi-camera systems. Unlike existing studies that focus on
improving single-task approaches, BEVerse produces spatio-temporal
Birds-Eye-View (BEV) representations from multi-camera videos and jointly
reasons about multiple tasks for vision-centric autonomous driving.
Specifically, BEVerse first performs shared feature extraction and lifting to
generate 4D BEV representations from multi-timestamp and multi-view images.
After ego-motion alignment, a spatio-temporal encoder performs further feature
extraction in BEV. Finally, multiple task decoders are attached for joint
reasoning and prediction. Within the decoders, we propose a grid sampler to
generate BEV features with different ranges and granularities for different
tasks. We also design an iterative-flow method for memory-efficient future
prediction. We show that temporal information improves 3D object detection and
semantic map construction, while multi-task learning implicitly benefits motion
prediction. With extensive experiments on the nuScenes dataset, we show that
the multi-task BEVerse outperforms existing single-task methods on 3D object
detection, semantic map construction, and motion prediction. Compared with the
sequential paradigm, BEVerse also offers significantly improved efficiency. The
code and trained models will be released at https://github.com/zhangyp15/BEVerse.
Related papers
- HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras [45.739224968302565]
We present an end-to-end framework named HENet for multi-task 3D perception.
Specifically, we propose a hybrid image encoding network, using a large image encoder for short-term frames and a small image encoder for long-term temporal frames.
According to the characteristics of each perception task, we utilize BEV features of different grid sizes, independent BEV encoders, and task decoders for different tasks.
arXiv Detail & Related papers (2024-04-03T07:10:18Z)
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z)
- Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z)
- OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection [29.530177591608297]
Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost.
Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm.
We propose OCBEV, an Object-Centric query-BEV detector that captures the temporal and spatial cues of moving targets more effectively.
arXiv Detail & Related papers (2023-06-02T17:59:48Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M^2BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M^2BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)