TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint
Perception and Prediction in Vision-Centric Autonomous Driving
- URL: http://arxiv.org/abs/2303.09998v2
- Date: Wed, 22 Mar 2023 13:58:12 GMT
- Title: TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint
Perception and Prediction in Vision-Centric Autonomous Driving
- Authors: Shaoheng Fang, Zi Wang, Yiqi Zhong, Junhao Ge, Siheng Chen, Yanfeng
Wang
- Abstract summary: Vision-centric joint perception and prediction has become an emerging trend in autonomous driving research.
It predicts the future states of the participants in the surrounding environment from raw RGB images.
It is still a critical challenge to synchronize features obtained at multiple camera views and timestamps.
- Score: 45.785865869298576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-centric joint perception and prediction (PnP) has become an emerging
trend in autonomous driving research. It predicts the future states of the
traffic participants in the surrounding environment from raw RGB images.
However, it remains a critical challenge to synchronize features obtained from
multiple camera views and timestamps, due to inevitable geometric distortions,
and to further exploit those spatial-temporal features. To address this issue, we
propose a temporal bird's-eye-view pyramid transformer (TBP-Former) for
vision-centric PnP, which includes two novel designs. First, a
pose-synchronized BEV encoder is proposed to map raw image inputs with any
camera pose at any time to a shared and synchronized BEV space for better
spatial-temporal synchronization. Second, a spatial-temporal pyramid
transformer is introduced to comprehensively extract multi-scale BEV features
and predict future BEV states with the support of spatial-temporal priors.
Extensive experiments on the nuScenes dataset show that our proposed framework
outperforms state-of-the-art vision-based prediction methods overall.
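The abstract describes a two-stage data flow: a pose-synchronized BEV encoder that lifts multi-view images from each timestamp into a shared BEV frame, followed by a spatial-temporal transformer that predicts future BEV states. Below is a minimal, hypothetical PyTorch sketch of that flow; the module names, the crude image-to-BEV projection, the 2D affine pose warp, and the single-scale attention (standing in for the multi-scale pyramid) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a TBP-Former-style pipeline; shapes and ops are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoseSyncBEVEncoder(nn.Module):
    """Maps multi-view images at one timestamp to a BEV feature map,
    then warps it into the current ego frame using the relative pose."""

    def __init__(self, img_channels=3, bev_channels=64, bev_size=50):
        super().__init__()
        self.bev_size = bev_size
        # Placeholder image backbone; a real system would use a deep 2D backbone
        # plus a geometric view transform instead of average pooling.
        self.backbone = nn.Conv2d(img_channels, bev_channels, 3, padding=1)

    def forward(self, imgs, rel_pose):
        # imgs: (B, N_cams, 3, H, W); rel_pose: (B, 2, 3) 2D affine ego motion.
        B, N, C, H, W = imgs.shape
        feats = self.backbone(imgs.flatten(0, 1))          # (B*N, C', H, W)
        feats = feats.view(B, N, -1, H, W).mean(dim=1)      # fuse camera views
        bev = F.adaptive_avg_pool2d(feats, self.bev_size)   # crude BEV projection
        # Synchronize: warp the past BEV feature into the current ego frame.
        grid = F.affine_grid(rel_pose, bev.shape, align_corners=False)
        return F.grid_sample(bev, grid, align_corners=False)


class SpatialTemporalPredictor(nn.Module):
    """Attends over the synchronized BEV sequence and emits future BEV states.
    The multi-scale pyramid is collapsed to a single scale here for brevity."""

    def __init__(self, bev_channels=64, n_future=4, n_heads=4):
        super().__init__()
        self.n_future = n_future
        self.attn = nn.MultiheadAttention(bev_channels, n_heads, batch_first=True)
        self.future_queries = nn.Parameter(torch.randn(n_future, bev_channels))

    def forward(self, bev_seq):
        # bev_seq: (B, T, C, Hb, Wb) pose-synchronized BEV features.
        B, T, C, Hb, Wb = bev_seq.shape
        tokens = bev_seq.permute(0, 1, 3, 4, 2).reshape(B, T * Hb * Wb, C)
        queries = self.future_queries.unsqueeze(0).expand(B, -1, -1)
        future, _ = self.attn(queries, tokens, tokens)       # (B, n_future, C)
        # Broadcast each future token over the BEV grid as a coarse future state.
        return future.unsqueeze(-1).unsqueeze(-1).expand(B, self.n_future, C, Hb, Wb)


if __name__ == "__main__":
    B, T, N = 1, 3, 6                                  # 3 past frames, 6 cameras
    imgs = torch.randn(B, T, N, 3, 64, 64)
    rel_pose = torch.eye(2, 3).repeat(B, 1, 1)         # identity ego motion
    encoder = PoseSyncBEVEncoder()
    predictor = SpatialTemporalPredictor()
    bev_seq = torch.stack([encoder(imgs[:, t], rel_pose) for t in range(T)], dim=1)
    future_bev = predictor(bev_seq)
    print(future_bev.shape)                            # torch.Size([1, 4, 64, 50, 50])
```

The key design point this sketch illustrates is that temporal fusion happens only after every frame's features have been warped into one shared ego-centric BEV frame, so the predictor never has to compensate for camera or ego motion itself.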
Related papers
- DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation [50.01520547454224]
Current generative models struggle to synthesize 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS).
We propose DiST-4D, which disentangles the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency.
Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
arXiv Detail & Related papers (2025-03-19T13:49:48Z) - Doracamom: Joint 3D Detection and Occupancy Prediction with Multi-view 4D Radars and Cameras for Omnidirectional Perception [9.76463525667238]
We propose Doracamom, the first framework that fuses multi-view cameras and 4D radar for joint 3D object detection and semantic occupancy prediction.
Code and models will be publicly available.
arXiv Detail & Related papers (2025-01-26T04:24:07Z) - Epipolar Attention Field Transformers for Bird's Eye View Semantic Segmentation [26.245188807280684]
This paper addresses the dependency on learned positional encodings to correlate image and BEV feature map elements for transformer-based methods.
We propose leveraging epipolar geometric constraints to model the relationship between cameras and the BEV by Epipolar Attention Fields.
Experiments show that our method EAFormer outperforms previous BEV approaches by 2% mIoU for map semantic segmentation.
arXiv Detail & Related papers (2024-12-02T15:15:10Z) - Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction [60.964512894143475]
We present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction.
Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction.
arXiv Detail & Related papers (2024-10-24T17:58:05Z) - BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents [56.33989853438012]
We propose BEVWorld, a framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View latent space for holistic environment modeling.
The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model.
arXiv Detail & Related papers (2024-07-08T07:26:08Z) - TempBEV: Improving Learned BEV Encoders with Combined Image and BEV Space Temporal Aggregation [9.723276622743473]
We develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces.
Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation.
arXiv Detail & Related papers (2024-04-17T23:49:00Z) - Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z) - ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal
Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously.
To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z) - BEVerse: Unified Perception and Prediction in Birds-Eye-View for
Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z) - GitNet: Geometric Prior-based Transformation for Birds-Eye-View
Segmentation [105.19949897812494]
Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving.
We present a novel two-stage Geometry Prior-based Transformation framework named GitNet.
arXiv Detail & Related papers (2022-04-16T06:46:45Z) - BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera
Images via Spatiotemporal Transformers [39.253627257740085]
3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems.
We present a new framework termed BEVFormer, which learns unified BEV representations with transformers to support multiple autonomous driving perception tasks.
We show that BEVFormer remarkably improves the accuracy of velocity estimation and recall of objects under low visibility conditions.
arXiv Detail & Related papers (2022-03-31T17:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.