CASPFormer: Trajectory Prediction from BEV Images with Deformable
Attention
- URL: http://arxiv.org/abs/2409.17790v1
- Date: Thu, 26 Sep 2024 12:37:22 GMT
- Title: CASPFormer: Trajectory Prediction from BEV Images with Deformable
Attention
- Authors: Harsh Yadav, Maximilian Schaefer, Kun Zhao, and Tobias Meisen
- Abstract summary: We propose Context Aware Scene Prediction Transformer (CASPFormer), which can perform multi-modal motion prediction from rasterized Bird-Eye-View (BEV) images.
Our system can be integrated with any upstream perception module that is capable of generating BEV images.
We evaluate our model on the nuScenes dataset and show that it reaches state-of-the-art across multiple metrics.
- Score: 4.9349065371630045
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Motion prediction is an important aspect of Autonomous Driving (AD) and
Advanced Driver Assistance Systems (ADAS). Current state-of-the-art motion
prediction methods rely on High Definition (HD) maps for capturing the
surrounding context of the ego vehicle. Such systems lack scalability in
real-world deployment as HD maps are expensive to produce and update in
real-time. To overcome this issue, we propose Context Aware Scene Prediction
Transformer (CASPFormer), which can perform multi-modal motion prediction from
rasterized Bird-Eye-View (BEV) images. Our system can be integrated with any
upstream perception module that is capable of generating BEV images. Moreover,
CASPFormer directly decodes vectorized trajectories without any postprocessing.
Trajectories are decoded recurrently using deformable attention, as it is
computationally efficient and provides the network with the ability to focus
its attention on the important spatial locations of the BEV images. In
addition, we address the issue of mode collapse when generating multiple
scene-consistent trajectories by incorporating learnable mode queries. We
evaluate our model on the nuScenes dataset and show that it reaches
state-of-the-art across multiple metrics.
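To make the decoding scheme above concrete, the following is a minimal sketch of recurrent trajectory decoding with deformable attention over a BEV feature map and learnable mode queries. It is a hypothetical PyTorch illustration, not the authors' implementation: the module names, the single-level single-head attention, the GRU-based recurrent update, and all dimensions are assumptions made for readability.

```python
# Hypothetical, simplified sketch; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableBEVAttention(nn.Module):
    """Each query samples the BEV feature map at a few predicted offsets
    around its reference point and aggregates the samples with learned
    weights, so compute scales with the number of sampling points rather
    than with the full H*W map."""

    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)   # (dx, dy) per point
        self.weight_head = nn.Linear(dim, num_points)       # per-point weights
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, bev):
        # queries: (B, Q, C), ref_points: (B, Q, 2) in [-1, 1], bev: (B, C, H, W)
        B, Q, _ = queries.shape
        offsets = self.offset_head(queries).view(B, Q, self.num_points, 2)
        offsets = 0.1 * torch.tanh(offsets)                      # keep samples local
        weights = self.weight_head(queries).softmax(dim=-1)      # (B, Q, P)
        locs = (ref_points.unsqueeze(2) + offsets).clamp(-1, 1)  # (B, Q, P, 2)
        sampled = F.grid_sample(bev, locs, align_corners=False)  # (B, C, Q, P)
        sampled = sampled.permute(0, 2, 3, 1)                    # (B, Q, P, C)
        return self.out_proj((weights.unsqueeze(-1) * sampled).sum(dim=2))


class RecurrentTrajectoryDecoder(nn.Module):
    """Learnable mode queries (one per output mode, to counter mode collapse)
    are refined over a few recurrent steps; every step re-attends to the BEV
    map and directly emits a full vectorized trajectory per mode."""

    def __init__(self, dim=128, num_modes=6, horizon=12, steps=3):
        super().__init__()
        self.mode_queries = nn.Parameter(torch.randn(num_modes, dim))
        self.attn = DeformableBEVAttention(dim)
        self.update = nn.GRUCell(dim, dim)
        self.traj_head = nn.Linear(dim, horizon * 2)         # (x, y) waypoints
        self.horizon = horizon
        self.steps = steps

    def forward(self, bev):
        B = bev.size(0)
        q = self.mode_queries.unsqueeze(0).expand(B, -1, -1).contiguous()
        ref = torch.zeros(B, q.size(1), 2, device=bev.device)   # start at ego (BEV center)
        trajs = None
        for _ in range(self.steps):
            ctx = self.attn(q, ref, bev)
            q = self.update(ctx.flatten(0, 1), q.flatten(0, 1)).view_as(q)
            # Waypoints assumed to live in normalized BEV coordinates [-1, 1]
            trajs = self.traj_head(q).view(B, -1, self.horizon, 2)
            ref = trajs[:, :, -1, :].clamp(-1, 1)                # refocus near endpoint
        return trajs                                             # (B, num_modes, horizon, 2)


# Example usage with a random BEV feature map from any upstream perception module
decoder = RecurrentTrajectoryDecoder(dim=128)
bev_features = torch.randn(2, 128, 100, 100)
trajectories = decoder(bev_features)                             # shape: (2, 6, 12, 2)
```

In this sketch each mode query samples the BEV features only at a handful of predicted offsets around its reference point, which is what keeps deformable attention cheap compared to dense attention over every BEV cell; moving the reference point to the previously decoded endpoint is one plausible way to let later iterations focus on more distant parts of the scene.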
Related papers
- BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space [57.68134574076005]
We present BEVWorld, a novel approach that tokenizes multimodal sensor inputs into a unified and compact Bird's Eye View latent space for environment modeling.
Experiments demonstrate the effectiveness of BEVWorld in autonomous driving tasks, showcasing its capability in generating future scenes and benefiting downstream tasks such as perception and motion prediction.
arXiv Detail & Related papers (2024-07-08T07:26:08Z)
- Optimizing Ego Vehicle Trajectory Prediction: The Graph Enhancement Approach [1.3931837019950217]
We advocate for the use of Bird's Eye View perspectives, which offer unique advantages in capturing spatial relationships and object homogeneity.
In our work, we leverage Graph Neural Networks (GNNs) and positional encoding to represent objects in a BEV, achieving competitive performance compared to traditional methods.
arXiv Detail & Related papers (2023-12-20T15:22:34Z)
- Context-Aware Timewise VAEs for Real-Time Vehicle Trajectory Prediction [4.640835690336652]
We present ContextVAE, a context-aware approach for multi-modal vehicle trajectory prediction.
Our approach takes into account both the social features exhibited by agents on the scene and the physical environment constraints.
In all tested datasets, ContextVAE models are fast to train and provide high-quality multi-modal predictions in real-time.
arXiv Detail & Related papers (2023-02-21T18:42:24Z)
- Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z)
- Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward, fully self-supervised framework for policy pretraining in visuomotor driving.
We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns the driving policy representation by predicting future ego-motion from the current visual observation only and is optimized with the photometric error (a generic sketch of this loss appears after the list).
arXiv Detail & Related papers (2023-01-03T08:52:49Z)
- BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios [51.285561119993105]
We present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder learning feature representation.
We introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder.
arXiv Detail & Related papers (2022-12-12T08:15:03Z)
- BEV-Locator: An End-to-end Visual Semantic Localization Network Using Multi-View Images [13.258689143949912]
We propose an end-to-end visual semantic localization neural network using multi-view camera images.
BEV-Locator is capable of estimating vehicle poses under versatile scenarios.
Experiments report satisfactory accuracy, with mean absolute errors of 0.052 m, 0.135 m, and 0.251° in lateral translation, longitudinal translation, and heading angle, respectively.
arXiv Detail & Related papers (2022-11-27T20:24:56Z)
- Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers [39.253627257740085]
3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems.
We present a new framework termed BEVFormer, which learns unified BEV representations with transformers to support multiple autonomous driving perception tasks.
We show that BEVFormer remarkably improves the accuracy of velocity estimation and recall of objects under low visibility conditions.
arXiv Detail & Related papers (2022-03-31T17:59:01Z)
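For the PPGeo entry above, the second-stage objective is the standard self-supervised photometric reconstruction error from monocular depth and ego-motion learning. The sketch below is a generic, simplified version of that loss under assumed conventions (pinhole intrinsics K of shape (B, 3, 3), depth of shape (B, 1, H, W), relative pose as a (B, 3, 4) [R|t] matrix, plain L1 penalty); the function names are illustrative and nothing here is taken from the paper's code.

```python
# Generic sketch of a self-supervised photometric reconstruction loss
# (SfM-style inverse warping); an assumed illustration, not PPGeo's code.
import torch
import torch.nn.functional as F


def warp_source_to_target(source, depth, pose, K):
    """Back-project target pixels with the predicted depth, move them with the
    predicted relative pose [R|t], reproject into the source view, and
    bilinearly sample the source image there (inverse warping)."""
    B, _, H, W = source.shape
    device = source.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(3, -1)  # (3, H*W)
    cam = (torch.inverse(K) @ pix) * depth.view(B, 1, -1)                # (B, 3, H*W)
    R, t = pose[:, :, :3], pose[:, :, 3:]                                # pose: (B, 3, 4)
    cam_src = R @ cam + t                                                # points in source frame
    proj = K @ cam_src
    xy = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                       # (B, 2, H*W)
    xn = 2.0 * xy[:, 0] / (W - 1) - 1.0                                  # normalize to [-1, 1]
    yn = 2.0 * xy[:, 1] / (H - 1) - 1.0
    grid = torch.stack([xn, yn], dim=-1).view(B, H, W, 2)
    return F.grid_sample(source, grid, padding_mode="border", align_corners=True)


def photometric_loss(target, source, depth, pose, K):
    """L1 difference between the target frame and the source frame warped into
    the target view; its gradient drives the depth and pose predictions."""
    return (warp_source_to_target(source, depth, pose, K) - target).abs().mean()
```

Practical implementations typically combine this L1 term with an SSIM term and a depth smoothness regularizer; those are omitted here to keep the sketch short.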
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.