Does End-to-End Autonomous Driving Really Need Perception Tasks?
- URL: http://arxiv.org/abs/2409.18341v1
- Date: Thu, 26 Sep 2024 23:30:48 GMT
- Title: Does End-to-End Autonomous Driving Really Need Perception Tasks?
- Authors: Peidong Li, Dixiao Cui,
- Abstract summary: We introduce SSR, a novel framework that utilizes only 16 navigation-guided tokens as Sparse Scene Representation.
Our method eliminates the need for supervised sub-tasks, allowing computational resources to concentrate on essential elements related to navigation intent.
SSR achieves state-of-the-art planning performance on the nuScenes dataset, demonstrating a 27.2% relative reduction in L2 error and a 51.6% decrease in collision rate to the leading E2EAD method, UniAD.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-End Autonomous Driving (E2EAD) methods typically rely on supervised perception tasks to extract explicit scene information (e.g., objects, maps). This reliance necessitates expensive annotations and constrains deployment and data scalability in real-time applications. In this paper, we introduce SSR, a novel framework that utilizes only 16 navigation-guided tokens as Sparse Scene Representation, efficiently extracting crucial scene information for E2EAD. Our method eliminates the need for supervised sub-tasks, allowing computational resources to concentrate on essential elements directly related to navigation intent. We further introduce a temporal enhancement module that employs a Bird's-Eye View (BEV) world model, aligning predicted future scenes with actual future scenes through self-supervision. SSR achieves state-of-the-art planning performance on the nuScenes dataset, demonstrating a 27.2\% relative reduction in L2 error and a 51.6\% decrease in collision rate to the leading E2EAD method, UniAD. Moreover, SSR offers a 10.9$\times$ faster inference speed and 13$\times$ faster training time. This framework represents a significant leap in real-time autonomous driving systems and paves the way for future scalable deployment. Code will be released at \url{https://github.com/PeidongLi/SSR}.
Related papers
- End-to-End Autonomous Driving without Costly Modularization and 3D Manual Annotation [34.070813293944944]
We propose UAD, a method for vision-based end-to-end autonomous driving (E2EAD)
Our motivation stems from the observation that current E2EAD models still mimic the modular architecture in typical driving stacks.
Our UAD achieves 38.7% relative improvements over UniAD on the average collision rate in nuScenes and surpasses VAD for 41.32 points on the driving score in CARLA's Town05 Long benchmark.
arXiv Detail & Related papers (2024-06-25T16:12:52Z) - DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce emphcentricDriveWorld, which is capable of pre-training from multi-camera driving videos in atemporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z) - Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in
nuScenes [38.43491956142818]
Planning task involves predicting the trajectory of the ego vehicle based on inputs from both internal intention and the external environment.
Most existing works evaluate their performance on the nuScenes dataset using the L2 error and collision rate between the predicted trajectories and the ground truth.
In this paper, we reevaluate these existing evaluation metrics and explore whether they accurately measure the superiority of different methods.
Our simple method achieves similar end-to-end planning performance on the nuScenes dataset with other perception-based methods, reducing the average L2 error by about 20%.
arXiv Detail & Related papers (2023-05-17T17:59:11Z) - VAD: Vectorized Scene Representation for Efficient Autonomous Driving [44.070636456960045]
VAD is an end-to-end vectorized paradigm for autonomous driving.
VAD exploits the vectorized agent motion and map elements as explicit instance-level planning constraints.
VAD runs much faster than previous end-to-end planning methods.
arXiv Detail & Related papers (2023-03-21T17:59:22Z) - Policy Pre-training for End-to-end Autonomous Driving via
Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving.
We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only.
arXiv Detail & Related papers (2023-01-03T08:52:49Z) - ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal
Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously.
To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z) - LiMoSeg: Real-time Bird's Eye View based LiDAR Motion Segmentation [8.184561295177623]
This paper proposes a novel real-time architecture for motion segmentation of Light Detection and Ranging (LiDAR) data.
We use two successive scans of LiDAR data in 2D Bird's Eye View representation to perform pixel-wise classification as static or moving.
We demonstrate a low latency of 8 ms on a commonly used automotive embedded platform, namely Nvidia Jetson Xavier.
arXiv Detail & Related papers (2021-11-08T23:40:55Z) - NEAT: Neural Attention Fields for End-to-End Autonomous Driving [59.60483620730437]
We present NEural ATtention fields (NEAT), a novel representation that enables efficient reasoning for imitation learning models.
NEAT is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics.
In a new evaluation setting involving adverse environmental conditions and challenging scenarios, NEAT outperforms several strong baselines and achieves driving scores on par with the privileged CARLA expert.
arXiv Detail & Related papers (2021-09-09T17:55:28Z) - DMD: A Large-Scale Multi-Modal Driver Monitoring Dataset for Attention
and Alertness Analysis [54.198237164152786]
Vision is the richest and most cost-effective technology for Driver Monitoring Systems (DMS)
The lack of sufficiently large and comprehensive datasets is currently a bottleneck for the progress of DMS development.
In this paper, we introduce the Driver Monitoring dataset (DMD), an extensive dataset which includes real and simulated driving scenarios.
arXiv Detail & Related papers (2020-08-27T12:33:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.