ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal
Feature Learning
- URL: http://arxiv.org/abs/2207.07601v2
- Date: Mon, 18 Jul 2022 02:36:43 GMT
- Title: ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal
Feature Learning
- Authors: Shengchao Hu and Li Chen and Penghao Wu and Hongyang Li and Junchi Yan
and Dacheng Tao
- Abstract summary: We propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously.
To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
- Score: 132.20119288212376
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Many existing autonomous driving paradigms involve a multi-stage discrete
pipeline of tasks. To better predict the control signals and enhance user
safety, an end-to-end approach that benefits from joint spatial-temporal
feature learning is desirable. While there are some pioneering works on
LiDAR-based input or implicit design, in this paper we formulate the problem in
an interpretable vision-based setting. In particular, we propose a
spatial-temporal feature learning scheme towards a set of more representative
features for perception, prediction and planning tasks simultaneously, which is
called ST-P3. Specifically, an egocentric-aligned accumulation technique is
proposed to preserve geometry information in 3D space before the bird's eye
view transformation for perception; a dual-pathway model is devised to take
past motion variations into account for future prediction; a temporal-based
refinement unit is introduced to compensate for recognizing vision-based
elements for planning. To the best of our knowledge, we are the first to
systematically investigate each part of an interpretable end-to-end
vision-based autonomous driving system. We benchmark our approach against
previous state-of-the-art methods on both the open-loop nuScenes dataset and the
closed-loop CARLA simulation. The results show the effectiveness of our method.
Source code, model and protocol details are made publicly available at
https://github.com/OpenPerceptionX/ST-P3.
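The abstract outlines three spatial-temporal components: egocentric-aligned accumulation before the BEV transformation (perception), a dual-pathway model over past motion (prediction), and a temporal refinement unit (planning). Below is a minimal PyTorch sketch of that data flow; the GRU pathways, tensor shapes, and trajectory head are illustrative assumptions, not the released ST-P3 code (see the repository above for the authors' implementation).

```python
# Illustrative sketch of the ST-P3 data flow described in the abstract.
# Module choices, tensor shapes, and the simple placeholder ops are assumptions.
import torch
import torch.nn as nn


class STP3Sketch(nn.Module):
    def __init__(self, bev_channels: int = 64, horizon: int = 4):
        super().__init__()
        self.horizon = horizon
        # Perception: encode egocentric-aligned BEV features per frame.
        self.bev_encoder = nn.Conv2d(bev_channels, bev_channels, 3, padding=1)
        # Prediction: dual pathway -- one GRU driven by past motion variations,
        # one by the scene content itself.
        self.dynamic_path = nn.GRU(bev_channels, bev_channels, batch_first=True)
        self.semantic_path = nn.GRU(bev_channels, bev_channels, batch_first=True)
        # Planning: temporal refinement of the rolled-out state, then waypoints.
        self.refine = nn.GRU(bev_channels, bev_channels, batch_first=True)
        self.to_traj = nn.Linear(bev_channels, 2)  # (x, y) waypoint per step

    def forward(self, bev_seq: torch.Tensor) -> torch.Tensor:
        # bev_seq: (B, T, C, H, W) past BEV features already warped into the
        # current ego frame (the egocentric-aligned accumulation step).
        B, T, C, H, W = bev_seq.shape
        fused = self.bev_encoder(bev_seq.flatten(0, 1)).view(B, T, C, H, W)
        tokens = fused.mean(dim=(-2, -1))        # (B, T, C) per-frame descriptor
        dyn, _ = self.dynamic_path(tokens)       # past-motion pathway
        sem, _ = self.semantic_path(tokens)      # scene-content pathway
        future = dyn[:, -1:] + sem[:, -1:]       # fused state at the last frame
        rollout = future.repeat(1, self.horizon, 1)
        refined, _ = self.refine(rollout)        # temporal refinement
        return self.to_traj(refined)             # (B, horizon, 2) waypoints


if __name__ == "__main__":
    planner = STP3Sketch()
    waypoints = planner(torch.randn(1, 3, 64, 200, 200))
    print(waypoints.shape)  # torch.Size([1, 4, 2])
```

The sketch assumes the past BEV features have already been warped into the current ego frame, which is where the paper's egocentric-aligned accumulation would do its work.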
Related papers
- AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction [56.72301849123049]
We present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ dataset challenge at CVPR 2024.
Our innovative approach involves a dual-stage framework that enhances 3D occupancy and flow predictions by incorporating adaptive forward view transformation and flow modeling.
Our method combines regression with classification to address scale variations in different scenes, and leverages predicted flow to warp current voxel features to future frames, guided by future frame ground truth.
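The flow-guided warping described above can be sketched as a backward-warping resample of the current voxel grid: for each future-frame voxel, features are sampled from the current frame at the flow-displaced location. The tensor layout, voxel-unit flow, and use of grid_sample are assumptions for illustration, not the AdaOcc implementation.

```python
# Backward warping of current voxel features with a predicted flow field.
# Shapes and the grid_sample-based resampling are illustrative assumptions.
import torch
import torch.nn.functional as F


def warp_voxels(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, D, H, W) voxel features at time t.
    flow: (B, 3, D, H, W) per-voxel displacement in voxel units (dx, dy, dz).
    Returns features for the future frame, sampled at flow-displaced positions."""
    B, C, D, H, W = feat.shape
    # Base sampling grid in voxel coordinates; grid_sample expects (x, y, z).
    zs, ys, xs = torch.meshgrid(
        torch.arange(D), torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys, zs), dim=-1).float()              # (D, H, W, 3)
    base = base.unsqueeze(0).expand(B, -1, -1, -1, -1).to(feat)
    # Displace by the flow, then normalize to [-1, 1] for grid_sample.
    target = base + flow.permute(0, 2, 3, 4, 1)                   # (B, D, H, W, 3)
    size = torch.tensor([W - 1, H - 1, D - 1], dtype=feat.dtype, device=feat.device)
    grid = target / size * 2 - 1
    return F.grid_sample(feat, grid, align_corners=True)


if __name__ == "__main__":
    feat = torch.randn(1, 16, 8, 32, 32)
    flow = torch.zeros(1, 3, 8, 32, 32)   # zero flow reproduces the input
    assert torch.allclose(warp_voxels(feat, flow), feat, atol=1e-5)
```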
arXiv Detail & Related papers (2024-07-01T16:32:15Z) - Unified Spatio-Temporal Tri-Perspective View Representation for 3D Semantic Occupancy Prediction [6.527178779672975]
This study introduces S2TPVFormer, an architecture for temporally coherent 3D semantic occupancy prediction.
We enrich the prior process by including temporal cues using a novel temporal cross-view hybrid attention mechanism.
Experimental evaluations demonstrate a substantial 4.1% improvement in mean Intersection over Union for 3D Semantic Occupancy.
arXiv Detail & Related papers (2024-01-24T20:06:59Z) - Visual Point Cloud Forecasting enables Scalable Autonomous Driving [28.376086570498952]
Visual autonomous driving applications require features encompassing semantics, 3D geometry, and temporal information simultaneously.
We present ViDAR, a general model to pre-train downstream visual encoders.
Experiments show significant gain in downstream tasks, e.g., 3.1% NDS on 3D detection, 10% error reduction on motion forecasting, and 15% less collision rate on planning.
arXiv Detail & Related papers (2023-12-29T15:44:13Z) - OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose OccNeRF, a method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range.
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
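A common way to handle the cameras' unbounded ("infinite") perceptive range is to contract far-away coordinates into a finite volume before sampling; whether OccNeRF uses exactly this parameterization is an assumption, and the sketch below only illustrates the general idea.

```python
# Contraction of unbounded coordinates into a finite ball (a sketch only;
# the exact parameterization used by OccNeRF is an assumption here).
import torch


def contract(x: torch.Tensor, r: float = 1.0) -> torch.Tensor:
    """Identity inside radius r; points beyond r (out to infinity) are mapped
    into the shell between r and 2r, so the whole range stays sampleable."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    contracted = (2 * r - r * r / norm) * x / norm
    return torch.where(norm <= r, x, contracted)


if __name__ == "__main__":
    pts = torch.tensor([[0.5, 0.0, 0.0], [10.0, 0.0, 0.0], [1e6, 0.0, 0.0]])
    print(contract(pts))  # near point unchanged; far points approach radius 2r
```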
arXiv Detail & Related papers (2023-12-14T18:58:52Z) - SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations [76.45009891152178]
The pretraining-finetuning approach can alleviate the labeling burden by fine-tuning a pre-trained backbone across various downstream datasets and tasks.
We show, for the first time, that general representation learning can be achieved through the task of occupancy prediction.
Our findings will facilitate the understanding of LiDAR points and pave the way for future advancements in LiDAR pre-training.
arXiv Detail & Related papers (2023-09-19T11:13:01Z) - BEVerse: Unified Perception and Prediction in Birds-Eye-View for
Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z) - A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model on our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z) - PillarGrid: Deep Learning-based Cooperative Perception for 3D Object
Detection from Onboard-Roadside LiDAR [15.195933965761645]
We propose PillarGrid, a novel cooperative perception method fusing information from multiple 3D LiDARs.
PillarGrid consists of four main phases: 1) cooperative preprocessing of point clouds, 2) pillar-wise voxelization and feature extraction, 3) grid-wise deep fusion of features from multiple sensors, and 4) convolutional neural network (CNN)-based augmented 3D object detection.
Extensive experimentation shows that PillarGrid outperforms the SOTA single-LiDAR-based 3D object detection methods with respect to both accuracy and range by a large margin.
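A toy skeleton of the four phases listed above, kept deliberately small: the summed-height pillar feature, element-wise max fusion, and single-convolution head are stand-in assumptions, not the PillarGrid design.

```python
# Toy skeleton of the four PillarGrid phases; all concrete choices below
# (grid size, summed-height pillar feature, max fusion, tiny head) are assumptions.
import torch
import torch.nn as nn


def pillarize(points: torch.Tensor, grid=(64, 64), extent=50.0) -> torch.Tensor:
    """Phase 2: scatter points into BEV pillars, summing point heights per pillar."""
    bev = torch.zeros(1, *grid)
    ix = ((points[:, 0] + extent) / (2 * extent) * grid[0]).long().clamp(0, grid[0] - 1)
    iy = ((points[:, 1] + extent) / (2 * extent) * grid[1]).long().clamp(0, grid[1] - 1)
    bev[0].index_put_((ix, iy), points[:, 2], accumulate=True)
    return bev


def pillargrid_sketch(onboard: torch.Tensor, roadside: torch.Tensor) -> torch.Tensor:
    # Phase 1: cooperative preprocessing -- assumed here to mean both clouds are
    # already expressed in a shared coordinate frame.
    # Phase 2: pillar-wise voxelization and (trivial) feature extraction.
    feats = [pillarize(onboard), pillarize(roadside)]
    # Phase 3: grid-wise fusion of the two sensors (element-wise max).
    fused = torch.maximum(feats[0], feats[1])
    # Phase 4: CNN-based detection head (a single conv stands in for it).
    head = nn.Conv2d(1, 8, kernel_size=3, padding=1)
    return head(fused.unsqueeze(0))   # (1, 8, 64, 64) detection feature map


if __name__ == "__main__":
    out = pillargrid_sketch(torch.randn(1000, 3) * 20, torch.randn(1000, 3) * 20)
    print(out.shape)  # torch.Size([1, 8, 64, 64])
```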
arXiv Detail & Related papers (2022-03-12T02:28:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.