Policy Pre-training for End-to-end Autonomous Driving via
Self-supervised Geometric Modeling
- URL: http://arxiv.org/abs/2301.01006v1
- Date: Tue, 3 Jan 2023 08:52:49 GMT
- Title: Policy Pre-training for End-to-end Autonomous Driving via
Self-supervised Geometric Modeling
- Authors: Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, Yu Qiao
- Abstract summary: We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward, fully self-supervised framework curated for policy pre-training in visuomotor driving.
We aim to learn policy representations as a powerful abstraction by modeling 3D geometric scenes from large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns the driving policy representation by predicting future ego-motion, optimized with the photometric error, based on the current visual observation only.
- Score: 96.31941517446859
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Witnessing the impressive achievements of pre-training techniques on
large-scale data in the field of computer vision and natural language
processing, we ask whether this idea can be adapted, in a grab-and-go spirit, to mitigate the sample inefficiency problem of visuomotor driving.
Given the highly dynamic and variable nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains a large amount of information irrelevant to decision making, which makes the predominant pre-training approaches from general vision less suitable for autonomous driving. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward, fully self-supervised framework curated for policy pre-training in visuomotor driving. We aim to learn policy representations as a powerful abstraction by modeling 3D geometric scenes from large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective
self-supervised training. In the first stage, the geometric modeling framework
generates pose and depth predictions simultaneously, with two consecutive
frames as input. In the second stage, the visual encoder learns the driving policy representation by predicting future ego-motion, optimized with the photometric error, based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich, driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks.
Extensive experiments covering a wide span of challenging scenarios have
demonstrated the superiority of our proposed approach, with improvements ranging from 2% to over 100% under very limited data. Code and models will be
available at https://github.com/OpenDriveLab/PPGeo.
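To make the two-stage recipe concrete, here is a minimal PyTorch sketch of the training flow described above. It is an illustration under simplifying assumptions, not the authors' implementation: the tiny networks (DepthNet, PoseNet, VisualEncoder), the fixed intrinsics K, the small-angle rotation, and the L1-only photometric term are all placeholders. In particular, PPGeo also has to recover camera intrinsics (the YouTube videos are uncalibrated), and self-supervised depth methods typically combine SSIM with L1 in the photometric loss; see the repository above for the real code.

```python
# Hedged sketch of PPGeo's two-stage self-supervised pre-training.
# All names and architectures below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthNet(nn.Module):
    """Per-pixel depth from a single frame (placeholder architecture)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 1, 3, padding=1), nn.Softplus())
    def forward(self, x):
        return self.net(x) + 0.1  # keep depth strictly positive

class PoseNet(nn.Module):
    """Stage 1: relative ego-motion (axis-angle + translation) from 2 frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, 32, 7, 2, 3), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(32, 6))
    def forward(self, img_t, img_t1):
        return 0.01 * self.net(torch.cat([img_t, img_t1], 1))  # small motions

class VisualEncoder(nn.Module):
    """Stage 2: the policy encoder; ego-motion from the CURRENT frame only."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 7, 2, 3), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(32, 6))
    def forward(self, img_t):
        return 0.01 * self.net(img_t)

def pose_to_Rt(p):
    """Small-angle approximation R = I + [w]x; real code uses Rodrigues."""
    w, t = p[:, :3], p[:, 3:]
    wx = torch.zeros(w.shape[0], 3, 3)
    wx[:, 0, 1], wx[:, 0, 2] = -w[:, 2], w[:, 1]
    wx[:, 1, 0], wx[:, 1, 2] = w[:, 2], -w[:, 0]
    wx[:, 2, 0], wx[:, 2, 1] = -w[:, 1], w[:, 0]
    return torch.eye(3) + wx, t.unsqueeze(-1)

def inverse_warp(src, depth, pose, K):
    """Resample `src` at pixels implied by `depth` + `pose` (pinhole model)."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float()   # 3,H,W
    pix = pix.view(3, -1).expand(B, 3, -1)                     # B,3,HW
    cam = torch.inverse(K) @ pix * depth.view(B, 1, -1)        # back-project
    R, t = pose_to_Rt(pose)
    proj = K @ (R @ cam + t)                                   # into source cam
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], -1).view(B, H, W, 2)
    return F.grid_sample(src, grid, align_corners=True)

photometric = lambda a, b: (a - b).abs().mean()  # real impls add SSIM to L1

depth_net, pose_net, encoder = DepthNet(), PoseNet(), VisualEncoder()
img_t, img_t1 = torch.rand(2, 3, 64, 96), torch.rand(2, 3, 64, 96)  # toy frames
K = torch.tensor([[50.0, 0.0, 48.0],
                  [0.0, 50.0, 32.0],
                  [0.0, 0.0, 1.0]]).repeat(2, 1, 1)  # assumed fixed intrinsics

# Stage 1: two consecutive frames -> depth and pose, trained jointly by
# reconstructing the current frame and minimizing the photometric error.
opt1 = torch.optim.Adam([*depth_net.parameters(), *pose_net.parameters()], lr=1e-4)
recon = inverse_warp(img_t1, depth_net(img_t), pose_net(img_t, img_t1), K)
loss1 = photometric(recon, img_t)
loss1.backward(); opt1.step(); opt1.zero_grad()

# Stage 2: freeze the geometry nets; the single-frame policy encoder must now
# explain the same photometric error from the current observation alone.
for p in [*depth_net.parameters(), *pose_net.parameters()]:
    p.requires_grad_(False)
opt2 = torch.optim.Adam(encoder.parameters(), lr=1e-4)
recon = inverse_warp(img_t1, depth_net(img_t), encoder(img_t), K)
loss2 = photometric(recon, img_t)
loss2.backward(); opt2.step(); opt2.zero_grad()
```

The design point the sketch tries to show: stage 2 withholds the second frame, so the only way the encoder can reduce the photometric error is to infer the ego-motion, i.e. the driving action, from the current observation alone, which is exactly the policy-relevant signal a downstream visuomotor task needs.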
Related papers
- End-to-End Autonomous Driving without Costly Modularization and 3D Manual Annotation [34.070813293944944]
We propose UAD, a method for vision-based end-to-end autonomous driving (E2EAD).
Our motivation stems from the observation that current E2EAD models still mimic the modular architecture of typical driving stacks.
Our UAD achieves a 38.7% relative improvement over UniAD on the average collision rate in nuScenes and surpasses VAD by 41.32 points on the driving score in CARLA's Town05 Long benchmark.
arXiv Detail & Related papers (2024-06-25T16:12:52Z)
- DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- On depth prediction for autonomous driving using self-supervised learning [0.0]
This thesis focuses on the challenge of depth prediction using monocular self-supervised learning techniques.
The problem is approached from a broader perspective, exploring conditional generative adversarial networks (cGANs).
The second contribution entails a single image-to-depth self-supervised method, proposing a solution for the rigid-scene assumption.
The third significant aspect involves the introduction of a video-to-depth map forecasting approach.
arXiv Detail & Related papers (2024-03-10T12:33:12Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Pre-training on Synthetic Driving Data for Trajectory Prediction [61.520225216107306]
We propose a pipeline-level solution to mitigate the issue of data scarcity in trajectory forecasting.
We adopt HD map augmentation and trajectory synthesis for generating driving data, and then we learn representations by pre-training on them.
We conduct extensive experiments to demonstrate the effectiveness of our data expansion and pre-training strategies.
arXiv Detail & Related papers (2023-09-18T19:49:22Z)
- Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.