Visual Point Cloud Forecasting enables Scalable Autonomous Driving
- URL: http://arxiv.org/abs/2312.17655v1
- Date: Fri, 29 Dec 2023 15:44:13 GMT
- Title: Visual Point Cloud Forecasting enables Scalable Autonomous Driving
- Authors: Zetong Yang, Li Chen, Yanan Sun, Hongyang Li
- Abstract summary: Visual autonomous driving applications require features encompassing semantics, 3D geometry, and temporal information simultaneously.
We present ViDAR, a general model to pre-train downstream visual encoders.
- Abstract summary: Experiments show significant gains on downstream tasks, e.g., 3.1% NDS on 3D detection, ~10% error reduction on motion forecasting, and ~15% lower collision rate on planning.
- Score: 28.376086570498952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In contrast to extensive studies on general vision, pre-training for
scalable visual autonomous driving remains largely unexplored. Visual autonomous
driving applications require features that encompass semantics, 3D geometry, and
temporal information simultaneously for joint perception, prediction, and planning,
posing substantial challenges for pre-training. To resolve this, we introduce a new
pre-training task termed visual point cloud forecasting: predicting future point
clouds from historical visual input. The key merit of this task is that it captures
the synergistic learning of semantics, 3D structures, and temporal dynamics, which
is why it transfers well to various downstream tasks. To address this new problem,
we present ViDAR, a general model for pre-training downstream visual encoders. It
first extracts historical embeddings with the encoder; these representations are
then transformed into 3D geometric space via a novel Latent Rendering operator for
future point cloud prediction. Experiments show significant gains on downstream
tasks, e.g., 3.1% NDS on 3D detection, ~10% error reduction on motion forecasting,
and ~15% lower collision rate on planning.
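To make the described pipeline concrete, here is a minimal PyTorch sketch of the data flow the abstract outlines: a visual encoder produces historical embeddings, a simplified latent-rendering step lifts them into a discretized depth volume, and a head decodes future geometry. All module names, shapes, and the ray-wise expected-depth decoding are illustrative assumptions, not ViDAR's actual implementation.

```python
# Hypothetical sketch of visual point cloud forecasting: encode past frames,
# lift to a latent depth volume, decode per-ray depth for a future frame.
import torch
import torch.nn as nn

class VisualPointCloudForecaster(nn.Module):
    def __init__(self, img_feat_dim=256, num_depth_bins=16):
        super().__init__()
        # Stand-in visual encoder (the paper pre-trains a downstream encoder).
        self.encoder = nn.Conv2d(3, img_feat_dim, kernel_size=4, stride=4)
        # Lift per-pixel features to occupancy logits over discretized depth bins.
        self.to_volume = nn.Linear(img_feat_dim, num_depth_bins)
        # Predict the future frame's depth logits ("latent rendering", simplified).
        self.future_head = nn.Conv2d(num_depth_bins, num_depth_bins, 1)

    def forward(self, history):                      # history: (B, T, 3, H, W)
        b, t = history.shape[:2]
        feats = self.encoder(history.flatten(0, 1))  # (B*T, C, h, w)
        feats = feats.reshape(b, t, *feats.shape[1:]).mean(dim=1)  # fuse time
        vol = self.to_volume(feats.permute(0, 2, 3, 1))   # (B, h, w, D)
        logits = self.future_head(vol.permute(0, 3, 1, 2))  # (B, D, h, w)
        # Expected depth per ray -> a future "point cloud" in ray coordinates.
        depth_bins = torch.linspace(1.0, 50.0, logits.shape[1])
        prob = logits.softmax(dim=1)
        return (prob * depth_bins.view(1, -1, 1, 1)).sum(dim=1)  # (B, h, w)

model = VisualPointCloudForecaster()
pred_depth = model(torch.randn(2, 4, 3, 64, 64))  # two clips of four frames
print(pred_depth.shape)  # torch.Size([2, 16, 16])
```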
Related papers
- Vision-based 3D occupancy prediction in autonomous driving: a review and outlook [19.939380586314673]
We introduce the background of vision-based 3D occupancy prediction and discuss the challenges in this task.
We conduct a comprehensive survey of the progress in vision-based 3D occupancy prediction from three aspects.
We present a summary of prevailing research trends and outline some promising future directions.
arXiv Detail & Related papers (2024-05-04T07:39:25Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models, which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
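As a rough illustration of what conditioning on geometry and time can mean mechanically, the hypothetical denoiser below injects a depth map (geometry) and a prediction-horizon embedding (time) via feature modulation; the architecture and names are assumptions, not taken from the paper.

```python
# Hypothetical sketch: a denoiser conditioned on geometry (a depth map) and
# time (how far into the future to predict). Purely illustrative.
import torch
import torch.nn as nn

class GeometryTimeConditionedDenoiser(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.img_in = nn.Conv2d(3, ch, 3, padding=1)
        self.depth_in = nn.Conv2d(1, ch, 3, padding=1)   # geometry condition
        self.time_mlp = nn.Sequential(                   # horizon embedding
            nn.Linear(1, ch), nn.SiLU(), nn.Linear(ch, 2 * ch))
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, noisy_img, depth, horizon):
        # horizon: (B, 1) scalar "seconds into the future" per sample.
        h = self.img_in(noisy_img) + self.depth_in(depth)
        scale, shift = self.time_mlp(horizon).chunk(2, dim=-1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]  # FiLM-style
        return self.out(torch.relu(h))  # predicted noise (or clean image)

net = GeometryTimeConditionedDenoiser()
eps = net(torch.randn(2, 3, 32, 32), torch.rand(2, 1, 32, 32),
          torch.tensor([[0.5], [2.0]]))
print(eps.shape)  # torch.Size([2, 3, 32, 32])
```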
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Unified Spatio-Temporal Tri-Perspective View Representation for 3D Semantic Occupancy Prediction [6.527178779672975]
This study introduces S2TPVFormer, an architecture for temporally coherent 3D semantic occupancy prediction.
We enrich the prior process by including temporal cues using a novel temporal cross-view hybrid attention mechanism.
Experimental evaluations demonstrate a substantial 4.1% improvement in mean Intersection over Union for 3D semantic occupancy prediction.
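The mechanism named above can be pictured as plane tokens from the current frame cross-attending to cached tokens from a past frame. The sketch below is a generic simplification under that assumption, not S2TPVFormer's exact hybrid attention.

```python
# Generic sketch of "temporal cross-view attention": queries from the current
# frame's plane tokens attend to tokens cached from a past frame.
import torch
import torch.nn as nn

class TemporalCrossViewAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, curr_tokens, past_tokens):
        # curr_tokens: (B, N, C) plane queries for frame t
        # past_tokens: (B, M, C) plane features cached from frame t-1
        fused, _ = self.attn(query=curr_tokens, key=past_tokens, value=past_tokens)
        return self.norm(curr_tokens + fused)  # residual temporal fusion

layer = TemporalCrossViewAttention()
out = layer(torch.randn(2, 100, 128), torch.randn(2, 100, 128))
print(out.shape)  # torch.Size([2, 100, 128])
```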
arXiv Detail & Related papers (2024-01-24T20:06:59Z)
- SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations [76.45009891152178]
The pretraining-finetuning approach can alleviate the labeling burden by fine-tuning a pre-trained backbone on various downstream datasets and tasks.
We show, for the first time, that general representation learning can be achieved through the task of occupancy prediction.
Our findings will facilitate the understanding of LiDAR points and pave the way for future advancements in LiDAR pre-training.
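One simple way to realize occupancy prediction as a pretext task is to voxelize a LiDAR sweep, hide part of the grid, and train a backbone to reconstruct full occupancy. The sketch below follows that assumed masked-completion variant; grid sizes, ranges, and the toy backbone are illustrative, not SPOT's configuration.

```python
# Hypothetical occupancy-prediction pretext: voxelize a sweep into a binary
# grid, mask voxels, and supervise reconstruction with BCE.
import torch
import torch.nn as nn
import torch.nn.functional as F

def voxelize(points, grid=(32, 32, 8), extent=50.0, height=4.0):
    """points: (N, 3) in meters -> binary occupancy grid of shape (1, *grid)."""
    occ = torch.zeros(1, *grid)
    offset = torch.tensor([extent, extent, height])
    scale = torch.tensor([grid[0] / (2 * extent),
                          grid[1] / (2 * extent),
                          grid[2] / (2 * height)])
    idx = ((points + offset) * scale).long()
    keep = ((idx >= 0) & (idx < torch.tensor(grid))).all(dim=1)
    idx = idx[keep]
    occ[0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return occ

backbone = nn.Sequential(  # toy stand-in for a real 3D backbone
    nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv3d(16, 1, 3, padding=1))

points = torch.rand(2000, 3) * 40 - 20                     # fake LiDAR sweep
target = voxelize(points)                                  # (1, 32, 32, 8)
masked = target * (torch.rand_like(target) > 0.5).float()  # hide half the voxels
logits = backbone(masked.unsqueeze(0)).squeeze(0)          # predict full occupancy
loss = F.binary_cross_entropy_with_logits(logits, target)
loss.backward()                                            # pre-trains the backbone
```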
arXiv Detail & Related papers (2023-09-19T11:13:01Z)
- 3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes [68.66237114509264]
We present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids.
We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space.
arXiv Detail & Related papers (2023-04-22T19:28:49Z)
- ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme that produces more representative features for perception, prediction, and planning tasks simultaneously.
To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z)
- Self-supervised Point Cloud Prediction Using 3D Spatio-temporal Convolutional Networks [27.49539859498477]
Exploiting past 3D LiDAR scans to predict future point clouds is a promising method for autonomous mobile systems.
We propose an end-to-end approach that exploits a 2D range image representation of each 3D LiDAR scan.
We develop an encoder-decoder architecture using 3D convolutions to jointly aggregate spatial and temporal information of the scene.
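The summarized approach maps each LiDAR scan to a 2D range image and aggregates space and time with 3D convolutions. Below is a minimal sketch of that idea, assuming simple mean pooling over the temporal axis; channel counts and depths are illustrative, not the paper's configuration.

```python
# Minimal sketch: stack T range images on a temporal axis and let 3D
# convolutions jointly aggregate space and time to predict the next scan.
import torch
import torch.nn as nn

class RangeImagePredictor(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Collapse the temporal axis, then decode a single future range image.
        self.decoder = nn.Conv2d(ch, 1, kernel_size=3, padding=1)

    def forward(self, ranges):               # ranges: (B, 1, T, H, W)
        feats = self.encoder(ranges)         # (B, ch, T, H, W)
        fused = feats.mean(dim=2)            # temporal pooling -> (B, ch, H, W)
        return self.decoder(fused)           # next range image (B, 1, H, W)

model = RangeImagePredictor()
past = torch.randn(2, 1, 5, 64, 512)         # five past scans per sample
print(model(past).shape)                      # torch.Size([2, 1, 64, 512])
```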
arXiv Detail & Related papers (2021-09-28T19:58:13Z)
- Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds [96.9027094562957]
We introduce a spatio-temporal representation learning (STRL) framework capable of learning from unlabeled 3D point clouds in a self-supervised fashion.
Inspired by how infants learn from visual data in the wild, we explore rich cues derived from the 3D data.
STRL takes two temporally-related frames from a 3D point cloud sequence as input, transforms them with spatial data augmentation, and learns the invariant representation in a self-supervised manner.
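A minimal sketch of this two-frame scheme: encode two temporally adjacent, independently augmented point clouds with a shared encoder and pull their embeddings together. STRL itself uses an online/target (BYOL-style) pair of networks; the single-encoder cosine loss below is a deliberate simplification.

```python
# Simplified two-frame invariance objective on point clouds.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Tiny PointNet-like encoder: per-point MLP + max pooling."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, pts):                       # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values    # (B, dim)

def augment(pts):
    """Random rotation about z plus jitter, a stand-in spatial augmentation."""
    theta = torch.rand(()).item() * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return pts @ rot.T + 0.01 * torch.randn_like(pts)

enc = PointEncoder()
frame_t, frame_t1 = torch.randn(4, 1024, 3), torch.randn(4, 1024, 3)  # adjacent frames
z1, z2 = enc(augment(frame_t)), enc(augment(frame_t1))
loss = -F.cosine_similarity(z1, z2, dim=-1).mean()   # invariance objective
loss.backward()
```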
arXiv Detail & Related papers (2021-09-01T04:17:11Z)
- Scalable Scene Flow from Point Clouds in the Real World [30.437100097997245]
We introduce a new large-scale benchmark for scene flow based on the Waymo Open Dataset.
We show how previous works were limited by the amount of real LiDAR data available.
We introduce the model architecture FastFlow3D, which provides real-time inference on the full point cloud.
arXiv Detail & Related papers (2021-03-01T20:56:05Z)
- PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding [107.02479689909164]
In this work, we aim to facilitate research on 3D representation learning.
We measure the effect of unsupervised pre-training on a large source set of 3D scenes.
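A compact way to picture this kind of pre-training is a point-level InfoNCE loss: features of corresponding points across two views are positives, and all other points serve as negatives. The sketch below is a simplified stand-in for the paper's objective, not its exact formulation.

```python
# Point-level contrastive loss over matched points from two views.
import torch
import torch.nn.functional as F

def point_info_nce(feat_a, feat_b, temperature=0.07):
    """feat_a, feat_b: (N, C) features of N matched points from two views."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.T / temperature            # (N, N) similarity matrix
    targets = torch.arange(a.shape[0])        # point i in view A matches i in B
    return F.cross_entropy(logits, targets)

feat_a, feat_b = torch.randn(256, 32), torch.randn(256, 32)
print(point_info_nce(feat_a, feat_b))         # scalar contrastive loss
```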
arXiv Detail & Related papers (2020-07-21T17:59:22Z)
- 3DMotion-Net: Learning Continuous Flow Function for 3D Motion Prediction [12.323767993152968]
We address the problem of predicting the future 3D motion of 3D object scans from the previous two consecutive frames.
We propose a self-supervised approach that leverages deep neural networks to learn a continuous flow function of 3D point clouds.
We perform extensive experiments on the D-FAUST, SCAPE, and TOSCA benchmarks; the results demonstrate that our approach can handle temporally inconsistent input.
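The core idea can be sketched as an MLP acting as a continuous flow function that maps a 3D point to its displacement, trained self-supervisedly by warping frame t and matching frame t+1 with a Chamfer-style loss; the network and loss below are illustrative, not the paper's exact design.

```python
# Sketch: a continuous flow function f(x) -> displacement, trained by warping
# frame t's points and matching frame t+1 with a symmetric Chamfer loss.
import torch
import torch.nn as nn

flow = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 3))

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a: (N, 3) and b: (M, 3)."""
    d = torch.cdist(a, b)                     # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

frame_t = torch.randn(512, 3)
frame_t1 = frame_t + 0.05                     # fake "next frame" for the demo
warped = frame_t + flow(frame_t)              # apply the continuous flow
loss = chamfer(warped, frame_t1)              # self-supervised objective
loss.backward()
```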
arXiv Detail & Related papers (2020-06-24T17:39:19Z)