UniWorld: Autonomous Driving Pre-training via World Models
- URL: http://arxiv.org/abs/2308.07234v1
- Date: Mon, 14 Aug 2023 16:17:13 GMT
- Title: UniWorld: Autonomous Driving Pre-training via World Models
- Authors: Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, Bin Dai
- Abstract summary: We imbue the robot with a spatial-temporal world model, termed UniWorld, to perceive its surroundings and predict the future behavior of other participants.
UniWorld can estimate missing information concerning the world state and predict plausible future states of the world.
UniWorld's pre-training process is label-free, enabling the use of massive amounts of image-LiDAR pairs to build a foundation model.
- Score: 12.34628913148789
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we draw inspiration from Alberto Elfes' pioneering work in
1989, where he introduced the concept of the occupancy grid as a world model for
robots. We imbue the robot with a spatial-temporal world model, termed
UniWorld, to perceive its surroundings and predict the future behavior of other
participants. UniWorld first predicts 4D geometric occupancy as the world model
in a foundational pre-training stage and subsequently fine-tunes on downstream
tasks. UniWorld can estimate missing information concerning the
world state and predict plausible future states of the world. Moreover,
UniWorld's pre-training process is label-free, enabling the use of massive
amounts of image-LiDAR pairs to build a foundation model. The proposed
unified pre-training framework demonstrates promising results in key tasks such
as motion prediction, multi-camera 3D object detection, and surrounding
semantic scene completion. When compared to monocular pre-training methods on
the nuScenes dataset, UniWorld shows a significant improvement of about 1.5% in
IoU for motion prediction, 2.0% in mAP and 2.0% in NDS for multi-camera 3D
object detection, as well as a 3% increase in mIoU for surrounding semantic
scene completion. By adopting our unified pre-training method, a 25% reduction
in 3D training annotation costs can be achieved, offering significant practical
value for the implementation of real-world autonomous driving. Codes are
publicly available at https://github.com/chaytonmin/UniWorld.
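Because the pre-training target is label-free 4D occupancy, the core data-preparation step amounts to voxelizing LiDAR sweeps into occupancy grids and stacking them over time. The sketch below illustrates that idea in NumPy; the function name `lidar_to_occupancy`, the grid resolution, and the toy point clouds are illustrative assumptions, not the authors' released code (see the repository above for the actual implementation).
```python
import numpy as np

def lidar_to_occupancy(points, pc_range, voxel_size):
    """Voxelize one LiDAR sweep (N, 3) into a binary occupancy grid."""
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    hi = np.asarray(pc_range[3:], dtype=np.float64)
    vs = np.asarray(voxel_size, dtype=np.float64)
    dims = np.floor((hi - lo) / vs).astype(int)
    grid = np.zeros(dims, dtype=np.uint8)

    # Keep only points that fall inside the region of interest.
    inside = np.all((points >= lo) & (points < hi), axis=1)
    idx = ((points[inside] - lo) / vs).astype(int)
    idx = np.clip(idx, 0, dims - 1)  # guard against boundary rounding
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

# Toy example: stacking grids from consecutive sweeps along time yields a
# 4D (T, X, Y, Z) occupancy target without any manual annotation.
rng = np.random.default_rng(0)
pc_range = (-50.0, -50.0, -5.0, 50.0, 50.0, 3.0)   # x/y/z min, then max (m)
voxel_size = (0.5, 0.5, 0.5)                       # voxel edge length (m)
sweeps = [rng.uniform(-55.0, 55.0, size=(20000, 3)) for _ in range(2)]
occ_4d = np.stack(
    [lidar_to_occupancy(p, pc_range, voxel_size) for p in sweeps], axis=0
)
print(occ_4d.shape)  # (2, 200, 200, 16)
```
A network pre-trained to reconstruct and forecast such grids from image-LiDAR pairs can then be fine-tuned with task-specific heads for detection, motion prediction, or semantic scene completion, which is the label-free recipe the abstract describes.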
Related papers
- Multi-Transmotion: Pre-trained Model for Human Motion Prediction [68.87010221355223]
Multi-Transmotion is an innovative transformer-based model designed for cross-modality pre-training.
Our methodology demonstrates competitive performance across various datasets on several downstream tasks.
arXiv Detail & Related papers (2024-11-04T23:15:21Z)
- DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model [14.996395953240699]
DOME is a diffusion-based world model that predicts future occupancy frames based on past occupancy observations.
The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving.
arXiv Detail & Related papers (2024-10-14T12:24:32Z)
- Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving [15.100104512786107]
Drive-OccWorld adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving.
We propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model.
Experiments on the nuScenes dataset demonstrate that our method can generate plausible and controllable 4D occupancy.
arXiv Detail & Related papers (2024-08-26T11:53:09Z)
- DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z)
- 3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.
We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action.
Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
arXiv Detail & Related papers (2024-03-14T17:58:41Z)
- Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion [36.321494200830244]
Copilot4D is a novel world modeling approach that first tokenizes sensor observations with a VQ-VAE, then predicts the future via discrete diffusion.
Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
arXiv Detail & Related papers (2023-11-02T06:21:56Z)
- UniScene: Multi-Camera Unified Pre-training via 3D Scene Reconstruction for Autonomous Driving [11.507979392707448]
We propose the first multi-camera unified pre-training framework, called UniScene.
We employ Occupancy as the general representation for the 3D scene, enabling the model to grasp geometric priors of the surrounding world.
UniScene shows a significant improvement of about 2.0% in mAP and 2.0% in NDS for multi-camera 3D object detection, as well as a 3% increase in mIoU for surrounding semantic scene completion.
arXiv Detail & Related papers (2023-05-30T08:23:06Z)
- Predictive World Models from Real-World Partial Observations [66.80340484148931]
We present a framework for learning a probabilistic predictive world model for real-world road environments.
While prior methods require complete states as ground truth for learning, we present a novel sequential training method to allow HVAEs to learn to predict complete states from partially observed states only.
arXiv Detail & Related papers (2023-01-12T02:07:26Z)
- T3VIP: Transformation-based 3D Video Prediction [49.178585201673364]
We propose a 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts.
Our model is fully unsupervised, captures the nature of the real world, and uses observational cues in the image and point cloud domains as its learning signals.
To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera.
arXiv Detail & Related papers (2022-09-19T15:01:09Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)