UniWorld: Autonomous Driving Pre-training via World Models
- URL: http://arxiv.org/abs/2308.07234v1
- Date: Mon, 14 Aug 2023 16:17:13 GMT
- Title: UniWorld: Autonomous Driving Pre-training via World Models
- Authors: Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, Bin Dai
- Abstract summary: We imbue the robot with a spatial-temporal world model, termed UniWorld, to perceive its surroundings and predict the future behavior of other participants.
UniWorld can estimate missing information concerning the world state and predict plausible future states of the world.
UniWorld's pre-training process is label-free, enabling the utilization of massive amounts of image-LiDAR pairs to build a Foundational Model.
- Score: 12.34628913148789
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we draw inspiration from Alberto Elfes' pioneering work in
1989, where he introduced the concept of the occupancy grid as a World Model for
robots. We imbue the robot with a spatial-temporal world model, termed
UniWorld, to perceive its surroundings and predict the future behavior of other
participants. UniWorld first predicts 4D geometric occupancy as a World Model in
the foundational pre-training stage and is subsequently fine-tuned on
downstream tasks. UniWorld can estimate missing information concerning the
downstream tasks. UniWorld can estimate missing information concerning the
world state and predict plausible future states of the world. Besides,
UniWorld's pre-training process is label-free, enabling the utilization of
massive amounts of image-LiDAR pairs to build a Foundational Model. The proposed
unified pre-training framework demonstrates promising results in key tasks such
as motion prediction, multi-camera 3D object detection, and surrounding
semantic scene completion. When compared to monocular pre-training methods on
the nuScenes dataset, UniWorld shows a significant improvement of about 1.5% in
IoU for motion prediction, 2.0% in mAP and 2.0% in NDS for multi-camera 3D
object detection, as well as a 3% increase in mIoU for surrounding semantic
scene completion. By adopting our unified pre-training method, a 25% reduction
in 3D training annotation costs can be achieved, offering significant practical
value for the implementation of real-world autonomous driving. Codes are
publicly available at https://github.com/chaytonmin/UniWorld.
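To make the label-free pre-training idea concrete, below is a minimal, illustrative sketch of the pretext task described in the abstract: surround-view images are used to predict an occupancy grid voxelized directly from the paired LiDAR sweep, so no human annotation is needed. This is not the authors' implementation; it simplifies the 4D geometric occupancy task to a single-frame 3D grid, and the names (ImageToVoxelEncoder, voxelize_lidar), grid resolution, and point-cloud range are assumptions chosen for the example. The encoder is only a stand-in for a real multi-camera backbone with a 2D-to-3D view transform.

```python
# Minimal sketch of label-free occupancy pre-training (not the authors' code).
# Assumed/illustrative: ImageToVoxelEncoder, voxelize_lidar, OCC_SHAPE, PC_RANGE.
import torch
import torch.nn as nn
import torch.nn.functional as F

OCC_SHAPE = (8, 64, 64)                              # (Z, Y, X) grid size, assumed
PC_RANGE = (-51.2, -51.2, -5.0, 51.2, 51.2, 3.0)     # (x_min, y_min, z_min, x_max, y_max, z_max)

def voxelize_lidar(points, occ_shape=OCC_SHAPE, pc_range=PC_RANGE):
    """Convert a LiDAR point cloud (N, 3) into a binary occupancy grid.
    The grid serves as a free supervision signal: no human labels needed."""
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    zs, ys, xs = occ_shape
    occ = torch.zeros(occ_shape)
    mask = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
            (points[:, 1] >= y_min) & (points[:, 1] < y_max) &
            (points[:, 2] >= z_min) & (points[:, 2] < z_max))
    p = points[mask]
    ix = ((p[:, 0] - x_min) / (x_max - x_min) * xs).long()
    iy = ((p[:, 1] - y_min) / (y_max - y_min) * ys).long()
    iz = ((p[:, 2] - z_min) / (z_max - z_min) * zs).long()
    occ[iz, iy, ix] = 1.0
    return occ

class ImageToVoxelEncoder(nn.Module):
    """Stand-in multi-camera encoder: pools per-camera features and maps them to
    occupancy logits with a linear head. A real system would use a 2D backbone
    plus a geometric 2D-to-3D view transform instead of this toy head."""
    def __init__(self, n_cams=6, occ_shape=OCC_SHAPE):
        super().__init__()
        self.occ_shape = occ_shape
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(n_cams * 64, occ_shape[0] * occ_shape[1] * occ_shape[2])

    def forward(self, images):                        # images: (B, n_cams, 3, H, W)
        b = images.shape[0]
        feats = self.backbone(images.flatten(0, 1))   # (B * n_cams, 64, 1, 1)
        feats = feats.flatten(1).view(b, -1)          # (B, n_cams * 64)
        return self.head(feats).view(b, *self.occ_shape)

# One pre-training step: predict the LiDAR-derived occupancy from surround images.
model = ImageToVoxelEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.rand(1, 6, 3, 128, 352)                # six surround-view frames (synthetic)
lidar = torch.rand(20000, 3) * 100 - 50               # synthetic LiDAR sweep in metres
target = voxelize_lidar(lidar).unsqueeze(0)           # label-free occupancy target

logits = model(images)
loss = F.binary_cross_entropy_with_logits(logits, target)
loss.backward()
optimizer.step()
print(f"pre-training loss: {loss.item():.4f}")
```

In the framework described above, an encoder pre-trained this way would then be fine-tuned on downstream tasks such as motion prediction, multi-camera 3D object detection, and semantic scene completion.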
Related papers
- Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving [22.832008530490167]
We propose a semi-supervised vision-centric 3D occupancy world model, PreWorld, to leverage the potential of 2D labels.
PreWorld achieves competitive performance across 3D occupancy prediction, 4D occupancy forecasting and motion planning tasks.
arXiv Detail & Related papers (2025-02-11T07:12:26Z)
- GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction [67.81475355852997]
3D occupancy prediction is important for autonomous driving due to its comprehensive perception of the surroundings.
We propose a world-model-based framework to exploit the scene evolution for perception.
Our framework improves the performance of the single-frame counterpart by over 2% in mIoU without introducing additional computations.
arXiv Detail & Related papers (2024-12-13T18:59:54Z)
- DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model [14.996395953240699]
DOME is a diffusion-based world model that predicts future occupancy frames based on past occupancy observations.
The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving.
arXiv Detail & Related papers (2024-10-14T12:24:32Z)
- Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving [15.100104512786107]
Drive-OccWorld adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving.
We propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation.
Our method can generate plausible and controllable 4D occupancy, paving the way for advancements in driving world generation and end-to-end planning.
arXiv Detail & Related papers (2024-08-26T11:53:09Z)
- DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z)
- 3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.
We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action.
Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
arXiv Detail & Related papers (2024-03-14T17:58:41Z)
- UniScene: Multi-Camera Unified Pre-training via 3D Scene Reconstruction for Autonomous Driving [11.507979392707448]
We propose the first multi-camera unified pre-training framework, called UniScene.
We employ Occupancy as the general representation for the 3D scene, enabling the model to grasp geometric priors of the surrounding world.
UniScene shows a significant improvement of about 2.0% in mAP and 2.0% in NDS for multi-camera 3D object detection, as well as a 3% increase in mIoU for surrounding semantic scene completion.
arXiv Detail & Related papers (2023-05-30T08:23:06Z)
- Predictive World Models from Real-World Partial Observations [66.80340484148931]
We present a framework for learning a probabilistic predictive world model for real-world road environments.
While prior methods require complete states as ground truth for learning, we present a novel sequential training method to allow HVAEs to learn to predict complete states from partially observed states only.
arXiv Detail & Related papers (2023-01-12T02:07:26Z)
- T3VIP: Transformation-based 3D Video Prediction [49.178585201673364]
We propose a 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts.
Our model is fully unsupervised, captures the nature of the real world, and the observational cues in image and point cloud domains constitute its learning signals.
To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera.
arXiv Detail & Related papers (2022-09-19T15:01:09Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)