Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving
- URL: http://arxiv.org/abs/2502.07309v1
- Date: Tue, 11 Feb 2025 07:12:26 GMT
- Title: Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving
- Authors: Xiang Li, Pengfei Li, Yupeng Zheng, Wei Sun, Yan Wang, Yilun Chen,
- Abstract summary: We propose a semi-supervised vision-centric 3D occupancy world model, PreWorld, to leverage the potential of 2D labels.
PreWorld achieves competitive performance across 3D occupancy prediction, 4D occupancy forecasting and motion planning tasks.
- Score: 22.832008530490167
- Abstract: Understanding world dynamics is crucial for planning in autonomous driving. Recent methods attempt to achieve this by learning a 3D occupancy world model that forecasts future surrounding scenes based on current observations. However, 3D occupancy labels are still required to produce promising results. Considering the high annotation cost for 3D outdoor scenes, we propose a semi-supervised vision-centric 3D occupancy world model, PreWorld, to leverage the potential of 2D labels through a novel two-stage training paradigm: the self-supervised pre-training stage and the fully-supervised fine-tuning stage. Specifically, during the pre-training stage, we utilize an attribute projection head to generate different attribute fields of a scene (e.g., RGB, density, semantics), thus enabling temporal supervision from 2D labels via volume rendering techniques. Furthermore, we introduce a simple yet effective state-conditioned forecasting module to recursively forecast future occupancy and ego trajectory in a direct manner. Extensive experiments on the nuScenes dataset validate the effectiveness and scalability of our method, and demonstrate that PreWorld achieves competitive performance across 3D occupancy prediction, 4D occupancy forecasting and motion planning tasks.
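As a rough illustration of the pre-training stage, the sketch below (PyTorch) shows how an attribute projection head can map per-point scene features to density, RGB and semantic fields, which are then composited along camera rays with standard volume-rendering weights so that 2D labels supervise the 3D volume. All module names, shapes and losses here are illustrative assumptions, not PreWorld's actual implementation.

```python
# Hedged sketch: attribute fields + volume rendering for 2D supervision.
import torch
import torch.nn as nn

class AttributeProjectionHead(nn.Module):
    def __init__(self, feat_dim=64, num_classes=17):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.density = nn.Linear(64, 1)             # sigma (density) field
        self.rgb = nn.Linear(64, 3)                 # color field
        self.semantic = nn.Linear(64, num_classes)  # semantic field

    def forward(self, feats):                       # feats: (R, S, C)
        h = self.mlp(feats)
        return (torch.relu(self.density(h)),        # (R, S, 1), sigma >= 0
                torch.sigmoid(self.rgb(h)),         # (R, S, 3)
                self.semantic(h))                   # (R, S, K) logits

def render_along_rays(sigma, values, deltas):
    """Composite per-sample values into per-ray 2D predictions.
    sigma: (R, S, 1), values: (R, S, D), deltas: (R, S, 1) step sizes."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                     # (R, S, 1)
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]                                           # transmittance
    weights = alpha * trans                                      # (R, S, 1)
    return (weights * values).sum(dim=1)                         # (R, D)

# Toy usage: 1024 rays, 48 samples each, supervised by 2D RGB/semantic labels.
R, S = 1024, 48
head = AttributeProjectionHead()
feats = torch.randn(R, S, 64)          # features sampled along camera rays
deltas = torch.full((R, S, 1), 0.5)    # ray-marching step sizes (meters)
sigma, rgb, sem = head(feats)
pred_rgb = render_along_rays(sigma, rgb, deltas)                 # (R, 3)
pred_sem = render_along_rays(sigma, sem, deltas)                 # (R, K)
loss = nn.functional.mse_loss(pred_rgb, torch.rand(R, 3)) + \
       nn.functional.cross_entropy(pred_sem, torch.randint(0, 17, (R,)))
loss.backward()
```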
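The state-conditioned forecasting module is described as directly and recursively rolling out future occupancy and ego trajectory. A minimal sketch of that recursive loop, with a GRU cell and hypothetical heads and dimensions standing in for the paper's architecture:

```python
# Hedged sketch: recursive, state-conditioned rollout of occupancy + ego pose.
import torch
import torch.nn as nn

class StateConditionedForecaster(nn.Module):
    def __init__(self, state_dim=256, ego_dim=2):
        super().__init__()
        self.transition = nn.GRUCell(state_dim + ego_dim, state_dim)
        self.occ_head = nn.Linear(state_dim, state_dim)  # next occupancy feats
        self.ego_head = nn.Linear(state_dim, ego_dim)    # next (x, y) waypoint

    def forward(self, state, ego, horizon=6):
        occs, traj, h = [], [], torch.zeros_like(state)
        for _ in range(horizon):
            h = self.transition(torch.cat([state, ego], dim=-1), h)
            state = self.occ_head(h)    # forecast the next occupancy state...
            ego = self.ego_head(h)      # ...and the next ego waypoint
            occs.append(state)
            traj.append(ego)            # both fed back as conditioning
        return torch.stack(occs, 1), torch.stack(traj, 1)

model = StateConditionedForecaster()
occ_seq, ego_traj = model(torch.randn(4, 256), torch.zeros(4, 2))
print(occ_seq.shape, ego_traj.shape)   # (4, 6, 256), (4, 6, 2)
```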
Related papers
- An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training [50.71892161377806]
DFIT-OccWorld is an efficient 3D occupancy world model that leverages decoupled dynamic flow and an image-assisted training strategy.
Our model forecasts future dynamic voxels by warping existing observations using voxel flow, whereas static voxels are easily obtained through pose transformation.
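A rough sketch of this decoupled idea, assuming hypothetical names and shapes rather than the DFIT-OccWorld implementation: dynamic voxels are advected by a predicted per-voxel flow on top of the ego-pose transform, while static voxels are moved by the pose transform alone.

```python
# Hedged sketch: decoupled forecasting of dynamic vs. static voxels.
import numpy as np

def forecast_voxels(coords, labels, flow, pose, dynamic_mask):
    """coords: (N, 3) occupied voxel centers; flow: (N, 3) predicted motion;
    pose: (4, 4) ego transform from t to t+1; dynamic_mask: (N,) bool."""
    homo = np.concatenate([coords, np.ones((len(coords), 1))], axis=1)
    static_next = (homo @ pose.T)[:, :3]       # pose transform for all voxels
    dynamic_next = static_next + flow          # extra flow for moving voxels
    next_coords = np.where(dynamic_mask[:, None], dynamic_next, static_next)
    return next_coords, labels                 # semantics carried along

coords = np.random.rand(100, 3) * 50.0
pose = np.eye(4); pose[0, 3] = 1.2             # ego moved 1.2 m forward
flow = np.random.randn(100, 3) * 0.1           # predicted per-voxel flow
mask = np.random.rand(100) > 0.8               # which voxels are dynamic
next_coords, _ = forecast_voxels(coords, np.zeros(100), flow, pose, mask)
```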
arXiv Detail & Related papers (2024-12-18T12:10:33Z)
- Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance [8.07701188057789]
We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data.
Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues.
Our method achieves up to 85% of the fully-supervised performance using only 10% labeled data.
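A hedged sketch of how such 2D guidance could be lifted to 3D: per-pixel semantics from a 2D foundation model are back-projected with estimated depth to pseudo-label voxels. The intrinsics, voxel size and inputs below are stand-in assumptions, not the paper's pipeline.

```python
# Hedged sketch: lifting 2D foundation-model outputs to 3D voxel pseudo-labels.
import numpy as np

def lift_to_voxels(sem_map, depth_map, K, voxel_size=0.5):
    """sem_map: (H, W) class ids; depth_map: (H, W) meters; K: 3x3 intrinsics.
    Returns voxel indices with pseudo semantic labels in camera coordinates."""
    H, W = depth_map.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth_map.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]    # back-project pixels to 3D
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    voxels = np.floor(pts / voxel_size).astype(int)
    return voxels, sem_map.ravel()             # (N, 3) indices, (N,) labels

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
voxels, labels = lift_to_voxels(np.random.randint(0, 17, (480, 640)),
                                np.random.uniform(1, 40, (480, 640)), K)
```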
arXiv Detail & Related papers (2024-08-21T12:13:18Z)
- UnO: Unsupervised Occupancy Fields for Perception and Forecasting [33.205064287409094]
Supervised approaches leverage annotated object labels to learn a model of the world.
We learn to perceive and forecast a continuous 4D occupancy field with self-supervision from LiDAR data.
This unsupervised world model can be easily and effectively transferred to downstream tasks.
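An illustrative sketch in the spirit of a continuous 4D occupancy field: an MLP maps a spacetime query (x, y, z, t) to an occupancy logit, and LiDAR rays supply free supervision, with samples before a return labeled free and the return point labeled occupied. Shapes and the sampling scheme are assumptions, not UnO's method.

```python
# Hedged sketch: continuous 4D occupancy field self-supervised by LiDAR rays.
import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(4, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, 1))   # (x, y, z, t) -> occupancy logit

def lidar_self_supervision(origins, returns, t):
    """origins/returns: (N, 3) ray start and hit points observed at time t."""
    u = torch.rand(len(origins), 1)
    free_pts = origins + u * 0.9 * (returns - origins)  # strictly before hit
    occ_pts = returns                                   # the lidar return
    pts = torch.cat([free_pts, occ_pts])
    tcol = torch.full((len(pts), 1), t)
    target = torch.cat([torch.zeros(len(free_pts), 1),
                        torch.ones(len(occ_pts), 1)])
    logits = field(torch.cat([pts, tcol], dim=1))
    return nn.functional.binary_cross_entropy_with_logits(logits, target)

loss = lidar_self_supervision(torch.zeros(256, 3), torch.randn(256, 3) * 20, 0.1)
loss.backward()
```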
arXiv Detail & Related papers (2024-06-12T23:22:23Z)
- DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z)
- 3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.
We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action.
Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
arXiv Detail & Related papers (2024-03-14T17:58:41Z)
- OccFlowNet: Towards Self-supervised Occupancy Estimation via Differentiable Rendering and Occupancy Flow [0.6577148087211809]
We present a novel approach to occupancy estimation inspired by neural radiance fields (NeRFs), using only 2D labels.
We employ differentiable volumetric rendering to predict depth and semantic maps and train a 3D network based on 2D supervision only.
arXiv Detail & Related papers (2024-02-20T08:04:12Z)
- CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting [15.392692128626809]
We propose CARFF, a method for predicting future 3D scenes given past observations.
We employ a two-stage training of Pose-Conditional-VAE and NeRF to learn 3D representations.
We demonstrate the utility of our method in scenarios using the CARLA driving simulator.
arXiv Detail & Related papers (2024-01-31T18:56:09Z)
- SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations [76.45009891152178]
The pretraining-finetuning approach can alleviate the labeling burden by fine-tuning a pre-trained backbone across various downstream datasets as well as tasks.
We show, for the first time, that general representation learning can be achieved through the task of occupancy prediction.
Our findings will facilitate the understanding of LiDAR points and pave the way for future advancements in LiDAR pre-training.
arXiv Detail & Related papers (2023-09-19T11:13:01Z)
- Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting [79.34357055254239]
Hand trajectory forecasting is crucial for enabling a prompt understanding of human intentions when interacting with AR/VR systems.
Existing methods handle this problem in 2D image space, which is inadequate for 3D real-world applications.
We set up an egocentric 3D hand trajectory forecasting task that aims to predict hand trajectories in a 3D space from early observed RGB videos in a first-person view.
arXiv Detail & Related papers (2023-07-17T04:55:02Z)
- Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds [96.9027094562957]
We introduce a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled 3D point clouds.
Inspired by how infants learn from visual data in the wild, we explore rich cues derived from the 3D data.
STRL takes two temporally-related frames from a 3D point cloud sequence as input, transforms them with spatial data augmentation, and learns the invariant representation in a self-supervised manner.
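A toy sketch of this recipe: two temporally adjacent frames receive independent spatial augmentations, and a Siamese cosine loss (a stand-in assumption for the paper's exact objective) pulls their embeddings together so the representation becomes invariant to the transformations.

```python
# Hedged sketch: invariance learning over two augmented point-cloud frames.
import math
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 128))

def augment(points):
    """Random rotation about z plus jitter -- a simple spatial augmentation."""
    theta = torch.rand(1).item() * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])
    return points @ rot.T + 0.01 * torch.randn_like(points)

def invariance_loss(frame_t, frame_t1):
    z1 = encoder(augment(frame_t)).mean(dim=0)   # global embedding, frame t
    z2 = encoder(augment(frame_t1)).mean(dim=0)  # global embedding, frame t+1
    return 1 - nn.functional.cosine_similarity(z1, z2, dim=0)

loss = invariance_loss(torch.randn(2048, 3), torch.randn(2048, 3))
loss.backward()
```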
arXiv Detail & Related papers (2021-09-01T04:17:11Z)