OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
- URL: http://arxiv.org/abs/2311.16038v1
- Date: Mon, 27 Nov 2023 17:59:41 GMT
- Title: OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
- Authors: Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan,
Jiwen Lu
- Abstract summary: We explore a new framework, OccWorld, for learning a world model in the 3D occupancy space.
We simultaneously predict the movement of the ego car and the evolution of the surrounding scenes.
OccWorld produces competitive planning results without using instance and map supervision.
- Score: 67.49461023261536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding how the 3D scene evolves is vital for making decisions in
autonomous driving. Most existing methods achieve this by predicting the
movements of object boxes, which cannot capture finer-grained scene
information. In this paper, we explore a new framework, OccWorld, for learning
a world model in the 3D occupancy space to simultaneously predict the
movement of the ego car and the evolution of the surrounding scenes. We propose
to learn a world model based on 3D occupancy rather than 3D bounding boxes and
segmentation maps for three reasons: 1) expressiveness: 3D occupancy can
describe the fine-grained 3D structure of the scene; 2) efficiency: 3D
occupancy is more economical to obtain (e.g., from sparse LiDAR points); 3)
versatility: 3D occupancy can adapt to both vision and LiDAR. To facilitate the
modeling of the world evolution, we learn a reconstruction-based scene
tokenizer on the 3D occupancy to obtain discrete scene tokens to describe the
surrounding scenes. We then adopt a GPT-like spatial-temporal generative
transformer to generate subsequent scene and ego tokens to decode the future
occupancy and ego trajectory. Extensive experiments on the widely used nuScenes
benchmark demonstrate the ability of OccWorld to effectively model the
evolution of the driving scenes. OccWorld also produces competitive planning
results without using instance and map supervision. Code:
https://github.com/wzzheng/OccWorld.
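The abstract sketches a two-stage pipeline: a reconstruction-based tokenizer that discretizes 3D occupancy into scene tokens, and a GPT-like transformer that autoregressively predicts future scene and ego tokens, which are decoded back into occupancy and a trajectory. The PyTorch sketch below illustrates that general shape only; every module layout, shape, and hyperparameter here is an illustrative assumption, not the authors' implementation (see https://github.com/wzzheng/OccWorld for the real code).

```python
import torch
import torch.nn as nn

class SceneTokenizer(nn.Module):
    """VQ-style tokenizer sketch: encode occupancy, snap to a codebook, decode.

    Assumes occupancy arrives as (B, C, X, Y) with the height axis already
    folded into the channel dimension; real layouts may differ.
    """
    def __init__(self, in_channels=288, embed_dim=128, codebook_size=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, 3, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, embed_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, embed_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(embed_dim, in_channels, 4, stride=2, padding=1),
        )

    def forward(self, occ):
        z = self.encoder(occ)                                  # (B, D, h, w)
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])   # (B*h*w, D)
        # Nearest codebook entry per location; a real VQ-VAE would add a
        # straight-through estimator so gradients pass through the argmin.
        ids = torch.cdist(flat, self.codebook.weight).argmin(-1)
        ids = ids.reshape(z.shape[0], z.shape[2], z.shape[3])  # discrete scene tokens
        recon = self.decoder(self.codebook(ids).permute(0, 3, 1, 2))
        return ids, recon                                      # trained with a reconstruction loss

class WorldModel(nn.Module):
    """GPT-like transformer over flattened scene tokens plus one ego token per frame."""
    def __init__(self, codebook_size=512, embed_dim=128, n_head=8, n_layer=4):
        super().__init__()
        self.tok_emb = nn.Embedding(codebook_size, embed_dim)
        self.ego_emb = nn.Linear(2, embed_dim)                 # ego pose as a 2D waypoint
        layer = nn.TransformerEncoderLayer(embed_dim, n_head, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.scene_head = nn.Linear(embed_dim, codebook_size)  # next scene tokens
        self.ego_head = nn.Linear(embed_dim, 2)                # next ego waypoint

    def forward(self, scene_ids, ego_xy):
        # scene_ids: (B, T, N) token ids per frame; ego_xy: (B, T, 2)
        B, T, N = scene_ids.shape
        x = torch.cat([self.tok_emb(scene_ids),
                       self.ego_emb(ego_xy).unsqueeze(2)], dim=2)  # (B, T, N+1, D)
        x = x.reshape(B, T * (N + 1), -1)
        # Simple fully causal mask; the paper's spatial-temporal attention is
        # more structured than this per-token causality.
        mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1]).to(x.device)
        h = self.blocks(x, mask=mask).reshape(B, T, N + 1, -1)
        return self.scene_head(h[:, :, :N]), self.ego_head(h[:, :, N])
```

At inference time, predicted scene tokens would be passed through the tokenizer's decoder to obtain future occupancy, while the ego head yields the planned trajectory, mirroring the decode step the abstract describes.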
Related papers
- OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving [62.54220021308464]
We propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving.
OccSora can generate 16s videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
arXiv Detail & Related papers (2024-05-30T17:59:42Z)
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding [37.47195477043883]
3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents.
We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes.
We demonstrate that this scaling enables a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning.
arXiv Detail & Related papers (2024-01-17T17:04:35Z)
- SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction [77.15924044466976]
We propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences.
We first transform the images into 3D space (e.g., bird's eye view) to obtain a 3D representation of the scene.
We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations (a toy sketch of this rendering-as-supervision idea appears after this list).
arXiv Detail & Related papers (2023-11-21T17:59:14Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method to predict 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
- SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections [49.802462165826554]
We present SceneDreamer, an unconditional generative model for unbounded 3D scenes.
Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations.
arXiv Detail & Related papers (2023-02-02T18:59:16Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- Curiosity-driven 3D Scene Structure from Single-image Self-supervision [22.527696847086574]
Previous work has demonstrated learning isolated 3D objects from 2D-only self-supervision.
Here we set out to extend this to entire 3D scenes made out of multiple objects, including their location, orientation and type.
The resulting system converts 2D images of different virtual or real environments into complete 3D scenes, learned only from 2D images of those scenes.
arXiv Detail & Related papers (2020-12-02T14:17:16Z)
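As a companion to the SelfOcc entry above, the toy sketch below illustrates the rendering-as-supervision idea in general terms: densities sampled along camera rays from a learned 3D representation are alpha-composited into a depth map, and a loss on the rendered output pushes gradients back into the 3D representation. The ray count, sample spacing, and constant depth target are placeholder assumptions, not SelfOcc's actual code.

```python
import torch

def render_depth(sigma, t_vals):
    """Alpha-composite per-sample densities along each ray into a depth map.

    sigma:  (num_rays, num_samples) non-negative densities sampled along rays.
    t_vals: (num_samples,) distances of the samples from the camera.
    """
    alpha = 1.0 - torch.exp(-sigma)                       # opacity per sample
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                               # contribution per sample
    return (weights * t_vals).sum(dim=1)                  # expected ray depth

# Stand-in for densities produced by a learned 3D representation; a constant
# depth target stands in for the supervision signal.
sigma = torch.rand(1024, 64, requires_grad=True)          # 1024 rays, 64 samples each
t_vals = torch.linspace(0.5, 40.0, 64)                    # sample depths in meters
loss = ((render_depth(sigma, t_vals) - torch.full((1024,), 10.0)) ** 2).mean()
loss.backward()                                           # gradients reach the 3D representation
```

In an actual self-supervised setup, the supervision would come from the video itself (e.g., photometric consistency with rendered previous and future frames) rather than a constant target.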