OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
- URL: http://arxiv.org/abs/2311.16038v1
- Date: Mon, 27 Nov 2023 17:59:41 GMT
- Title: OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
- Authors: Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan,
Jiwen Lu
- Abstract summary: We explore a new framework, OccWorld, for learning a world model in the 3D occupancy space.
We simultaneously predict the movement of the ego car and the evolution of the surrounding scenes.
OccWorld produces competitive planning results without using instance and map supervision.
- Score: 67.49461023261536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding how the 3D scene evolves is vital for making decisions in
autonomous driving. Most existing methods achieve this by predicting the
movements of object boxes, which cannot capture finer-grained scene
information. In this paper, we explore a new framework, OccWorld, for learning
a world model in the 3D occupancy space to simultaneously predict the
movement of the ego car and the evolution of the surrounding scenes. We propose
to learn a world model based on 3D occupancy rather than 3D bounding boxes and
segmentation maps for three reasons: 1) expressiveness: 3D occupancy can
describe the fine-grained 3D structure of the scene; 2) efficiency: 3D
occupancy is more economical to obtain (e.g., from sparse LiDAR points); 3)
versatility: 3D occupancy can adapt to both vision and LiDAR. To facilitate the
modeling of the world evolution, we learn a reconstruction-based scene
tokenizer on the 3D occupancy to obtain discrete scene tokens to describe the
surrounding scenes. We then adopt a GPT-like spatial-temporal generative
transformer to generate subsequent scene and ego tokens to decode the future
occupancy and ego trajectory. Extensive experiments on the widely used nuScenes
benchmark demonstrate the ability of OccWorld to effectively model the
evolution of the driving scenes. OccWorld also produces competitive planning
results without using instance and map supervision. Code:
https://github.com/wzzheng/OccWorld.
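The abstract sketches a two-stage pipeline: a reconstruction-based tokenizer that discretizes 3D occupancy into scene tokens, and a GPT-like transformer that autoregressively predicts future scene and ego tokens, which are decoded back into occupancy and a trajectory. The PyTorch sketch below illustrates that general shape only; every module layout, shape, and hyperparameter here is an illustrative assumption, not the authors' implementation (see https://github.com/wzzheng/OccWorld for the real code).

```python
import torch
import torch.nn as nn

class SceneTokenizer(nn.Module):
    """VQ-style tokenizer sketch: encode occupancy, snap to a codebook, decode.

    Assumes occupancy arrives as (B, C, X, Y) with the height axis already
    folded into the channel dimension; real layouts may differ.
    """
    def __init__(self, in_channels=288, embed_dim=128, codebook_size=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, 3, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, embed_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, embed_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(embed_dim, in_channels, 4, stride=2, padding=1),
        )

    def forward(self, occ):
        z = self.encoder(occ)                                  # (B, D, h, w)
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])   # (B*h*w, D)
        # Nearest codebook entry per location; a real VQ-VAE would add a
        # straight-through estimator so gradients pass through the argmin.
        ids = torch.cdist(flat, self.codebook.weight).argmin(-1)
        ids = ids.reshape(z.shape[0], z.shape[2], z.shape[3])  # discrete scene tokens
        recon = self.decoder(self.codebook(ids).permute(0, 3, 1, 2))
        return ids, recon                                      # trained with a reconstruction loss

class WorldModel(nn.Module):
    """GPT-like transformer over flattened scene tokens plus one ego token per frame."""
    def __init__(self, codebook_size=512, embed_dim=128, n_head=8, n_layer=4):
        super().__init__()
        self.tok_emb = nn.Embedding(codebook_size, embed_dim)
        self.ego_emb = nn.Linear(2, embed_dim)                 # ego pose as a 2D waypoint
        layer = nn.TransformerEncoderLayer(embed_dim, n_head, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.scene_head = nn.Linear(embed_dim, codebook_size)  # next scene tokens
        self.ego_head = nn.Linear(embed_dim, 2)                # next ego waypoint

    def forward(self, scene_ids, ego_xy):
        # scene_ids: (B, T, N) token ids per frame; ego_xy: (B, T, 2)
        B, T, N = scene_ids.shape
        x = torch.cat([self.tok_emb(scene_ids),
                       self.ego_emb(ego_xy).unsqueeze(2)], dim=2)  # (B, T, N+1, D)
        x = x.reshape(B, T * (N + 1), -1)
        # Simple fully causal mask; the paper's spatial-temporal attention is
        # more structured than this per-token causality.
        mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1]).to(x.device)
        h = self.blocks(x, mask=mask).reshape(B, T, N + 1, -1)
        return self.scene_head(h[:, :, :N]), self.ego_head(h[:, :, N])
```

At inference time, predicted scene tokens would be passed through the tokenizer's decoder to obtain future occupancy, while the ego head yields the planned trajectory, mirroring the decode step the abstract describes.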
Related papers
- OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving [62.54220021308464]
We propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving.
OccSora can generate 16s videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
arXiv Detail & Related papers (2024-05-30T17:59:42Z)
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding [37.47195477043883]
3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents.
We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes.
We demonstrate that this scaling enables a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning.
arXiv Detail & Related papers (2024-01-17T17:04:35Z)
- SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction [77.15924044466976]
We propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences.
We first transform the images into 3D space (e.g., bird's eye view) to obtain a 3D representation of the scene.
We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations (a toy sketch of this rendering-as-supervision idea appears after this list).
arXiv Detail & Related papers (2023-11-21T17:59:14Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method to predict 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
- SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections [49.802462165826554]
We present SceneDreamer, an unconditional generative model for unbounded 3D scenes.
Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations.
arXiv Detail & Related papers (2023-02-02T18:59:16Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- Curiosity-driven 3D Scene Structure from Single-image Self-supervision [22.527696847086574]
Previous work has demonstrated learning isolated 3D objects from 2D-only self-supervision.
Here we set out to extend this to entire 3D scenes made out of multiple objects, including their location, orientation and type.
The resulting system converts 2D images of different virtual or real environments into complete 3D scenes, learned only from 2D images of those scenes.
arXiv Detail & Related papers (2020-12-02T14:17:16Z)
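As a companion to the SelfOcc entry above, the toy sketch below illustrates the rendering-as-supervision idea in general terms: densities sampled along camera rays from a learned 3D representation are alpha-composited into a depth map, and a loss on the rendered output pushes gradients back into the 3D representation. The ray count, sample spacing, and constant depth target are placeholder assumptions, not SelfOcc's actual code.

```python
import torch

def render_depth(sigma, t_vals):
    """Alpha-composite per-sample densities along each ray into a depth map.

    sigma:  (num_rays, num_samples) non-negative densities sampled along rays.
    t_vals: (num_samples,) distances of the samples from the camera.
    """
    alpha = 1.0 - torch.exp(-sigma)                       # opacity per sample
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                               # contribution per sample
    return (weights * t_vals).sum(dim=1)                  # expected ray depth

# Stand-in for densities produced by a learned 3D representation; a constant
# depth target stands in for the supervision signal.
sigma = torch.rand(1024, 64, requires_grad=True)          # 1024 rays, 64 samples each
t_vals = torch.linspace(0.5, 40.0, 64)                    # sample depths in meters
loss = ((render_depth(sigma, t_vals) - torch.full((1024,), 10.0)) ** 2).mean()
loss.backward()                                           # gradients reach the 3D representation
```

In an actual self-supervised setup, the supervision would come from the video itself (e.g., photometric consistency with rendered previous and future frames) rather than a constant target.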