Related papers: OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving

URL: http://arxiv.org/abs/2405.20337v1
Date: Thu, 30 May 2024 17:59:42 GMT
Title: OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving
Authors: Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jiwen Lu
Abstract summary: We propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. OccSora can generate 16s-videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
Score: 62.54220021308464
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods model scene development with the motion of individual instances, world models emerge as a generative framework to describe the general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffers from inefficiency in modeling long-term temporal evolutions. To address this, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations for 4D occupancy input and achieve high-quality reconstruction for long-sequence occupancy videos. We then learn a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16s-videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving. Code is available at: https://github.com/wzzheng/OccSora.

Related papers

DreamDrive: Generative 4D Scene Modeling from Street View Images [55.45852373799639]
We present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references. We then render 3D-consistent driving videos via Gaussian splatting.
arXiv Detail & Related papers (2024-12-31T18:59:57Z)
Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model [83.31688383891871]
We propose a Spatial-Temporal simulAtion for drivinG (Stag-1) model to reconstruct real-world scenes. Stag-1 constructs continuous 4D point cloud scenes using surround-view data from autonomous vehicles. It decouples spatial-temporal relationships and produces coherent driving videos.
arXiv Detail & Related papers (2024-12-06T18:59:56Z)
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion [22.11178016375823]
DimensionX is a framework designed to generate 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames.
arXiv Detail & Related papers (2024-11-07T18:07:31Z)
DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation [32.19534057884047]
We introduce DriveDreamer4D, which enhances 4D driving scene representation leveraging world model priors. To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios.
arXiv Detail & Related papers (2024-10-17T14:07:46Z)
Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models [116.31344506738816]
We present a novel framework, Diffusion4D, for efficient and scalable 4D content generation. We develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets. Our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency.
arXiv Detail & Related papers (2024-05-26T17:47:34Z)
DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z)
TC4D: Trajectory-Conditioned Text-to-4D Generation [94.90700997568158]
We propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion.
arXiv Detail & Related papers (2024-03-26T17:55:11Z)
OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [67.49461023261536]
We propose a new framework for learning a world model, OccWorld, in the 3D occupancy space. We simultaneously predict the movement of the ego car and the evolution of the surrounding scenes. OccWorld produces competitive planning results without using instance and map supervision.
arXiv Detail & Related papers (2023-11-27T17:59:41Z)
Persistent Nature: A Generative Model of Unbounded 3D Worlds [74.51149070418002]
We present an extendable, planar scene layout grid that can be rendered from arbitrary camera poses via a 3D decoder and volume rendering. Based on this representation, we learn a generative world model solely from single-view internet photos. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation.
arXiv Detail & Related papers (2023-03-23T17:59:40Z)
Multi-frame sequence generator of 4D human body motion [0.0]
We propose a generative auto-encoder-based framework, which encodes global locomotion, including translation and rotation, and multi-frame temporal motion as a single latent space vector. Our results validate the ability of the model to reconstruct 4D sequences of human morphology within a low error bound. We also illustrate the benefits of the approach for 4D human motion prediction of future frames from initial human frames.
arXiv Detail & Related papers (2021-06-07T13:56:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.