GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
- URL: http://arxiv.org/abs/2512.12751v1
- Date: Sun, 14 Dec 2025 16:23:51 GMT
- Title: GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
- Authors: Zhenya Yang, Zhe Liu, Yuxiang Lu, Liping Hou, Chenxuan Miao, Siyi Peng, Bailan Feng, Xiang Bai, Hengshuang Zhao,
- Abstract summary: We propose GenieDrive, a framework for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.
- Score: 80.1493315900789
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A physics-aware driving world model is essential for driving planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.
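The abstract's key efficiency claim rests on factorizing a dense 3D latent volume into a tri-plane representation. The sketch below illustrates why this factorization shrinks the latent: a dense latent grows as H·W·D, while three axis-aligned planes grow as H·W + H·D + W·D. All dimensions, channel counts, and the summation-based query are hypothetical choices for illustration; the paper's actual VAE architecture and its 58% figure are not reproduced here.

```python
import numpy as np

# Hypothetical dimensions (NOT taken from the paper), chosen only to
# illustrate the scaling behavior of a tri-plane latent.
H, W, D = 200, 200, 16   # spatial resolution of the occupancy volume
C = 8                    # latent channels per voxel / per plane pixel

# A dense latent stores one C-dim feature vector per voxel: O(H*W*D*C).
dense_latent = np.zeros((H, W, D, C), dtype=np.float32)

# A tri-plane latent factorizes the volume into three axis-aligned
# feature planes: (H, W), (H, D), and (W, D) -- O((H*W + H*D + W*D)*C).
plane_hw = np.zeros((H, W, C), dtype=np.float32)
plane_hd = np.zeros((H, D, C), dtype=np.float32)
plane_wd = np.zeros((W, D, C), dtype=np.float32)

def triplane_query(x, y, z):
    """Recover a per-voxel feature by aggregating the three planes.
    Summation is one common aggregation choice in tri-plane methods;
    the paper's exact decoder is not specified here."""
    return plane_hw[x, y] + plane_hd[x, z] + plane_wd[y, z]

dense_size = dense_latent.size
triplane_size = plane_hw.size + plane_hd.size + plane_wd.size
print(f"dense latent:     {dense_size:,} floats")
print(f"tri-plane latent: {triplane_size:,} floats")
```

With these toy dimensions the tri-plane latent is well under a tenth the size of the dense one; the actual compression ratio depends on the resolution and channel choices of the real model.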
Related papers
- VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control [83.92729346325163]
VerseCrafter is a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos.
arXiv Detail & Related papers (2026-01-08T17:28:52Z) - CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving [26.379817613036597]
CVD-STORM is a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE). Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics.
arXiv Detail & Related papers (2025-10-09T08:41:58Z) - Physical Informed Driving World Model [47.04423342994622]
DrivePhysica is an innovative model designed to generate realistic driving videos that adhere to essential physical principles. We achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and downstream perception tasks.
arXiv Detail & Related papers (2024-12-11T14:29:35Z) - MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control [68.74166535159311]
MagicDrive-V2 is a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. It enables multi-view driving video synthesis with $3.3\times$ resolution and $4\times$ frame count (compared to current SOTA), rich contextual control, and geometric controls.
arXiv Detail & Related papers (2024-11-21T03:13:30Z) - DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation [32.19534057884047]
We introduce DriveDreamer4D, which enhances 4D driving scene representation leveraging world model priors.
To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios.
arXiv Detail & Related papers (2024-10-17T14:07:46Z) - DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
arXiv Detail & Related papers (2024-09-09T09:43:17Z) - DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes [15.506076058742744]
We propose DreamForge, an advanced diffusion-based autoregressive video generation model tailored for 3D-controllable long-term generation. We introduce perspective guidance and integrate object-wise position encoding to incorporate local 3D correlation. We can autoregressively generate long videos (over 200 frames) using a model trained on short sequences, achieving superior quality compared to the baseline in 16-frame video evaluations.
arXiv Detail & Related papers (2024-09-06T03:09:58Z) - Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models [116.31344506738816]
We present a novel framework, Diffusion4D, for efficient and scalable 4D content generation.
We develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets.
Our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency.
arXiv Detail & Related papers (2024-05-26T17:47:34Z) - DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatiotemporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z) - SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer [57.506654943449796]
We propose an efficient, sparse-controlled video-to-4D framework named SC4D that decouples motion and appearance.
Our method surpasses existing methods in both quality and efficiency.
We devise a novel application that seamlessly transfers motion onto a diverse array of 4D entities.
arXiv Detail & Related papers (2024-04-04T18:05:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.