InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation
- URL: http://arxiv.org/abs/2602.03242v1
- Date: Tue, 03 Feb 2026 08:22:13 GMT
- Title: InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation
- Authors: Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, Wei Wu, Yanyong Zhang
- Abstract summary: InstaDrive is a novel framework that enhances driving video realism through two key advancements. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality. Our project page is https://shanpoyang654.github.io/InstaDrive/page.html.
- Score: 53.47253633654885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA's autopilot to procedurally and stochastically simulate rare but safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems. Our project page is https://shanpoyang654.github.io/InstaDrive/page.html.
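The abstract does not detail how the Instance Flow Guider propagates instance features, so the sketch below is only a rough illustration under assumptions: previous-frame instance features are warped by a flow field and fused into the current frame's features (PyTorch; the function and module names are hypothetical).

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    """Warp previous-frame instance features to the current frame using a
    dense flow field given in normalized [-1, 1] image coordinates.

    feat: (B, C, H, W) instance features from frame t-1
    flow: (B, 2, H, W) displacement field (x, y order)
    """
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2).to(feat)
    grid = grid + flow.permute(0, 2, 3, 1)  # shift the base grid by the flow
    return F.grid_sample(feat, grid, align_corners=True)

class InstanceFlowGuider(torch.nn.Module):
    """Toy guider: fuses warped previous-frame instance features with the
    current frame's features to keep instance identity stable over time."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = torch.nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, cur_feat, prev_inst_feat, flow):
        warped = warp_with_flow(prev_inst_feat, flow)
        return self.fuse(torch.cat([cur_feat, warped], dim=1))
```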
Related papers
- ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask [65.36169132836518]
ConsisDrive is an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: Instance-Masked Attention and Instance-Masked Loss. ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.
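The abstract only names Instance-Masked Attention; one plausible reading, sketched below under assumptions, is attention restricted to tokens belonging to the same instance (all shapes and the helper name are hypothetical).

```python
import torch

def instance_masked_attention(q, k, v, inst_ids):
    """Toy instance-masked attention.

    q, k, v:  (B, N, D) token features (N = flattened spatial positions)
    inst_ids: (B, N) integer instance id per token (0 = background)
    Each token may attend only to tokens carrying the same instance id,
    which keeps an instance's appearance from mixing with its neighbours.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bnd,bmd->bnm", q, k) * scale
    same_instance = inst_ids[:, :, None] == inst_ids[:, None, :]  # (B, N, N)
    logits = logits.masked_fill(~same_instance, float("-inf"))
    return logits.softmax(dim=-1) @ v
```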
arXiv Detail & Related papers (2026-02-03T07:28:44Z) - DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving [49.11389494068169]
We present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen comprises a diverse evaluation dataset curated from both driving datasets and internet-scale video sources. General models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality.
arXiv Detail & Related papers (2026-01-04T13:36:21Z) - Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
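As a rough illustration of what occupancy-centric conditioning means, the hedged sketch below rasterizes 3D boxes into a binary occupancy volume (yaw is ignored and all grid parameters are assumptions, not the paper's settings).

```python
import numpy as np

def boxes_to_occupancy(boxes, grid=(200, 200, 16), voxel=0.5,
                       origin=(-50.0, -50.0, -3.0)):
    """Rasterize axis-aligned 3D boxes (x, y, z, dx, dy, dz) in ego
    coordinates into a binary occupancy volume used as conditioning."""
    occ = np.zeros(grid, dtype=np.uint8)
    origin = np.asarray(origin, dtype=np.float64)
    size = np.asarray(grid)
    for x, y, z, dx, dy, dz in boxes:
        center = np.array([x, y, z], dtype=np.float64)
        half = np.array([dx, dy, dz], dtype=np.float64) / 2.0
        lo = np.clip(np.floor((center - half - origin) / voxel), 0, size - 1).astype(int)
        hi = np.clip(np.ceil((center + half - origin) / voxel), 1, size).astype(int)
        occ[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = 1
    return occ

# Example: a 4 m x 2 m x 1.5 m vehicle 10 m ahead of the ego car.
occ = boxes_to_occupancy([(10.0, 0.0, 0.0, 4.0, 2.0, 1.5)])
```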
arXiv Detail & Related papers (2025-10-27T03:52:45Z) - SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain [25.44145750579996]
We introduce SafeMVDrive, a framework to generate safety-critical, multi-view driving videos grounded in real-world domains. We first enhance the scene-understanding ability of the trajectory generator by incorporating visual context. We then introduce a two-stage, controllable trajectory generation mechanism that produces collision-evasion trajectories.
arXiv Detail & Related papers (2025-05-23T10:45:43Z) - DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT [33.943125216555316]
We present DrivingWorld, a GPT-style world model for autonomous driving. We propose a next-state prediction strategy to model temporal coherence between consecutive frames. We also propose a novel masking strategy and reweighting strategy for token prediction to mitigate long-term drifting issues.
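The abstract gives no details of the masking and reweighting strategies, so the following is only a toy sketch of how such ideas might enter a next-token loss, not the paper's actual scheme (PyTorch; all names and constants are assumptions).

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, targets, drop_prob=0.15, decay=0.98):
    """Toy next-token objective with two tricks loosely inspired by the
    abstract: randomly drop some positions from the loss and down-weight
    late time steps to reduce error accumulation.

    logits:  (B, T, V) predictions for the next video token
    targets: (B, T)    ground-truth token ids
    """
    B, T, V = logits.shape
    # Random token masking: ignore a fraction of positions in the loss.
    keep = (torch.rand(B, T, device=logits.device) > drop_prob).float()
    # Exponential reweighting over time (an assumption, not the paper's rule).
    w = decay ** torch.arange(T, device=logits.device).float()
    ce = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1),
                         reduction="none").reshape(B, T)
    return (ce * keep * w).sum() / (keep * w).sum().clamp(min=1e-6)
```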
arXiv Detail & Related papers (2024-12-27T07:44:07Z) - Physical Informed Driving World Model [47.04423342994622]
DrivePhysica is an innovative model designed to generate realistic driving videos that adhere to essential physical principles. We achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and downstream perception tasks.
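Both FID and FVD reported here are Fréchet distances between Gaussians fit to deep features (Inception features for FID, I3D video features for FVD); a minimal implementation of that shared statistic, with feature extraction assumed to have happened elsewhere:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fit to two feature sets.

    feats_*: (N, D) arrays of per-sample features.
    """
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerics
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```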
arXiv Detail & Related papers (2024-12-11T14:29:35Z) - Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
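The holistic-4D attention idea can be pictured as flattening views, frames, and spatial positions into a single token sequence before attention; the sketch below is that simplification, not CogDriving's actual module.

```python
import torch

def holistic_attention(x, attn):
    """Apply one attention pass jointly over views, frames, and space.

    x:    (B, V, T, H, W, D) multi-view video features
    attn: any module mapping (B, N, D) -> (B, N, D), e.g. a self-attention
          block from a Diffusion Transformer
    Rather than attending separately per axis, all positions are flattened
    into one sequence so every token can attend to every other token.
    """
    B, V, T, H, W, D = x.shape
    tokens = x.reshape(B, V * T * H * W, D)
    return attn(tokens).reshape(B, V, T, H, W, D)
```

The flattened sequence has V * T * H * W tokens, which is what makes such joint attention expressive but also expensive.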
arXiv Detail & Related papers (2024-12-04T18:02:49Z) - MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control [68.74166535159311]
MagicDrive-V2 is a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. It supports multi-view driving video synthesis with $3.3\times$ the resolution and $4\times$ the frame count of the current SOTA, rich contextual control, and geometric controls.
arXiv Detail & Related papers (2024-11-21T03:13:30Z) - Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey [61.39993881402787]
World models and video generation are pivotal technologies in the domain of autonomous driving.
This paper investigates the relationship between these two technologies.
By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions.
arXiv Detail & Related papers (2024-11-05T08:58:35Z) - DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
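The abstract does not describe the internals of the Bi-Directional Modulated Transformer mentioned above; the sketch below assumes a FiLM-style scale-and-shift modulation applied in both directions between video tokens and 3D-condition tokens, purely as an illustration of bi-directional modulation (PyTorch; all names are hypothetical).

```python
import torch
import torch.nn as nn

class BiDirectionalModulation(nn.Module):
    """Toy bi-directional modulation: video tokens are modulated by pooled
    3D-condition tokens (boxes, maps, etc.) and the condition tokens are in
    turn modulated by pooled video tokens, so structure flows both ways."""
    def __init__(self, dim):
        super().__init__()
        self.to_vid = nn.Linear(dim, 2 * dim)   # condition -> video scale/shift
        self.to_cond = nn.Linear(dim, 2 * dim)  # video -> condition scale/shift

    def forward(self, vid, cond):
        # vid: (B, N, D) video tokens; cond: (B, M, D) 3D-condition tokens
        c = cond.mean(dim=1, keepdim=True)
        v = vid.mean(dim=1, keepdim=True)
        s1, b1 = self.to_vid(c).chunk(2, dim=-1)
        s2, b2 = self.to_cond(v).chunk(2, dim=-1)
        vid = vid * (1 + s1) + b1
        cond = cond * (1 + s2) + b2
        return vid, cond
```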
arXiv Detail & Related papers (2024-09-09T09:43:17Z)