ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask
- URL: http://arxiv.org/abs/2602.03213v3
- Date: Tue, 10 Feb 2026 04:10:55 GMT
- Title: ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask
- Authors: Zhuoran Yang, Yanyong Zhang
- Abstract summary: ConsisDrive is an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: Instance-Masked Attention and Instance-Masked Loss. ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.
- Score: 65.36169132836518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.
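The two components named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the trajectory masks, the masking schedule, or the loss weighting. Every name here (`instance_masked_attention`, `instance_masked_loss`, `keep_prob`, `fg_weight`) is hypothetical, and the loss is a simple reweighted mean squared error chosen only to show the idea of probabilistically emphasizing foreground regions.

```python
import math
import random


def instance_masked_attention(queries, keys, values, query_ids, key_ids):
    """Scaled dot-product attention restricted by instance identity.

    A visual token (query) may attend only to instance features (keys)
    that carry the same instance id. Tokens with no matching key, such as
    background tokens, receive a zero vector. Hypothetical sketch.
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for q, q_id in zip(queries, query_ids):
        # Indices of the keys this token is allowed to see.
        allowed = [j for j, k_id in enumerate(key_ids) if k_id == q_id]
        if not allowed:
            outputs.append([0.0] * out_dim)
            continue
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        # Numerically stable softmax over the allowed keys only.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out = [0.0] * out_dim
        for w, j in zip(weights, allowed):
            for d, v in enumerate(values[j]):
                out[d] += (w / total) * v
        outputs.append(out)
    return outputs


def instance_masked_loss(pred, target, foreground,
                         keep_prob=0.5, fg_weight=2.0, rng=None):
    """Weighted MSE that probabilistically emphasizes foreground pixels.

    Each foreground pixel is up-weighted by fg_weight with probability
    keep_prob; all other pixels keep weight 1, so the background still
    contributes to the loss. Assumed form, not the paper's definition.
    """
    if rng is None:
        rng = random.Random(0)
    weighted_sum = 0.0
    weight_total = 0.0
    for p, t, fg in zip(pred, target, foreground):
        w = fg_weight if (fg and rng.random() < keep_prob) else 1.0
        weighted_sum += w * (p - t) ** 2
        weight_total += w
    return weighted_sum / weight_total
```

In this sketch the identity mask is applied by dropping disallowed keys before the softmax, which is equivalent to setting their attention scores to negative infinity. Keeping weight 1 on background pixels reflects the abstract's stated goal of reducing background noise while maintaining overall scene fidelity.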
Related papers
- InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation [53.47253633654885]
InstaDrive is a novel framework that enhances driving video realism through two key advancements. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality. Our project page is https://shanpoyang654.io/InstaDrive/page.html.
arXiv Detail & Related papers (2026-02-03T08:22:13Z)
- DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving [49.11389494068169]
We present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources. General models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality.
arXiv Detail & Related papers (2026-01-04T13:36:21Z)
- Optimization-Guided Diffusion for Interactive Scene Generation [52.23368750264419]
We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling. We show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes. Our approach can also generate $5\times$ more near-collision frames with a time-to-collision under three seconds.
arXiv Detail & Related papers (2025-12-08T15:56:18Z)
- Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used nuPlan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z)
- InstDrive: Instance-Aware 3D Gaussian Splatting for Driving Scenes [30.149975412543444]
In this paper, we present InstDrive, an instance-aware 3D Gaussian Splatting framework tailored for the interactive reconstruction of dynamic driving scenes. We use masks generated by SAM as pseudo ground-truth to guide 2D feature learning via contrastive loss and pseudo-supervised objectives. At the 3D level, we introduce regularization to implicitly encode instance identities and enforce consistency through a voxel-based loss.
arXiv Detail & Related papers (2025-08-16T11:17:31Z)
- AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving [29.420887070252274]
We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions. Our model incorporates visibility reasoning and physically rigid regularization to enhance robustness.
arXiv Detail & Related papers (2025-07-16T11:10:57Z)
- Physical Informed Driving World Model [47.04423342994622]
DrivePhysica is an innovative model designed to generate realistic driving videos that adhere to essential physical principles. We achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and downstream perception tasks.
arXiv Detail & Related papers (2024-12-11T14:29:35Z)
- Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the dimensions. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
arXiv Detail & Related papers (2024-12-04T18:02:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.