InstDrive: Instance-Aware 3D Gaussian Splatting for Driving Scenes
- URL: http://arxiv.org/abs/2508.12015v2
- Date: Wed, 29 Oct 2025 07:05:00 GMT
- Title: InstDrive: Instance-Aware 3D Gaussian Splatting for Driving Scenes
- Authors: Hongyuan Liu, Haochen Yu, Bochao Zou, Jianfei Jiang, Qiankun Liu, Jiansheng Chen, Huimin Ma
- Abstract summary: In this paper, we present InstDrive, an instance-aware 3D Gaussian Splatting framework tailored for the interactive reconstruction of dynamic driving scenes. We use masks generated by SAM as pseudo ground-truth to guide 2D feature learning via contrastive loss and pseudo-supervised objectives. At the 3D level, we introduce regularization to implicitly encode instance identities and enforce consistency through a voxel-based loss.
- Score: 30.149975412543444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reconstructing dynamic driving scenes from dashcam videos has attracted increasing attention due to its significance in autonomous driving and scene understanding. While recent advances have made impressive progress, most methods still unify all background elements into a single representation, hindering both instance-level understanding and flexible scene editing. Some approaches attempt to lift 2D segmentation into 3D space, but often rely on pre-processed instance IDs or complex pipelines to map continuous features to discrete identities. Moreover, these methods are typically designed for indoor scenes with rich viewpoints, making them less applicable to outdoor driving scenarios. In this paper, we present InstDrive, an instance-aware 3D Gaussian Splatting framework tailored for the interactive reconstruction of dynamic driving scenes. We use masks generated by SAM as pseudo ground-truth to guide 2D feature learning via contrastive loss and pseudo-supervised objectives. At the 3D level, we introduce regularization to implicitly encode instance identities and enforce consistency through a voxel-based loss. A lightweight static codebook further bridges continuous features and discrete identities without requiring data pre-processing or complex optimization. Quantitative and qualitative experiments demonstrate the effectiveness of InstDrive, and to the best of our knowledge, it is the first framework to achieve 3D instance segmentation in dynamic, open-world driving scenes. More visualizations are available on our project page.
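To make the mask-guided 2D objective concrete, here is a minimal PyTorch sketch of a contrastive loss that pulls rendered per-pixel features within one SAM mask toward a shared prototype and pushes them away from other masks. The function name, tensor shapes, and prototype-based formulation are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def mask_contrastive_loss(features, sam_masks, temperature=0.1):
    """Pull per-pixel features of the same SAM mask together and push
    different masks apart (hypothetical name/shapes, not the paper's code).

    features:  (N, C) rendered 2D feature map, flattened over pixels
    sam_masks: (N,)   integer SAM mask ID per pixel (pseudo ground-truth)
    """
    feats = F.normalize(features, dim=-1)
    ids = sam_masks.unique()  # sorted unique mask IDs, K of them
    # The mean feature of each mask serves as its prototype.
    protos = torch.stack([feats[sam_masks == i].mean(dim=0) for i in ids])
    protos = F.normalize(protos, dim=-1)
    logits = feats @ protos.t() / temperature   # (N, K) pixel-to-prototype sims
    targets = torch.bucketize(sam_masks, ids)   # remap raw IDs to 0..K-1
    return F.cross_entropy(logits, targets)
```

A call such as `mask_contrastive_loss(feat_map.view(-1, C), masks.view(-1))` would apply it to one rendered frame; how InstDrive batches pixels and combines this with its pseudo-supervised terms is not specified here.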
Related papers
- ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask [65.36169132836518]
ConsisDrive is an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: Instance-Masked Attention and Instance-Masked Loss. ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.
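As a rough sketch of what instance-masked attention can look like, the snippet below restricts each token to attend only within its own instance. This is an assumed single-head variant; ConsisDrive's actual design (multi-head layout, where the masks enter, how it pairs with the Instance-Masked Loss) may differ.

```python
import torch

def instance_masked_attention(q, k, v, inst_ids):
    """Single-head attention in which each token attends only to tokens
    of the same instance. q, k, v: (N, C); inst_ids: (N,) ID per token."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.t()) * scale                      # (N, N) raw scores
    same = inst_ids[:, None] == inst_ids[None, :]   # same-instance pairs only
    attn = attn.masked_fill(~same, float("-inf"))   # forbid cross-instance looks
    return attn.softmax(dim=-1) @ v                 # diagonal keeps rows finite
```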
arXiv Detail & Related papers (2026-02-03T07:28:44Z)
- MADrive: Memory-Augmented Driving Scene Modeling [8.604680698214196]
MADrive is a memory-augmented reconstruction framework designed to extend the capabilities of existing scene reconstruction methods. It replaces observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank. The resulting replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations.
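The retrieval step can be pictured as a nearest-neighbor lookup over appearance embeddings, as in the sketch below. The function name and cosine-similarity metric are assumptions; MADrive's actual retrieval pipeline may be more involved.

```python
import torch
import torch.nn.functional as F

def retrieve_assets(vehicle_embeds, bank_embeds, bank_ids, k=1):
    """Look up the k most visually similar 3D assets in a memory bank.

    vehicle_embeds: (M, D) appearance embeddings of observed vehicles
    bank_embeds:    (N, D) embeddings of the stored 3D assets
    bank_ids:       list of N asset identifiers
    """
    q = F.normalize(vehicle_embeds, dim=-1)
    b = F.normalize(bank_embeds, dim=-1)
    sims = q @ b.t()                        # (M, N) cosine similarities
    top = sims.topk(k, dim=-1).indices      # best bank entries per vehicle
    return [[bank_ids[j] for j in row] for row in top.tolist()]
```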
arXiv Detail & Related papers (2025-06-26T17:41:07Z)
- SIRE: SE(3) Intrinsic Rigidity Embeddings [16.630400019100943]
We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss. Our findings suggest that SIRE can learn strong geometry and motion rigidity priors from video data, with minimal supervision.
arXiv Detail & Related papers (2025-03-10T18:00:30Z)
- DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance [5.113012982922924]
We present DualDiff, a conditional diffusion model designed to enhance driving scene generation across multiple views and video sequences. To improve synthesis of fine-grained foreground objects, we propose a Foreground-Aware Mask (FGM) denoising loss function. We also develop the Semantic Fusion Attention (SFA) mechanism to dynamically prioritize relevant information and suppress noise.
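One plausible reading of a foreground-aware denoising loss is the standard epsilon-prediction MSE with extra weight on foreground pixels, sketched below. The function name and the specific weighting scheme are assumptions, not DualDiff's published formulation.

```python
import torch
import torch.nn.functional as F

def fg_weighted_denoising_loss(eps_pred, eps_true, fg_mask, fg_weight=2.0):
    """Epsilon-prediction MSE, up-weighted on foreground pixels.

    eps_pred, eps_true: (B, C, H, W) predicted / true diffusion noise
    fg_mask:            (B, 1, H, W) binary foreground mask
    """
    per_pixel = F.mse_loss(eps_pred, eps_true, reduction="none")
    weights = 1.0 + (fg_weight - 1.0) * fg_mask  # 1 on bg, fg_weight on fg
    return (weights * per_pixel).mean()
```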
arXiv Detail & Related papers (2025-03-05T17:31:45Z)
- Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning [24.511628941825116]
We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning framework. It reconstructs the implicit cognitive chain inherent in human driving, covering scene understanding, meta-action reasoning, behavior interpretation analysis, motion planning and control. It achieves state-of-the-art performance from scene understanding to end-to-end driving, as well as robust generalization on the CARLA Bench2Drive benchmark.
arXiv Detail & Related papers (2025-02-19T09:50:44Z)
- DreamDrive: Generative 4D Scene Modeling from Street View Images [55.45852373799639]
We present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references. We then render 3D-consistent driving videos via Gaussian splatting.
arXiv Detail & Related papers (2024-12-31T18:59:57Z)
- UniScene: Unified Occupancy-centric Driving Scene Generation [73.22859345600192]
We introduce UniScene, the first unified framework for generating three key data forms - semantic occupancy, video, and LiDAR. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps. Extensive experiments demonstrate that UniScene outperforms previous SOTAs in occupancy, video, and LiDAR generation.
arXiv Detail & Related papers (2024-12-06T21:41:52Z)
- MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes [72.02827211293736]
MagicDrive3D is a novel framework for controllable 3D street scene generation. It supports multi-condition control, including road maps, 3D objects, and text descriptions. It generates diverse, high-quality 3D driving scenes, supports any-view rendering, and enhances downstream tasks like BEV segmentation.
arXiv Detail & Related papers (2024-05-23T12:04:51Z)
- DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes [57.12439406121721]
We present DrivingGaussian, an efficient and effective framework for surrounding dynamic autonomous driving scenes.
For complex scenes with moving objects, we first sequentially and progressively model the static background of the entire scene.
We then leverage a composite dynamic Gaussian graph to handle multiple moving objects.
We further use a LiDAR prior for Gaussian Splatting to reconstruct scenes with greater details and maintain panoramic consistency.
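The background-plus-objects composition can be pictured as concatenating a static Gaussian set with each dynamic object's Gaussians transformed by its per-frame pose, as in the sketch below. All names and the packed attribute layout are assumptions; DrivingGaussian's composite dynamic Gaussian graph also models relations that are omitted here.

```python
from dataclasses import dataclass
from typing import Dict
import torch

@dataclass
class GaussianSet:
    means: torch.Tensor  # (N, 3) Gaussian centers
    attrs: torch.Tensor  # (N, F) packed opacity/scale/rotation/color

def compose_frame(static_bg: GaussianSet,
                  objects: Dict[int, GaussianSet],
                  poses: Dict[int, torch.Tensor]) -> GaussianSet:
    """Merge the static background with each dynamic object placed at its
    per-frame pose. poses[obj_id] is a (4, 4) object-to-world transform."""
    means, attrs = [static_bg.means], [static_bg.attrs]
    for obj_id, g in objects.items():
        homog = torch.cat([g.means, torch.ones_like(g.means[:, :1])], dim=-1)
        means.append((homog @ poses[obj_id].t())[:, :3])  # into world frame
        attrs.append(g.attrs)
    return GaussianSet(torch.cat(means), torch.cat(attrs))
```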
arXiv Detail & Related papers (2023-12-13T06:30:51Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc to predict 3D occupancy from multi-camera images.
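For intuition, occupancy prediction ends in a per-voxel classifier over a fused 3D feature volume; the toy head below shows only that last stage. The multi-camera 2D-to-3D lifting (cross-view attention in SurroundOcc) is assumed to happen upstream, and the class count and channel sizes are placeholders.

```python
import torch
import torch.nn as nn

class OccupancyHead(nn.Module):
    """Decode a fused voxel feature volume into per-voxel semantic logits."""
    def __init__(self, feat_dim=64, num_classes=17):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv3d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_dim, num_classes, kernel_size=1),
        )

    def forward(self, voxel_feats):        # (B, C, X, Y, Z)
        return self.decoder(voxel_feats)   # (B, num_classes, X, Y, Z) logits
```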
arXiv Detail & Related papers (2023-03-16T17:59:08Z)