GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
- URL: http://arxiv.org/abs/2512.23180v1
- Date: Mon, 29 Dec 2025 03:40:05 GMT
- Title: GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
- Authors: Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, Hesheng Wang
- Abstract summary: We propose a novel unified Driving World Models framework based on 3D Gaussian scene representation. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM.
- Score: 27.566010743301675
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also providing contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub: https://github.com/dtc111111/GaussianDWM.
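The abstract describes the pipeline concretely enough to sketch two of its pieces: per-Gaussian language features used for early modality alignment together with the task-aware language-guided sampling that keeps only a compact set of 3D tokens for the LLM, and the dual-condition interface that combines a high-level language condition with a low-level image condition for generation. The snippets below are minimal illustrations of one plausible reading; every function name, tensor shape, fusion scheme, and hyperparameter (e.g. keep_k) is an assumption for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of task-aware language-guided Gaussian sampling.
# Shapes, the cosine-similarity ranking, and keep_k are assumptions.
import torch
import torch.nn.functional as F

def language_guided_sampling(gauss_xyz, gauss_lang_feat, task_text_feat, keep_k=4096):
    """Keep the Gaussians whose embedded language features best match the task prompt.

    gauss_xyz:       (N, 3) Gaussian centers
    gauss_lang_feat: (N, D) per-Gaussian language features (early modality alignment)
    task_text_feat:  (D,)   text embedding of the current task / instruction
    keep_k:          number of Gaussians kept as compact 3D tokens for the LLM
    """
    sim = F.cosine_similarity(gauss_lang_feat, task_text_feat.unsqueeze(0), dim=-1)  # (N,)
    keep = torch.topk(sim, k=min(keep_k, sim.numel())).indices
    return gauss_xyz[keep], gauss_lang_feat[keep]

# Example: reduce 100k language-embedded Gaussians to 4,096 task-relevant tokens.
xyz = torch.randn(100_000, 3)
lang = F.normalize(torch.randn(100_000, 512), dim=-1)
task = F.normalize(torch.randn(512), dim=-1)
tokens_xyz, tokens_feat = language_guided_sampling(xyz, lang, task)
```

Along the same lines, a plain concatenate-and-project module stands in below for however the paper actually fuses the high-level language condition with the low-level image condition:

```python
# Hypothetical dual-condition fusion module; all dimensions and the
# concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class DualConditionEncoder(nn.Module):
    """Fuse a high-level language condition with a low-level image condition
    into a single conditioning vector for the generative model."""

    def __init__(self, lang_dim=512, img_dim=256, cond_dim=768):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, cond_dim)  # from the vision-language model
        self.img_proj = nn.Linear(img_dim, cond_dim)    # from the conditioning image
        self.fuse = nn.Linear(2 * cond_dim, cond_dim)

    def forward(self, lang_feat, img_feat):
        # lang_feat: (B, lang_dim), img_feat: (B, img_dim)
        fused = torch.cat([self.lang_proj(lang_feat), self.img_proj(img_feat)], dim=-1)
        return self.fuse(fused)  # (B, cond_dim) conditioning vector for the generator
```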
Related papers
- Unified Semantic Transformer for 3D Scene Understanding [55.415468022487005]
We introduce UNITE, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models.
arXiv Detail & Related papers (2025-12-16T12:49:35Z)
- A Study of the Framework and Real-World Applications of Language Embedding for 3D Scene Understanding [2.934809709666499]
Gaussian Splatting has emerged as a transformative technique for real-time 3D scene representation. Its ability to render complex scenes with high fidelity has enabled progress across domains such as scene reconstruction, robotics, and interactive content creation. This survey presents a structured review of current research efforts that combine language guidance with 3D Gaussian Splatting.
arXiv Detail & Related papers (2025-08-07T06:33:08Z)
- Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians [27.19772539224761]
We introduce Can3Tok, the first 3D scene-level variational autoencoder capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding. We propose a general pipeline for 3D scene data processing to address the scale inconsistency issue.
arXiv Detail & Related papers (2025-08-02T18:43:45Z)
- GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond [56.677984098204696]
Multimodal language models are driving the development of 3D Vision-Language Models (VLMs). We propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. We present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images.
arXiv Detail & Related papers (2025-07-01T15:52:59Z)
- Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation [51.36926306499593]
Prometheus is a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm.
arXiv Detail & Related papers (2024-12-30T17:44:23Z)
- DreamScape: 3D Scene Creation via Gaussian Splatting joint Correlation Modeling [23.06464506261766]
We present DreamScape, a method for generating 3D scenes from text. We use a 3D Gaussian Guide that encodes semantic primitives, spatial transformations, and relationships from text using LLMs. DreamScape achieves state-of-the-art performance, enabling high-fidelity, controllable 3D scene generation.
arXiv Detail & Related papers (2024-04-14T12:13:07Z)
- 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation [51.64796781728106]
We propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior of the 2D diffusion model and the global 3D information of the current scene.
Our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
arXiv Detail & Related papers (2024-03-14T14:31:22Z)
- JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues [68.76032126906743]
We introduce JM3D, a comprehensive approach integrating point cloud, text, and image.
Key contributions include the Structured Multimodal Organizer (SMO), enriching vision-language representation with multiple views and hierarchical text.
Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning.
arXiv Detail & Related papers (2023-10-14T06:13:20Z)
- Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text.
This insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)