LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
- URL: http://arxiv.org/abs/2507.02813v1
- Date: Thu, 03 Jul 2025 17:21:23 GMT
- Title: LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
- Authors: Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan
- Abstract summary: LangScene-X is a novel generative framework that unifies and generates 3D-consistent multi-modality information. Powered by the generative capability of creating more consistent novel observations, it builds generalizable 3D language-embedded scenes from only sparse views.
- Score: 21.828264119753502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.
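The abstract describes the Language Quantized Compressor (LQC) only at a high level: a module, trained on large-scale image data, that compresses language embeddings so the scene representation generalizes without per-scene retraining. As a rough illustration of that idea, the following is a minimal sketch assuming a vector-quantized autoencoder over per-pixel CLIP-style features; the layer sizes, codebook size, loss weights, and class/function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a "Language Quantized Compressor": a vector-quantized autoencoder
# that compresses high-dimensional language (e.g., CLIP) features into compact codes.
# All hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageQuantizedCompressor(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=8, codebook_size=1024, beta=0.25):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim)
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.beta = beta  # commitment loss weight

    def forward(self, feats):
        z = self.encoder(feats)                           # (N, latent_dim)
        # Nearest codebook entry for each latent vector.
        dists = torch.cdist(z, self.codebook.weight)      # (N, codebook_size)
        idx = dists.argmin(dim=-1)
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients reach the encoder.
        z_st = z + (z_q - z).detach()
        recon = self.decoder(z_st)
        loss = (
            F.mse_loss(recon, feats)                      # reconstruction
            + F.mse_loss(z_q, z.detach())                 # codebook update
            + self.beta * F.mse_loss(z, z_q.detach())     # commitment
        )
        return recon, idx, loss

# Usage: compress a batch of per-pixel language features (512-dim assumed here).
lqc = LanguageQuantizedCompressor()
clip_feats = torch.randn(4096, 512)
recon, codes, loss = lqc(clip_feats)
```

Presumably the compact codes are what gets embedded in the 3D scene, with the decoder recovering full-dimensional language features at query time, but the abstract does not spell out these details.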
Related papers
- A Study of the Framework and Real-World Applications of Language Embedding for 3D Scene Understanding [0.48065284262360775]
Gaussian Splatting has emerged as a transformative technique for real-time 3D scene representation. Its ability to render complex scenes with high fidelity has enabled progress across domains such as scene reconstruction, robotics, and interactive content creation. This survey presents a structured review of current research efforts that combine language guidance with 3D Gaussian Splatting.
arXiv Detail & Related papers (2025-08-07T06:33:08Z)
- Taking Language Embedded 3D Gaussian Splatting into the Wild [6.550474097747006]
We propose a novel framework for open-vocabulary scene understanding from unconstrained photo collections. Specifically, we first render multiple appearance images from the same viewpoint, then extract multi-appearance CLIP features. We then propose a transient uncertainty-aware autoencoder, a multi-appearance language field 3DGS representation, and a post-ensemble strategy to effectively compress, learn, and fuse language features.
arXiv Detail & Related papers (2025-07-26T07:00:32Z)
- GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond [56.677984098204696]
Multimodal language models are driving the development of 3D Vision-Language Models (VLMs). We propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. We present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images.
arXiv Detail & Related papers (2025-07-01T15:52:59Z)
- Video Perception Models for 3D Scene Synthesis [109.5543506037003]
VIPScene is a novel framework that exploits the encoded commonsense knowledge of the 3D physical world in video generation models. VIPScene seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene.
arXiv Detail & Related papers (2025-06-25T16:40:17Z)
- Tackling View-Dependent Semantics in 3D Language Gaussian Splatting [80.88015191411714]
LaGa establishes cross-view semantic connections by decomposing the 3D scene into objects. It constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset.
arXiv Detail & Related papers (2025-05-30T16:06:32Z)
- PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding [8.72555461868951]
3D Gaussian Splatting (3DGS) has shown encouraging performance for open vocabulary scene understanding tasks. Previous methods cannot distinguish 3D instance-level information and usually predict a heatmap between the scene feature and a text query. We propose PanoGS, a novel and effective 3D panoptic open vocabulary scene understanding approach.
arXiv Detail & Related papers (2025-03-23T15:27:29Z)
- LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding [42.750252190275546]
LangSurf is a Language-Embedded Surface Field that aligns 3D language fields with the surface of objects. Our method is capable of segmenting objects in 3D space, thus boosting the effectiveness of our approach in instance recognition, removal, and editing.
arXiv Detail & Related papers (2024-12-23T15:12:20Z)
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes.
First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes.
Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z)
- SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z)
- Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding [2.517953665531978]
We introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks.
Our representation achieves the best visual quality and language querying accuracy across current language-embedded representations (a minimal sketch of such a language query follows this list).
arXiv Detail & Related papers (2023-11-30T11:50:07Z)
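Several of the entries above (Language Embedded 3D Gaussians, LangSurf, LaGa) and the main abstract's "open-ended language queries" share the same basic querying mechanism: compare rendered or per-point language features against a text embedding to obtain a relevancy map. The sketch below illustrates that generic step with a plain cosine score; the feature dimension, threshold, and the absence of any canonical-phrase normalization are simplifying assumptions, not any specific paper's procedure.

```python
# Generic open-vocabulary relevancy query over rendered language features.
import torch
import torch.nn.functional as F

def relevancy_map(lang_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """lang_feats: (H, W, D) per-pixel language features rendered from the scene.
    text_emb: (D,) embedding of the text query (in practice from a CLIP text encoder).
    Returns an (H, W) cosine-similarity relevancy map in [-1, 1]."""
    lang = F.normalize(lang_feats, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    return torch.einsum("hwd,d->hw", lang, text)

# Usage with stand-in tensors (real inputs would come from the rendered language field).
feats = torch.randn(240, 320, 512)
query = torch.randn(512)
mask = relevancy_map(feats, query) > 0.5   # threshold chosen arbitrarily for illustration
```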