Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding
- URL: http://arxiv.org/abs/2602.15734v1
- Date: Tue, 17 Feb 2026 17:10:13 GMT
- Title: Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding
- Authors: Guile Wu, David Huang, Bingbing Liu, Dongfeng Bai,
- Abstract summary: Existing 3D scene understanding methods mostly distill language features from 2D foundation models into 3D feature fields.<n>We propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry.<n>Our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
- Score: 22.218083641125137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing 3D open-vocabulary scene understanding methods mostly emphasize distilling language features from 2D foundation models into 3D feature fields, but largely overlook the synergy among scene appearance, semantics, and geometry. As a result, scene understanding often deviates from the underlying geometric structure of scenes and becomes decoupled from the reconstruction process. In this work, we propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry within a unified framework. Specifically, we use 3D sparse voxels as primitives and employ an appearance field, a density field, a feature field, and a confidence field to holistically represent a 3D scene. To promote synergy among the appearance, density, and feature fields, we construct a feature modulation module and distill language features from a 2D foundation model into our 3D scene model. In addition, we integrate geometric distillation into feature field distillation to transfer geometric knowledge from a geometry foundation model to our 3D scene representations via depth correlation regularization and pattern consistency regularization. These components work together to synergistically model the appearance, semantics, and geometry of the 3D scene within a unified framework. Extensive experiments demonstrate that our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
Related papers
- Unified Semantic Transformer for 3D Scene Understanding [55.415468022487005]
We introduce UNITE, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model.<n>Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry.<n>We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models.
arXiv Detail & Related papers (2025-12-16T12:49:35Z) - EA3D: Online Open-World 3D Object Extraction from Streaming Videos [55.48835711373918]
We present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction.<n>Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge.<n>A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding.
arXiv Detail & Related papers (2025-10-29T03:56:41Z) - IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction [82.53307702809606]
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions.<n>We propose InstanceGrounded Geometry Transformer (IGGT) to unify the knowledge for both spatial reconstruction and instance-level contextual understanding.
arXiv Detail & Related papers (2025-10-26T14:57:44Z) - WorldGrow: Generating Infinite 3D World [75.81531067447203]
We tackle the challenge of generating the infinitely extendable 3D world -- large, continuous environments with coherent geometry and realistic appearance.<n>We propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis.<n>Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity.
arXiv Detail & Related papers (2025-10-24T17:39:52Z) - 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation [17.294440057314812]
Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks.<n>We propose Geometric Distillation, a framework that injects human-inspired geometric cues into pretrained VLMs.<n>Our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs.
arXiv Detail & Related papers (2025-06-11T15:56:59Z) - Shape from Semantics: 3D Shape Generation from Multi-View Semantics [30.969299308083723]
Existing 3D reconstruction methods utilize guidances such as 2D images, 3D point clouds, shape contours and single semantics to recover the 3D surface.<n>We propose a novel 3D modeling task called Shape from Semantics'', which aims to create 3D models whose geometry and appearance are consistent with the given text semantics when viewed from different views.
arXiv Detail & Related papers (2025-02-01T07:51:59Z) - Textured Mesh Saliency: Bridging Geometry and Texture for Human Perception in 3D Graphics [50.23625950905638]
We present a new dataset for textured mesh saliency, created through an innovative eye-tracking experiment in a six degrees of freedom (6-DOF) VR environment.<n>Our proposed model predicts saliency maps for textured mesh surfaces by treating each triangular face as an individual unit and assigning a saliency density value to reflect the importance of each local surface region.
arXiv Detail & Related papers (2024-12-11T08:27:33Z) - Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics [16.446659867133977]
PartGS is a self-supervised part-aware reconstruction framework that integrates 2D Gaussians and superquadrics to parse objects and scenes into an interpretable decomposition.<n>Our approach demonstrates superior performance compared to state-of-the-art methods across extensive experiments on the DTU, ShapeNet, and real-world datasets.
arXiv Detail & Related papers (2024-08-20T12:30:37Z) - Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.<n>voxelization infers per-object occupancy probabilities at individual spatial locations.<n>Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z) - FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding [11.118857208538039]
We present Foundation Model Embedded Gaussian Splatting (S), which incorporates vision-language embeddings of foundation models into 3D Gaussian Splatting (GS)
Results demonstrate remarkable multi-view semantic consistency, facilitating diverse downstream tasks, beating state-of-the-art methods by 10.2 percent on open-vocabulary language-based object detection.
This research explores the intersection of vision, language, and 3D scene representation, paving the way for enhanced scene understanding in uncontrolled real-world environments.
arXiv Detail & Related papers (2024-01-03T20:39:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.