ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding
- URL: http://arxiv.org/abs/2512.03370v1
- Date: Wed, 03 Dec 2025 02:06:09 GMT
- Title: ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding
- Authors: Lingjun Zhao, Yandong Luo, James Hay, Lu Gan,
- Abstract summary: We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models.<n>Existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision.
- Score: 7.610505486431266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). Gaussian-based methods have demonstrated superior performance and computational efficiency across a wide range of scene understanding tasks. However, existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision, leading to degraded geometry and limited to camera-only settings. To fully exploit the potential of Gaussians, we propose a Multi-Modal Gaussian Transformer that enables Gaussians to query features from diverse sensor modalities, and a Shelf-Supervised Learning Paradigm that efficiently optimizes Gaussians with VFM features jointly at 2D image and 3D scene levels. We evaluate ShelfGaussian on various perception and planning tasks. Experiments on Occ3D-nuScenes demonstrate its state-of-the-art zero-shot semantic occupancy prediction performance. ShelfGaussian is further evaluated on an unmanned ground vehicle (UGV) to assess its in the-wild performance across diverse urban scenarios. Project website: https://lunarlab-gatech.github.io/ShelfGaussian/.
Related papers
- C3G: Learning Compact 3D Representations with 2K Gaussians [55.04010158339562]
Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding.<n>We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations.
arXiv Detail & Related papers (2025-12-03T17:59:05Z) - ODG: Occupancy Prediction Using Dual Gaussians [38.9869091446875]
Occupancy prediction infers fine-grained 3D geometry and semantics from camera images of the surrounding environment.<n>Existing methods either adopt dense grids as scene representation, or learn the entire scene using a single set of sparse queries.<n>We present ODG, a hierarchical dual sparse Gaussian representation to effectively capture complex scene dynamics.
arXiv Detail & Related papers (2025-06-11T06:03:03Z) - GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding [44.68350305790145]
GaussTR is a novel Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding.<n>Experiments on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot performance of 12.27 mIoU, along with a 40% reduction in training time.<n>These results highlight the efficacy of GaussTR for scalable and holistic 3D spatial understanding, with promising implications in autonomous driving and embodied agents.
arXiv Detail & Related papers (2024-12-17T18:59:46Z) - GaussianAD: Gaussian-Centric End-to-End Autonomous Driving [23.71316979650116]
Vision-based autonomous driving shows great potential due to its satisfactory performance and low costs.<n>Most existing methods adopt dense representations (e.g., bird's eye view) or sparse representations (e.g., instance boxes) for decision-making.<n>This paper explores a Gaussian-centric end-to-end autonomous driving framework and exploits 3D semantic Gaussians to extensively yet sparsely describe the scene.
arXiv Detail & Related papers (2024-12-13T18:59:30Z) - GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction [55.60972844777044]
3D semantic occupancy prediction is an important task for robust vision-centric autonomous driving.<n>Most existing methods leverage dense grid-based scene representations, overlooking the spatial sparsity of the driving scenes.<n>We propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied.
arXiv Detail & Related papers (2024-12-05T17:59:58Z) - NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model [57.92709692193132]
NovelGS is a diffusion model for Gaussian Splatting given sparse-view images.
We leverage the novel view denoising through a transformer-based network to generate 3D Gaussians.
arXiv Detail & Related papers (2024-11-25T07:57:17Z) - Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos [58.22272760132996]
We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained.
We propose Dynamic Gaussian Marbles, which consist of three core modifications that target the difficulties of the monocular setting.
We evaluate on the Nvidia Dynamic Scenes dataset and the DyCheck iPhone dataset, and show that Gaussian Marbles significantly outperforms other Gaussian baselines in quality.
arXiv Detail & Related papers (2024-06-26T19:37:07Z) - GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction [70.65250036489128]
3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene.
We propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians.
GaussianFormer achieves comparable performance with state-of-the-art methods with only 17.8% - 24.8% of their memory consumption.
arXiv Detail & Related papers (2024-05-27T17:59:51Z) - 3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting [58.95801720309658]
In this paper, we present an implicit surface reconstruction method with 3D Gaussian Splatting (3DGS), namely 3DGSR.<n>The key insight is incorporating an implicit signed distance field (SDF) within 3D Gaussians to enable them to be aligned and jointly optimized.<n>Our experimental results demonstrate that our 3DGSR method enables high-quality 3D surface reconstruction while preserving the efficiency and rendering quality of 3DGS.
arXiv Detail & Related papers (2024-03-30T16:35:38Z) - GIR: 3D Gaussian Inverse Rendering for Relightable Scene Factorization [62.13932669494098]
This paper presents a 3D Gaussian Inverse Rendering (GIR) method, employing 3D Gaussian representations to factorize the scene into material properties, light, and geometry.
We compute the normal of each 3D Gaussian using the shortest eigenvector, with a directional masking scheme forcing accurate normal estimation without external supervision.
We adopt an efficient voxel-based indirect illumination tracing scheme that stores direction-aware outgoing radiance in each 3D Gaussian to disentangle secondary illumination for approximating multi-bounce light transport.
arXiv Detail & Related papers (2023-12-08T16:05:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.