Related papers: Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval

URL: http://arxiv.org/abs/2509.15871v1
Date: Fri, 19 Sep 2025 11:11:36 GMT
Title: Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval
Authors: Liwei Liao, Xufeng Li, Xiaoyun Zheng, Boning Liu, Feng Gao, Ronggang Wang
Abstract summary: 3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts. We propose Grounding via View Retrieval (GVR) to transform 3DVG into a 2D retrieval task. Our method achieves state-of-the-art visual grounding performance while avoiding per-scene training.
Score: 30.111912463361275
License: http://creativecommons.org/licenses/by/4.0/
Abstract: 3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods encounter two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require large amounts of labeled data for effective training. To this end, we propose Grounding via View Retrieval (GVR), a novel zero-shot visual grounding framework for 3DGS that transforms 3DVG into a 2D retrieval task. GVR leverages object-level view retrieval to collect grounding clues from multiple views, which not only avoids the costly process of 3D annotation but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found at https://github.com/leviome/GVR_demos.

Related papers

Sparse Multiview Open-Vocabulary 3D Detection [27.57172918603858]
3D object detection has traditionally been solved by training to detect a fixed set of categories. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion.
arXiv Detail & Related papers (2025-09-19T12:22:24Z)
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness [73.72335146374543]
We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks.
arXiv Detail & Related papers (2025-04-02T16:59:55Z)
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining [100.23919762298227]
Currently, all existing methods rely on 2D or textual modalities during training or together at inference. We introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates on 3DGS. We propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes.
arXiv Detail & Related papers (2025-03-23T12:50:25Z)
From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs [64.28181017898369]
LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views. LIFT-GS achieves state-of-the-art results with 25.7% mAP on open-vocabulary instance segmentation. Remarkably, pretraining effectively multiplies fine-tuning datasets by 2X, demonstrating strong scaling properties.
arXiv Detail & Related papers (2025-02-27T18:59:11Z)
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding [10.81711535075112]
3D Visual Grounding aims to locate objects in 3D scenes based on textual descriptions, essential for applications like augmented reality and robotics. We introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data. SeeGround represents 3D scenes as a hybrid of query-aligned rendered images and spatially enriched text descriptions, bridging the gap between 3D data and 2D-VLMs input formats.
arXiv Detail & Related papers (2024-12-05T17:58:43Z)
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a description in natural language. We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding [23.672405624011873]
We propose a module to consolidate the 3D visual stream by 2D clues synthesized from point clouds. We empirically show their aptitude to boost the quality of the learned visual representations. Our proposed module, dubbed as Look Around and Refer (LAR), significantly outperforms the state-of-the-art 3D visual grounding techniques on three benchmarks.
arXiv Detail & Related papers (2022-11-25T17:12:08Z)
Learning 3D Scene Priors with 2D Supervision [37.79852635415233]
We propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth. Our method represents a 3D scene as a latent vector, from which we can progressively decode to a sequence of objects characterized by their class categories. Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction.
arXiv Detail & Related papers (2022-11-25T15:03:32Z)
Multi-View Transformer for 3D Visual Grounding [64.30493173825234]
We propose a Multi-View Transformer (MVT) for 3D visual grounding. We project the 3D scene to a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated together.
arXiv Detail & Related papers (2022-04-05T12:59:43Z)
3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images. First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training. Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration. Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.