View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs
- URL: http://arxiv.org/abs/2512.09215v1
- Date: Wed, 10 Dec 2025 00:59:17 GMT
- Title: View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs
- Authors: Yuanyuan Liu, Haiyang Mei, Dongyang Zhan, Jiayue Zhao, Dongsheng Zhou, Bo Dong, Xin Yang
- Abstract summary: 3D visual grounding identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing. We propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning.
- Score: 19.27758108925572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM's reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.
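The abstract describes the VoG agent only at a high level. As a minimal, hypothetical sketch of the general idea of a VLM acting as an agent over a multi-modal scene graph, retrieving only local cues at each step, the block below assumes a simple node structure, a `vlm` callable, and a particular return format; none of these names or interfaces come from the paper.

```python
# Hypothetical sketch of agentic scene-graph traversal for zero-shot 3DVG.
# Data structures and the VLM interface are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class SceneGraphNode:
    node_id: int
    label: str                                     # semantic layer, e.g. "chair"
    bbox_3d: tuple                                  # (x, y, z, w, h, d) in scene coordinates
    image_crop: str                                 # path to the best 2D view of this object
    neighbors: list = field(default_factory=list)   # (relation, neighbor_id) edges

def ground(query, graph, vlm, start_id, max_steps=10):
    """Let a VLM walk the scene graph, fetching only the local cues it needs."""
    current, trace = start_id, []
    for _ in range(max_steps):
        node = graph[current]
        # Expose only the local neighborhood instead of the whole cluttered scene.
        context = {
            "label": node.label,
            "bbox_3d": node.bbox_3d,
            "neighbors": [(rel, graph[i].label, i) for rel, i in node.neighbors],
        }
        # `vlm` is assumed to return a dict like
        # {"action": "move" | "select", "node_id": int, "reason": str}.
        step = vlm(query=query, image=node.image_crop, context=context)
        trace.append(step["reason"])                # step-by-step interpretable trace
        if step["action"] == "select":
            return step["node_id"], trace
        current = step["node_id"]
    return current, trace
```

The point of the sketch is the access pattern: at each step the model sees one node's view and its immediate neighbors rather than the entire scene, which mirrors the abstract's "VLM x SI" notion of incrementally retrieving spatial information and yields a reasoning trace as a by-product.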
Related papers
- HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model [14.277165215664425]
Large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. We propose a novel hierarchical multimodal representation for 3D scene reasoning.
arXiv Detail & Related papers (2025-11-28T08:06:20Z)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models [0.0]
ZING-3D is a framework that generates a rich semantic representation of a 3D scene in a zero-shot manner. It also enables incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our experiments on scenes from the Replica and HM3D datasets show that ZING-3D is effective at capturing spatial and relational knowledge without the need for task-specific training.
arXiv Detail & Related papers (2025-10-24T00:52:33Z)
- SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding [40.60812160987424]
3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. We propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively.
arXiv Detail & Related papers (2025-08-28T13:15:37Z)
- Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
- NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation [14.046423852723615]
We introduce a novel 3D Gaussian Splatting based hard visual prompting approach to generate diverse viewpoints around target objects. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts. This training-free strategy integrates seamlessly with prior hard visual prompts, enriching object-descriptive features.
arXiv Detail & Related papers (2025-04-20T14:39:27Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [91.94869042117621]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models [39.488763757826426]
2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results. We propose a vision-based solution inspired by human perception, which relies solely on visual cues for 3D spatial understanding.
arXiv Detail & Related papers (2025-01-02T18:59:59Z)
- LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models [57.92316645992816]
Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs). We demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.
arXiv Detail & Related papers (2024-12-03T06:15:04Z)
- VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding [57.04804711488706]
3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding.
We present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images.
arXiv Detail & Related papers (2024-10-17T17:59:55Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding [57.64806066986975]
3D Visual Grounding aims at localizing 3D objects based on textual descriptions.
We propose a novel visual programming approach for zero-shot open-vocabulary 3DVG.
arXiv Detail & Related papers (2023-11-26T19:01:14Z)
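The visual-programming entry above is summarized in a single sentence. As a rough, hypothetical illustration of the general visual-programming idea (translating a grounding query into a small program over reusable modules), the toy sketch below uses invented module names and a made-up scene; it is not the paper's actual program set or API.

```python
# Hypothetical, minimal mock-up of a visual-programming style 3DVG pipeline.
# Module names, program structure, and the toy scene are illustrative assumptions.
import math

def dist(a, b):
    """Euclidean distance between two 3D object centers."""
    return math.dist(a["center"], b["center"])

def locate(objects, category):
    """LOC module: candidate objects whose label matches the queried category."""
    return [o for o in objects if o["label"] == category]

def nearest_to(candidates, anchors):
    """REL module: the candidate closest to any anchor object."""
    return min(candidates, key=lambda c: min(dist(c, a) for a in anchors))

def ground(objects):
    """'the chair near the window' as a small program composed of the modules above."""
    chairs = locate(objects, "chair")
    windows = locate(objects, "window")
    return nearest_to(chairs, windows)

scene = [
    {"label": "chair",  "center": (1.0, 0.5, 0.0)},
    {"label": "chair",  "center": (4.0, 2.0, 0.0)},
    {"label": "window", "center": (0.8, 0.7, 1.2)},
]
print(ground(scene)["center"])   # -> (1.0, 0.5, 0.0)
```

In a real visual-programming system the composition (which modules to call, in what order) would typically be generated from the query by a language model; here it is hard-coded for one example query purely for illustration.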