A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding
- URL: http://arxiv.org/abs/2507.06719v1
- Date: Wed, 09 Jul 2025 10:20:38 GMT
- Title: A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding
- Authors: Zhenyang Liu, Sixiao Zheng, Siyu Chen, Cairong Zhao, Longfei Liang, Xiangyang Xue, Yanwei Fu
- Abstract summary: Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries. Existing language field methods struggle to accurately localize instances using spatial relations in language queries. We propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
- Score: 78.99798110890157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as "the book on the chair." This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query. Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability.
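The abstract describes a two-stage pipeline: a fine-tuned LLM parses the free-form query into target, anchor, and spatial-relation instructions, and a hierarchical feature field built from distilled CLIP features and SAM masks is then queried with those instructions. The Python sketch below is a minimal, hypothetical illustration of that flow, not the paper's implementation: `decompose_query`, the hard-coded parse, the flat instance list, and the geometric test for the "on" relation are all assumptions made for illustration.

```python
# Hypothetical sketch of the query-decomposition + hierarchical-lookup flow
# described in the abstract. All names and heuristics here are illustrative;
# SpatialReasoner fine-tunes an LLM for parsing and queries a neural feature
# field rather than a flat instance list.
from dataclasses import dataclass
import numpy as np


@dataclass
class Instance:
    clip_feature: np.ndarray  # distilled CLIP feature of the instance, shape (D,)
    centroid: np.ndarray      # 3D centroid of the instance, shape (3,)


def decompose_query(query: str) -> dict:
    """Stand-in for the fine-tuned LLM that infers target/anchor/relation
    instructions from a spatial query such as 'the book on the chair'."""
    # A real system would prompt the LLM; one example is hard-coded here.
    return {"target": "book", "anchor": "chair", "relation": "on"}


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def ground(query: str, instances: list[Instance], text_embed) -> Instance:
    """Match anchor and target by CLIP similarity, then filter target
    candidates with the inferred spatial relation."""
    parsed = decompose_query(query)
    anchor_feat = text_embed(parsed["anchor"])
    target_feat = text_embed(parsed["target"])
    anchor = max(instances, key=lambda i: cosine(i.clip_feature, anchor_feat))
    if parsed["relation"] == "on":
        # Toy geometric test: candidates above and horizontally close to the anchor.
        candidates = [i for i in instances if i is not anchor
                      and i.centroid[2] > anchor.centroid[2]
                      and np.linalg.norm(i.centroid[:2] - anchor.centroid[:2]) < 0.5]
    else:
        candidates = [i for i in instances if i is not anchor]
    candidates = candidates or instances  # fall back if the relation filters everything
    return max(candidates, key=lambda i: cosine(i.clip_feature, target_feat))
```

In the paper's setting, `text_embed` would correspond to the CLIP text encoder, and candidate lookup would be performed against the visual properties-enhanced hierarchical feature field rather than precomputed per-instance features.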
Related papers
- SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z)
- ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning [68.4209681278336]
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions. Current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals. We propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping.
arXiv Detail & Related papers (2025-03-30T03:40:35Z)
- Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models in both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z)
- 3D Spatial Understanding in MLLMs: Disambiguation and Evaluation [13.614206918726314]
We propose techniques to enhance the model's ability to localize and disambiguate target objects. Our approach achieves state-of-the-art performance on conventional metrics that evaluate sentence similarity.
arXiv Detail & Related papers (2024-12-09T16:04:32Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- LERF: Language Embedded Radiance Fields [35.925752853115476]
Language Embedded Radiance Fields (LERF) is a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF.
LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays.
After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real-time.
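As a minimal sketch of the mechanism summarized above, the code below alpha-composites per-sample feature embeddings along a ray with standard NeRF volume-rendering weights and scores the result against a text embedding. The function and variable names, and the raw-array inputs, are assumptions made for illustration; LERF itself renders multi-scale CLIP embeddings inside a trained NeRF and normalizes its relevancy scores against canonical negative phrases, which this sketch omits.

```python
# Minimal, hypothetical sketch of volume-rendering feature embeddings along a
# ray, in the spirit of the LERF summary above; names and inputs are assumed.
import numpy as np


def render_feature_along_ray(densities: np.ndarray,  # (N,) per-sample density sigma
                             deltas: np.ndarray,     # (N,) distances between samples
                             features: np.ndarray    # (N, D) per-sample embeddings
                             ) -> np.ndarray:
    """Alpha-composite per-sample embeddings with standard NeRF weights."""
    alphas = 1.0 - np.exp(-densities * deltas)                        # (N,)
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = alphas * transmittance                                  # (N,)
    return (weights[:, None] * features).sum(axis=0)                  # (D,)


def relevancy(rendered: np.ndarray, text_embedding: np.ndarray) -> float:
    """Cosine similarity between a rendered embedding and a CLIP text embedding."""
    return float(rendered @ text_embedding /
                 (np.linalg.norm(rendered) * np.linalg.norm(text_embedding) + 1e-8))
```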
arXiv Detail & Related papers (2023-03-16T17:59:20Z)
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.