Fugu-MT Paper Translation (Summary): A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding

Paper summary: A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding

arxiv url: http://arxiv.org/abs/2507.06719v1
Date: Wed, 09 Jul 2025 10:20:38 GMT
Status: Translation complete
System update date: 2025-07-10 17:37:43.548761
Title: A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding
Authors: Zhenyang Liu, Sixiao Zheng, Siyu Chen, Cairong Zhao, Longfei Liang, Xiangyang Xue, Yanwei Fu
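Abstract summary: Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries. Existing language field methods struggle to accurately localize instances using spatial relations in language queries. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
Reference score (attention score computed independently by this site): 78.99798110890157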
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as "the book on the chair." This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query. Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability.
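Illustrative sketch (not from the paper): the abstract describes two steps, inferring target, anchor, and spatial-relation instructions from the query, then querying the scene hierarchically with them. The toy Python below mimics only that control flow. Everything in it is an assumption made for illustration: a regular expression stands in for the fine-tuned LLM, plain string labels stand in for distilled CLIP features, hand-placed 3D centers stand in for the hierarchical feature field, and the names `parse_query`, `satisfies`, `ground`, and `SCENE` are hypothetical.

```python
import math
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instruction:
    target: str    # object to localize, e.g. "book"
    anchor: str    # reference object, e.g. "chair"
    relation: str  # spatial relation linking the two, e.g. "on"


@dataclass
class Instance:
    label: str                          # stand-in for a language feature
    center: Tuple[float, float, float]  # (x, y, z) with z pointing up


def parse_query(query: str) -> Instruction:
    """Toy stand-in for the LLM: split 'the X <relation> the Y'."""
    pattern = (r"^(?:the\s+)?(?P<target>.+?)\s+(?P<relation>on|under|near)"
               r"\s+(?:the\s+)?(?P<anchor>.+)$")
    match = re.match(pattern, query.strip().lower())
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    return Instruction(match["target"], match["anchor"], match["relation"])


def satisfies(relation: str, target: Instance, anchor: Instance,
              max_dist: float = 1.0) -> bool:
    """Assumed geometric rules; the paper does not specify these."""
    dx, dy, dz = (t - a for t, a in zip(target.center, anchor.center))
    horizontal = math.hypot(dx, dy)
    if relation == "on":
        return dz > 0 and horizontal <= max_dist
    if relation == "under":
        return dz < 0 and horizontal <= max_dist
    if relation == "near":
        return math.dist(target.center, anchor.center) <= max_dist
    raise ValueError(f"unknown relation: {relation!r}")


def ground(query: str, scene: List[Instance]) -> Optional[Instance]:
    """Find anchors first, then the closest target satisfying the relation."""
    inst = parse_query(query)
    anchors = [o for o in scene if o.label == inst.anchor]
    candidates = [o for o in scene if o.label == inst.target]
    best, best_dist = None, math.inf
    for cand in candidates:
        for anchor in anchors:
            if satisfies(inst.relation, cand, anchor):
                dist = math.dist(cand.center, anchor.center)
                if dist < best_dist:
                    best, best_dist = cand, dist
    return best


# Hand-made scene: one book rests on the chair, another on the table.
SCENE = [
    Instance("chair", (0.0, 0.0, 0.5)),
    Instance("book", (0.1, 0.0, 0.9)),
    Instance("table", (3.0, 2.0, 0.7)),
    Instance("book", (3.0, 2.0, 0.8)),
]

result = ground("the book on the chair", SCENE)
```

Matching on labels alone cannot tell the two books apart; it is the anchor and relation that pick out the one near the chair, which is the kind of disambiguation the abstract says existing language field methods struggle with.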

Related papers