Fugu-MT Paper Translation (Summary): Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

Paper summary: Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

arxiv url: http://arxiv.org/abs/2604.00528v2
Date: Thu, 02 Apr 2026 06:20:11 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:09.374417
Title: Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
Title (reference translation): Think, Act, Build, an agentic framework equipped with vision-language models for zero-shot 3D visual grounding
Authors: Haibo Wang, Zihao Lin, Zhiyang Xu, Lifu Huang
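Abstract summary: 3D Visual Grounding aims to localize objects in 3D scenes via natural language descriptions. The authors propose "Think, Act, Build (TAB)", a 2D-to-3D reconstruction paradigm that operates directly on raw RGB-D streams. To overcome the multi-view coverage deficit caused by strict VLM semantic tracking, they introduce Semantic-Anchored Geometric Expansion.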
Reference score (independently computed attention score): 34.1504914582344
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce the Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then leverages multi-view geometry to propagate its spatial location across unobserved frames. This enables the agent to "Build" the target's 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws such as reference ambiguity and category errors in existing benchmarks and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.
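Abstract (reference translation): 3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes through natural language. Recent advances that leverage Vision-Language Models (VLMs) have explored zero-shot possibilities, but they typically suffer from a static workflow that depends on preprocessed 3D point clouds, which essentially reduces grounding to proposal matching. To avoid this dependence, the core motivation is to decouple the task: 2D VLMs resolve complex spatial semantics, while deterministic multi-view geometry instantiates the 3D structure. The paper proposes "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, the VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Importantly, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, the paper introduces Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then uses multi-view geometry to propagate its spatial location across unobserved frames. This lets the agent "Build" the target's 3D representation by aggregating these multi-view features via camera parameters, mapping 2D visual cues directly to 3D coordinates. In addition, to ensure rigorous evaluation, the authors identify flaws in existing benchmarks, such as reference ambiguity and category errors, and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D show that the framework, which relies entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.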
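
The abstract describes two geometric steps: lifting 2D cues to 3D coordinates via camera parameters ("Build"), and propagating the target's location into frames where it was not tracked. The sketch below illustrates the standard multi-view geometry behind such steps, assuming a pinhole camera with known intrinsics and rigid camera-to-world poses. It is not the authors' implementation; every function name and numeric value here is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
Pose = Sequence[Sequence[float]]  # 4x4 rigid transform, row-major (assumed convention)


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Lift pixel (u, v) with metric depth into camera coordinates (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(pose: Pose, p: Point3) -> Point3:
    """Apply a 4x4 rigid transform to a 3D point."""
    x, y, z = (pose[r][0] * p[0] + pose[r][1] * p[1] + pose[r][2] * p[2] + pose[r][3]
               for r in range(3))
    return (x, y, z)


def invert_rigid(pose: Pose) -> List[List[float]]:
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    rt = [[pose[c][r] for c in range(3)] for r in range(3)]
    t = [pose[r][3] for r in range(3)]
    inv = [rt[r] + [-sum(rt[r][k] * t[k] for k in range(3))] for r in range(3)]
    inv.append([0.0, 0.0, 0.0, 1.0])
    return inv


def project(p_world: Point3, cam_to_world: Pose,
            fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Project a world point into another frame; None if it lies behind the camera."""
    x, y, z = transform(invert_rigid(cam_to_world), p_world)
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def bounding_box(points: Sequence[Point3]) -> Tuple[Point3, Point3]:
    """Axis-aligned 3D box enclosing the aggregated world points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# Hypothetical intrinsics and poses, for illustration only.
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
pose_ref = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
pose_new = [[1.0, 0.0, 0.0, 1.5], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

# "Build": lift two masked pixels from the reference frame into world space.
world_pts = [transform(pose_ref, backproject(u, v, d, **K))
             for (u, v, d) in [(320.0, 240.0, 2.0), (420.0, 240.0, 2.0)]]
box_min, box_max = bounding_box(world_pts)

# Propagation: find where the first anchored point lands in an untracked frame.
pixel_in_new_frame = project(world_pts[0], pose_new, **K)
```

In a real pipeline the pixels would come from a segmentation mask and the depth from the RGB-D stream; the reprojected pixel could then seed a mask in the new frame, whose depth values are lifted in turn to widen the target's 3D coverage.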

Related papers