Fugu-MT Paper Translation (Abstract): AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

Paper summary: AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

arxiv url: http://arxiv.org/abs/2605.25901v1
Date: Mon, 25 May 2026 14:29:04 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:20.332297
Title: AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models
Title (reference translation): AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding Using Multimodal Language Models
Authors: Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin
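Abstract summary: 3D visual grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. AgentGrounder is a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training.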
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: 3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
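The two-stage design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `ObjectEntry` class, the `retrieve` and `nearest_to_anchor` helpers, the exact-label matching, and the center-distance score are all hypothetical stand-ins chosen for illustration. The sketch assumes only what the abstract states, namely that each OLT entry holds an instance ID, a semantic label, and a 3D bounding box.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class ObjectEntry:
    """One hypothetical OLT row: instance ID, semantic label, axis-aligned 3D box."""
    instance_id: int
    label: str
    box_min: Point
    box_max: Point

    @property
    def center(self) -> Point:
        # Midpoint of the box along each axis.
        return tuple((lo + hi) / 2 for lo, hi in zip(self.box_min, self.box_max))


def retrieve(olt: List[ObjectEntry], label: str) -> List[ObjectEntry]:
    """Return only the entries whose label matches (assumed exact-match retrieval)."""
    return [entry for entry in olt if entry.label == label]


def nearest_to_anchor(candidates: List[ObjectEntry], anchor: ObjectEntry) -> ObjectEntry:
    """Assumed geometric score: the candidate whose center is closest to the anchor's."""
    return min(candidates, key=lambda c: math.dist(c.center, anchor.center))


# A toy scene; all coordinates are made up for illustration.
olt = [
    ObjectEntry(0, "chair", (0.0, 0.0, 0.0), (0.5, 0.5, 1.0)),
    ObjectEntry(1, "chair", (3.0, 0.0, 0.0), (3.5, 0.5, 1.0)),
    ObjectEntry(2, "table", (0.6, 0.0, 0.0), (1.6, 1.0, 0.8)),
    ObjectEntry(3, "lamp", (5.0, 5.0, 0.0), (5.2, 5.2, 1.5)),
]

# Query: "the chair next to the table", decomposed by hand into target and anchor.
targets = retrieve(olt, "chair")
anchor = retrieve(olt, "table")[0]
best = nearest_to_anchor(targets, anchor)
print(best.instance_id)
```

Only the chair and table entries are touched for this query, which mirrors the abstract's point about keeping irrelevant objects (here, the lamp) out of the prompt. On-demand image rendering for color or material cues is not modeled in this sketch.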
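Abstract (reference translation): 3D visual grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods make use of 2D vision-language models (LVLMs). However, they often depend on existing sets of multi-view images and struggle with the limited semantic and spatial detail provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. The approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only the relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D in a zero-shot setting and observe consistent improvements over SeeGround, including +2.5% Acc@0.5 on ScanRefer, +6.3% on Nr3D, and +6.3% on Nr3D view-independent queries. These results suggest that combining selective retrieval, geometric reasoning, and adaptive visual inspection provides a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.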
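For readers unfamiliar with the Acc@0.5 metric quoted above, the following sketch shows one common way such a score is computed: the fraction of predicted boxes whose 3D IoU with the ground-truth box reaches the threshold. It assumes axis-aligned boxes given as (min corner, max corner) pairs and a "greater than or equal" comparison; the function names are hypothetical and the toy boxes are made up, so this is not the benchmarks' official evaluation code.

```python
from typing import List, Sequence, Tuple

Box = Tuple[Sequence[float], Sequence[float]]  # (min corner, max corner)


def box_volume(box_min: Sequence[float], box_max: Sequence[float]) -> float:
    """Volume of an axis-aligned box; zero if the box is empty along any axis."""
    volume = 1.0
    for lo, hi in zip(box_min, box_max):
        volume *= max(0.0, hi - lo)
    return volume


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter_min = [max(x, y) for x, y in zip(a[0], b[0])]
    inter_max = [min(x, y) for x, y in zip(a[1], b[1])]
    inter = box_volume(inter_min, inter_max)
    union = box_volume(*a) + box_volume(*b) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: List[Box], gts: List[Box], threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired ground truth is >= threshold."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


# Toy data: the first prediction is exact, the second is shifted by half a box width.
preds = [((0, 0, 0), (1, 1, 1)), ((0, 0, 0), (1, 1, 1))]
gts = [((0, 0, 0), (1, 1, 1)), ((0.5, 0, 0), (1.5, 1, 1))]

acc_50 = accuracy_at_iou(preds, gts, threshold=0.5)
acc_25 = accuracy_at_iou(preds, gts, threshold=0.25)
print(acc_50, acc_25)
```

The shifted prediction overlaps its ground truth with an IoU of one third, so it counts as a hit at a 0.25 threshold but a miss at 0.5, which is why the stricter threshold gives the lower score on the same predictions.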

Related papers