Fugu-MT Paper Translation (Abstract): MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

Paper summary: MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

arxiv url: http://arxiv.org/abs/2604.09167v1
Date: Fri, 10 Apr 2026 09:51:42 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.811321
Title: MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
Title (reference translation): MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
Authors: Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu, Gao Huang
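Abstract summary: We present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf vision-language models. We propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs.
Reference score (independently computed attention score): 25.15914325538431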
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.
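Abstract (reference translation): Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, but grounded reasoning in 3D scenes remains underexplored. To answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene and then reason about their spatial and geometric relationships. Recent approaches have shown strong potential for grounded 3D reasoning. However, they often depend on in-domain tuning or hand-crafted reasoning pipelines, which limits their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that performs flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible, training-free grounded 3D reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.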
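
The abstract describes a coding agent that reasons about geometry and verifies results through executable programs. As a rough illustration only, the Python sketch below shows the kind of small program such an agent might produce for a query like "which chair is closest to the table?". Everything in it is an assumption made for illustration and none of it comes from the paper: the `Box3D` type, the label-matching `ground` function (a stand-in for a VLM-based grounding agent), and the `closest_to` helper are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box3D:
    """Hypothetical grounding result: a labeled, axis-aligned 3D box."""
    label: str
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]


def ground(scene: List[Box3D], phrase: str) -> List[Box3D]:
    """Toy stand-in for a grounding agent: exact label match only."""
    return [box for box in scene if box.label == phrase]


def center_distance(a: Box3D, b: Box3D) -> float:
    """Euclidean distance between two box centers."""
    return math.dist(a.center, b.center)


def closest_to(scene: List[Box3D], target: str, anchor: str) -> Optional[Box3D]:
    """Return the `target` object nearest to the single `anchor` object.

    The explicit check below is a simple form of verification: if the
    anchor is missing or ambiguous, or no target was grounded, the
    program returns None instead of guessing.
    """
    anchors = ground(scene, anchor)
    targets = ground(scene, target)
    if len(anchors) != 1 or not targets:
        return None
    return min(targets, key=lambda t: center_distance(t, anchors[0]))


# Hypothetical scene with two chairs and one table (units are arbitrary).
scene = [
    Box3D("chair", (1.0, 0.0, 0.0), (0.5, 0.5, 1.0)),
    Box3D("chair", (4.0, 3.0, 0.0), (0.5, 0.5, 1.0)),
    Box3D("table", (0.0, 0.0, 0.0), (1.5, 1.0, 0.8)),
]

answer = closest_to(scene, "chair", "table")
print(answer.center if answer else "grounding failed")
```

Returning `None` when grounding is missing or ambiguous is one simple way to make failure explicit, so a planning step could decide to re-ground or rephrase the sub-task rather than accept an unreliable answer.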

Related papers