Fugu-MT Paper Translation (Abstract): Pursuing Minimal Sufficiency in Spatial Reasoning

Paper abstract: Pursuing Minimal Sufficiency in Spatial Reasoning

arxiv url: http://arxiv.org/abs/2510.16688v1
Date: Sun, 19 Oct 2025 02:29:09 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.09672
Title: Pursuing Minimal Sufficiency in Spatial Reasoning
Title (reference translation): Pursuing Minimal Sufficiency in Spatial Reasoning
Authors: Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, Ming-Hsuan Yang
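Abstract summary: Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models. The paper identifies two bottlenecks: inadequate 3D understanding stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. It introduces MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that builds a Minimal Sufficient Set of information before answering a question.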
Reference score (independently computed attention level): 42.564463357503875
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from expert models. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.
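Abstract (reference translation): Spatial reasoning, the grounding of language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities arising from 2D-centric pre-training, and reasoning failures caused by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from expert models. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information in pursuit of minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments show that, by explicitly pursuing both sufficiency and minimality, the method significantly improves accuracy and achieves state-of-the-art performance on two challenging benchmarks. In addition, the framework produces interpretable reasoning paths, offering a source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.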
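
The closed loop described in the abstract (perceive, prune redundant details, request missing ones, repeat until the set is both sufficient and minimal) can be illustrated with a toy sketch. This is not the authors' implementation: the function `curate_mss`, the fact names, and the fixed `needed` set are all hypothetical. In the paper the Reasoning Agent decides what is needed; here that decision is replaced by a hard-coded set purely to show the control flow.

```python
from typing import Callable, Dict, Set

# Hypothetical stand-in for a perception toolbox: fact name -> function that "perceives" it.
Toolbox = Dict[str, Callable[[], str]]


def curate_mss(toolbox: Toolbox, initial_request: Set[str],
               needed: Set[str], max_rounds: int = 5) -> Dict[str, str]:
    """Toy closed loop: gather requested facts, prune redundant ones,
    then request whatever is still missing, until nothing is missing."""
    facts: Dict[str, str] = {}
    request = set(initial_request)
    for _ in range(max_rounds):
        # Perception step: query the toolbox for every requested fact it can provide.
        for name in sorted(request):
            if name in toolbox and name not in facts:
                facts[name] = toolbox[name]()
        # Reasoning step (stubbed): drop facts that are not needed (minimality)...
        for name in set(facts) - needed:
            del facts[name]
        # ...and check which needed facts are still absent (sufficiency).
        missing = needed - set(facts)
        if not missing:
            return facts
        request = missing
    return facts  # may be incomplete if the toolbox cannot supply everything


# Assumed example scene facts; the values are placeholders, not real model outputs.
toolbox: Toolbox = {
    "position_chair": lambda: "(1.0, 0.0, 2.0)",
    "position_table": lambda: "(0.0, 0.0, 2.5)",
    "orientation_chair": lambda: "facing +x",
    "color_wall": lambda: "white",
}
mss = curate_mss(
    toolbox,
    initial_request={"position_chair", "color_wall"},
    needed={"position_chair", "position_table", "orientation_chair"},
)
print(sorted(mss))
```

In this run the first round gathers `position_chair` and `color_wall`, prunes `color_wall` as redundant, and requests the two missing facts; the second round supplies them and the loop stops.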

Related papers