Fugu-MT Paper Translation (Abstract): When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs

Paper overview: When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs

arxiv url: http://arxiv.org/abs/2510.15421v1
Date: Fri, 17 Oct 2025 08:17:27 GMT
Status: Translation complete
Last updated in system: 2025-10-20 20:17:34.534408
Title: When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs
Title (reference translation): Revealing the Limits of Active Reasoning in MLLMs
Authors: Hongcheng Liu, Pingjie Wang, Yuhao Wang, Siqu Ou, Yanfeng Wang, Yu Wang
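Abstract summary: Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. Most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. Can MLLMs actively acquire missing evidence under incomplete information? We require MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. Evaluating 20 superior MLLMs, we find that performance on active reasoning lags far behind performance in passive settings, indicating substantial room for improvement.
Reference score (independently computed attention metric): 29.198301196459834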
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: Can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we require the MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 superior MLLMs and find that performance on active reasoning lags far behind that in passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.
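Abstract (reference translation): Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, in which models perform step-by-step reasoning under complete information. This setup does not match real-world use, where seeing alone is not enough. Can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, MLLMs are required to actively acquire missing evidence and iteratively refine their decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark of perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 superior MLLMs and find that their performance on active reasoning lags far behind their performance in passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods yield consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.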
