Fugu-MT Paper Translation (Summary): Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

Paper summary: Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

arxiv url: http://arxiv.org/abs/2604.06777v1
Date: Wed, 08 Apr 2026 07:48:07 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.404682
Title: Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
Title (reference translation): Bridging the Reasoning-Action Gap for Thinking with Images through Multimodal Agentic Policy Optimization
Authors: Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuchen Zhou, Xiaobo Xia, Yuanyu Wan, Lijun Zhang, Tat-Seng Chua
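Abstract summary: Recent advances in Multimodal Large Language Models (MLLMs) have incentivized models to think with images by actively invoking visual tools during multi-turn reasoning. The common reinforcement learning practice of relying on outcome-based rewards ignores the fact that textual plausibility can mask failures in executing visual actions. The paper introduces Multimodal Agentic Policy Optimization (MAPO), which bridges the gap between textual reasoning and the visual actions a model generates.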
Reference score (independently computed attention score): 89.68681087743876
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to "think with images" by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model's multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO mandates the model to generate explicit textual descriptions for the visual content obtained via tool usage. We then employ a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical findings are provided to justify the rationale behind MAPO, which inherently reduces the variance of gradients, and extensive experiments demonstrate that our method achieves superior performance across multiple visual reasoning benchmarks.
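Abstract (reference translation): Recent progress in Multimodal Large Language Models (MLLMs) has incentivized models to "think with images" by actively invoking visual tools during multi-turn reasoning. The common reinforcement learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks failures in execution. This reasoning-action discrepancy introduces noise that accumulates over the multi-turn reasoning process, severely degrading the model's multimodal reasoning capabilities and potentially leading to training collapse. The paper proposes Multimodal Agentic Policy Optimization (MAPO), which bridges the gap between the textual reasoning and the visual actions that models generate within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO requires the model to generate explicit textual descriptions of the visual content obtained through tool use. It then applies a new advantage estimation method that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical results justify the rationale behind MAPO, which inherently reduces gradient variance, and extensive experiments show that the method achieves superior performance on multiple visual reasoning benchmarks.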
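The abstract says that MAPO couples a semantic-alignment signal with the task reward when estimating advantages, but it does not give the formula. The following is only a minimal sketch of what such a coupling could look like, not the paper's method. It assumes a per-rollout alignment score in [0, 1] between the model's description and the tool observation, an additive coupling controlled by a hypothetical `weight` parameter, and a group-normalized baseline across rollouts for the same prompt. The function name `coupled_advantages` and all parameters are illustrative.

```python
from statistics import mean, pstdev


def coupled_advantages(task_rewards, alignment_scores, weight=0.5, eps=1e-8):
    """Illustrative advantage estimate for one group of rollouts.

    task_rewards: outcome reward per rollout (e.g. 1.0 if the answer is correct).
    alignment_scores: assumed score in [0, 1] for how well each rollout's
        textual description matches what its visual tool call returned.
    weight: hypothetical coefficient for the alignment term (an assumption,
        not a value taken from the paper).
    """
    if len(task_rewards) != len(alignment_scores):
        raise ValueError("need one alignment score per rollout")
    if not task_rewards:
        return []

    # Assumed coupling rule: add a weighted alignment bonus to the task reward.
    shaped = [r + weight * a for r, a in zip(task_rewards, alignment_scores)]

    # Assumed baseline: normalize within the group of rollouts.
    mu = mean(shaped)
    sigma = pstdev(shaped)
    return [(s - mu) / (sigma + eps) for s in shaped]


# Two rollouts reach the same correct answer, but only the first one
# describes its visual observation accurately.
advantages = coupled_advantages([1.0, 1.0], [0.9, 0.1])
```

Under these assumptions, the first rollout receives a positive advantage and the second a negative one, even though an outcome-only reward would treat them identically. That is the kind of reasoning-action discrepancy the abstract describes outcome-based rewards as ignoring.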

Related papers