Fugu-MT paper translation (abstract): Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

Paper overview: Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

arxiv url: http://arxiv.org/abs/2603.28618v1
Date: Mon, 30 Mar 2026 16:03:56 GMT
Status: Translation complete
System update date: 2026-03-31 23:18:45.500559
Title: Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
Title (reference translation): Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
Authors: Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian, Yuan Xiong, Wenting Yan, Jing Shao
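Abstract summary: This paper introduces PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. Compared to the base model, PRCO yields consistent improvements across model scales, gaining over 7 points in average accuracy.
Reference score (independently computed attention score): 30.60184048111503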
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.
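Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of multimodal large language models (MLLMs). However, existing RLVR approaches generally rely on outcome-driven optimization, which updates both perception and reasoning with a shared reward based only on the final answer. This shared reward blurs credit assignment: it often improves reasoning patterns but fails to reliably improve the accuracy of upstream visual evidence extraction. To address this perception bottleneck, the paper introduces PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question, and a Solver that predicts the final answer from that caption. Importantly, PRCO uses role-specific reward signals: the Solver is optimized with verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments on eight challenging multimodal reasoning benchmarks show that PRCO improves average accuracy by over 7 points compared to the base model, consistently across model scales, and outperforms prior open-source RL-tuned baselines.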
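
The role-specific reward scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that the Solver gets a verifiable outcome reward and that the Observer's utility reward is derived from the Solver's downstream success. The exact-match check, the averaging rule, and all function names below are assumptions made for illustration.

```python
from statistics import mean


def solver_reward(predicted: str, gold: str) -> float:
    """Verifiable outcome reward (assumed: case-insensitive exact match)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def observer_reward(solver_rewards: list[float]) -> float:
    """Utility reward for one caption.

    Assumption: the mean success of the Solver answers conditioned on
    that caption. The abstract does not specify the actual formula.
    """
    return mean(solver_rewards) if solver_rewards else 0.0


def assign_rewards(rollouts: dict[str, list[str]], gold: str) -> list[dict]:
    """Score each Observer caption and the Solver answers sampled from it.

    `rollouts` maps a caption (Observer output) to the list of final
    answers the Solver produced when given that caption.
    """
    results = []
    for caption, answers in rollouts.items():
        per_answer = [solver_reward(a, gold) for a in answers]
        results.append(
            {
                "caption": caption,
                "solver_rewards": per_answer,
                "observer_reward": observer_reward(per_answer),
            }
        )
    return results


if __name__ == "__main__":
    # Hypothetical rollouts: two captions, two Solver answers each.
    demo = {
        "The chart shows 42 units in Q3.": ["42", "42 "],
        "The chart shows several bars.": ["7", "42"],
    }
    for row in assign_rewards(demo, gold="42"):
        print(row["observer_reward"], row["solver_rewards"])
```

In this sketch a caption that lets the Solver answer correctly more often earns the Observer a higher reward, which is one simple way to separate credit for perception from credit for reasoning even though both roles share one policy.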

