Fugu-MT Paper Translation (Summary): Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Paper summary: Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

arxiv url: http://arxiv.org/abs/2605.22072v1
Date: Thu, 21 May 2026 07:10:18 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:42.128681
Title: Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention
Title (reference translation): Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention
Authors: Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye
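Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models. We propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning.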
Reference score (independently computed attention score): 41.546578522790114
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated <Focus> token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
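Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This extension, however, exposes a faithfulness challenge with two parts: perceiving task-relevant visual evidence faithfully, and using that evidence faithfully during reasoning. Shortfalls in both lead to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, which exposes a perception-reasoning disconnect in which correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising the attention of a dedicated <Focus> token directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments show that Faithful-MR1 outperforms recent multimodal reasoning baselines on both the Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.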

Related papers