Fugu-MT Paper Translation (Summary): MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

Paper summary: MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.13579v1
Date: Wed, 15 Apr 2026 07:39:08 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.437756
Title: MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning
Title (reference translation): MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning
Authors: Jiahang Lin, Kai Hu, Binghai Wang, Yuhao Zhou, Zhiheng Xi, Honglin Guo, Shichun Liu, Junzhe Wang, Shihan Dou, Enyu Zhou, Hang Yan, Zhenhua Han, Tao Gui, Qi Zhang, Xuanjing Huang
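Abstract summary: We introduce MM-Doc-R1, a novel framework that uses an agentic, vision-aware workflow to address long document visual question answering. We propose Similarity-based Policy Optimization (SPO), which addresses baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms such as GRPO. Experiments on the MMLongbench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%.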
Reference score (independently computed attention metric): 74.07254720088926
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), addressing baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms like GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO which inappropriately applies the initial state's baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to superior training performance that surpasses GRPO. Our experiments on the MMLongbench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%. Furthermore, SPO demonstrates superior performance over GRPO, boosting results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state-of-the-art for complex, long-document visual question answering.
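Abstract (reference translation): Conventional retrieval-augmented generation (RAG) systems rely on single-pass retrieval and therefore often struggle with complex multi-hop queries over long documents. MM-Doc-R1 is a new framework that uses an agentic, vision-aware workflow to handle long-document visual question answering through iterative information discovery and synthesis. To strengthen the agents' information-seeking ability, we propose Similarity-based Policy Optimization (SPO), which addresses the baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms such as GRPO. Our core insight is that, in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimate becomes. Building on this, SPO computes a more precise baseline as a similarity-weighted average of rewards across multiple trajectories, whereas GRPO inappropriately applies the initial state's baseline to every intermediate state. This gives the agents a more stable and accurate learning signal and yields training performance that surpasses GRPO. In experiments on the MMLongbench-Doc benchmark, MM-Doc-R1 outperforms previous baselines by 10.4%. SPO also outperforms GRPO, improving results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B. These results highlight the effectiveness of the integrated framework and the new training algorithm in advancing the state of the art for complex long-document visual question answering.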
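
The baseline idea described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give SPO's formula, so the function names, the similarity matrix, the inclusion of each trajectory's self-similarity, and the omission of any standard-deviation normalization are all assumptions made here for illustration. The sketch only contrasts one group-mean baseline shared by all trajectories (GRPO-style) with a per-trajectory baseline computed as a similarity-weighted average of rewards.

```python
from typing import List, Sequence


def group_mean_baseline(rewards: Sequence[float]) -> List[float]:
    """GRPO-style baseline: a single group mean shared by every trajectory."""
    mean = sum(rewards) / len(rewards)
    return [mean] * len(rewards)


def similarity_weighted_baseline(
    rewards: Sequence[float],
    similarity: Sequence[Sequence[float]],
) -> List[float]:
    """Hypothetical SPO-like baseline (assumed form, not taken from the paper).

    similarity[i][j] is an assumed non-negative similarity score between
    trajectories i and j. The baseline for trajectory i is the average of
    all rewards weighted by row i of the similarity matrix.
    """
    fallback = sum(rewards) / len(rewards)
    baselines = []
    for weights in similarity:
        total = sum(weights)
        if total == 0:
            # No similar trajectory at all: fall back to the group mean.
            baselines.append(fallback)
        else:
            weighted = sum(w * r for w, r in zip(weights, rewards))
            baselines.append(weighted / total)
    return baselines


def advantages(rewards: Sequence[float], baselines: Sequence[float]) -> List[float]:
    """Advantage of each trajectory: its reward minus its baseline."""
    return [r - b for r, b in zip(rewards, baselines)]


if __name__ == "__main__":
    # Four trajectories; 0 and 1 resemble each other, as do 2 and 3.
    rewards = [1.0, 0.0, 0.0, 0.0]
    similarity = [
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
    ]
    print(advantages(rewards, group_mean_baseline(rewards)))
    print(advantages(rewards, similarity_weighted_baseline(rewards, similarity)))
```

In this toy input, the group-mean baseline penalizes trajectories 2 and 3 even though they are unrelated to the successful one, while the similarity-weighted baseline compares each trajectory only against the trajectories it resembles.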

Related papers