Fugu-MT Paper Translation (Abstract): UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Paper overview: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

arxiv url: http://arxiv.org/abs/2604.14967v2
Date: Fri, 17 Apr 2026 02:39:23 GMT
Status: Translation complete
Last updated in system: 2026-04-20 13:38:49.399095
Title: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
Authors: Jun Wang, Shuo Tan, Zelong Sun, Tiancheng Gu, Yongle Zhao, Ziyong Feng, Kaicheng Yang, Zhiwu Lu
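Abstract summary: Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. We propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.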
Reference score (independently computed attention score): 16.669801835057424
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
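The abstract states that training builds on GRPO, which scores each rollout against the other rollouts sampled for the same query instead of using a separate value network. The following is a minimal sketch of that group-relative normalization, combined with a weighted sum of per-action rewards. The reward component names, the weights, the use of a weighted sum, and the choice of population standard deviation are all assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev


def combine_rewards(components, weights):
    """Weighted sum of per-action reward components.

    Both the component names and the weights are hypothetical; the
    abstract only says that each action receives task-aware supervision.
    """
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the group's mean and std.

    This replaces a learned value baseline: the group mean acts as the
    baseline. Population std is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights for four reward terms (illustrative only).
WEIGHTS = {"retrieval": 0.2, "selection": 0.2, "crop": 0.2, "answer": 0.4}

# Four rollouts sampled for one query, each with made-up reward components.
group = [
    {"retrieval": 1.0, "selection": 1.0, "crop": 0.5, "answer": 1.0},
    {"retrieval": 1.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 0.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 1.0, "selection": 1.0, "crop": 1.0, "answer": 1.0},
]

scalar_rewards = [combine_rewards(c, WEIGHTS) for c in group]
advantages = group_relative_advantages(scalar_rewards)
```

Rollouts that score above the group average receive positive advantages and the rest receive negative ones, so the policy update needs only the sampled group, not a critic.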


Related papers