Fugu-MT Paper Translation (Abstract): PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

Paper overview: PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

arxiv url: http://arxiv.org/abs/2606.08708v1
Date: Sun, 07 Jun 2026 16:06:29 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.404479
Title: PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping
Authors: Qiming Li, Tianlun Li, Xiaolong Cheng, Hangyu Li, Ruiyan Gong, Kangning Niu, Kaitao Jiang, Mu Xu
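Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). Existing RLVR methods rely on trajectory-level outcome rewards, which assign identical learning signals to all generated tokens. This paper proposes Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework that explicitly identifies and reinforces pivotal perceptual tokens.
Reference score (independently computed attention score): 5.89473045822308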
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). However, existing RLVR methods primarily rely on trajectory-level outcome rewards, which assign identical learning signals across all generated tokens. This coarse-grained credit assignment is fundamentally mismatched to multimodal reasoning, where only a sparse subset of tokens is causally grounded in visual evidence. Consequently, these pivotal perceptual tokens receive weak supervision and are often overwhelmed by language priors or reasoning-template tokens. To address this limitation, we propose Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework that explicitly identifies and reinforces pivotal perceptual tokens within long-horizon multimodal reasoning trajectories. PRPO introduces Robust Visual Dependency (RVD), a principled metric that identifies tokens whose predictions are both visually grounded and perturbation-stable, filtering out brittle or noisy visual tokens. Based on RVD, we further propose Perceptual Advantage Reshaping (PAR), a token-level credit assignment technique that amplifies perceptually informative tokens while preserving stable gradients for non-perceptual tokens. Extensive experiments on seven multimodal reasoning benchmarks demonstrate that PRPO consistently outperforms strong LVLM baselines across both 3B and 7B model scales, achieving average gains of 23.3% and 21.1%, respectively. PRPO achieves state-of-the-art performance with improved training efficiency and stronger cross-task generalization. Our findings highlight the importance of fine-grained credit assignment for scalable multimodal reinforcement learning.
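The abstract describes RVD and PAR only in words and gives no formulas, so the following is a minimal illustrative sketch of the general idea, not the authors' definitions. It assumes that per-token log-probabilities are available under three conditions (original image, image removed, lightly perturbed image). The function names, the `tanh`/`exp` scoring, and the `alpha` and `tau` parameters are all hypothetical choices made for this example.

```python
import math


def robust_visual_dependency(logp_clean, logp_blind, logp_perturbed, tau=1.0):
    """Hypothetical per-token score in [0, 1); NOT the paper's RVD formula.

    logp_clean:     log-prob of each sampled token given the original image
    logp_blind:     log-prob of the same token with the image removed
    logp_perturbed: log-prob of the same token given a lightly perturbed image
    """
    scores = []
    for lc, lb, lp in zip(logp_clean, logp_blind, logp_perturbed):
        # Visual grounding: how much the image raises this token's log-prob.
        grounding = max(0.0, lc - lb)
        # Stability: close to 1 when a small perturbation barely changes the token.
        stability = math.exp(-abs(lc - lp) / tau)
        scores.append(math.tanh(grounding) * stability)
    return scores


def reshape_advantages(trajectory_advantage, scores, alpha=1.0):
    """Assumed reshaping rule: scale the shared trajectory-level advantage
    by (1 + alpha * score), so tokens with score 0 keep the original value."""
    return [trajectory_advantage * (1.0 + alpha * s) for s in scores]


# Three toy tokens: a template token, a stable perceptual token,
# and a visually dependent but perturbation-brittle token.
logp_clean = [-0.2, -0.1, -0.1]
logp_blind = [-0.2, -3.0, -3.0]
logp_perturbed = [-0.2, -0.1, -4.0]

scores = robust_visual_dependency(logp_clean, logp_blind, logp_perturbed)
token_advantages = reshape_advantages(0.5, scores)
```

In this toy setup the template token keeps the unmodified advantage, the stable perceptual token receives the largest one, and the brittle token is only slightly amplified. Because the rule is multiplicative, a negative trajectory advantage would be amplified in magnitude in the same way.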


Related papers