Fugu-MT Paper Translation (Summary): Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

Paper summary: Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2604.01840v1
Date: Thu, 02 Apr 2026 09:53:20 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.662376
Title: Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
Title (reference translation): Not all tokens see equally: perception-grounded policy optimization for large vision-language models
Authors: Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin
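Abstract summary: Perception-Grounded Policy Optimization (PGPO) is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. PGPO actively amplifies learning signals for visually dependent tokens while suppressing gradient noise from linguistic priors. Theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning.
Reference score (independently computed attention score): 38.47027398567909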
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published at https://github.com/Yzk1114/PGPO.
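Abstract (reference translation): Although Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: they assign identical advantages to every generated token, which dilutes the learning signal for the visually grounded steps of multimodal reasoning. To bridge this gap, the authors formulate Token Visual Dependency, which quantifies the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. They introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks show that PGPO improves models by 18.7% on average. Theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published at https://github.com/Yzk1114/PGPO.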
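
The abstract names two ingredients: a per-token KL divergence between the visual-conditioned and text-only predictive distributions, and a threshold-gated, mass-conserving reshaping of advantages. The following is a minimal NumPy sketch of how these could fit together, not the authors' implementation. The function names, the `boost` factor, and the specific reweighting rule (gate on a threshold, then rescale the weights to mean 1) are assumptions made for illustration; the abstract does not specify the actual formulas.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def token_visual_dependency(logits_with_image, logits_text_only):
    """Per-token KL(p_visual || p_text) over the vocabulary.

    Both inputs have shape (T, V): T generated tokens, V vocabulary entries.
    Returns an array of shape (T,).
    """
    logp_v = log_softmax(logits_with_image)
    logp_t = log_softmax(logits_text_only)
    return (np.exp(logp_v) * (logp_v - logp_t)).sum(axis=-1)


def reshape_advantages(advantage, dependency, threshold, boost=2.0):
    """Hypothetical threshold-gated, mass-conserving reweighting.

    Tokens whose dependency exceeds `threshold` get raw weight `boost`,
    all others get 1. The weights are then rescaled to mean 1, so the
    token-level advantages sum to T * advantage, the same total as when
    every token receives the identical sequence-level advantage.
    """
    raw = np.where(dependency > threshold, boost, 1.0)
    weights = raw * raw.size / raw.sum()
    return advantage * weights


# Toy usage: 5 tokens, vocabulary of 11; only token 2 depends on the image.
rng = np.random.default_rng(0)
logits_v = rng.normal(size=(5, 11))
logits_t = logits_v.copy()
logits_t[2] = rng.normal(size=11)

dep = token_visual_dependency(logits_v, logits_t)
token_adv = reshape_advantages(1.5, dep, threshold=1e-6)
```

In this toy setup the dependency is zero wherever the two distributions coincide, so only the token whose prediction changes with the image is amplified, while the remaining tokens are scaled down slightly to keep the total advantage unchanged.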

