Fugu-MT Paper Translation (Abstract): PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

Paper overview: PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

arxiv url: http://arxiv.org/abs/2605.13467v1
Date: Wed, 13 May 2026 12:55:18 GMT
Status: translation complete
Last updated in system: 2026-05-14 23:30:28.053424
Title: PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
Title (reference translation): PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
Authors: Hee Suk Yoon, Eunseop Yoon, Ji Woo Hong, SooHwan Eom, Gwanhyeong Koo, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo
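Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that a fine-grained, model-intrinsic signal improves language reasoning training by providing step-level guidance without costly external models. While this is effective for unimodal text, applying the same global reward to vision-language (V-L) reasoning is a suboptimal strategy. The paper proposes Perception-Decomposed Confidence Reward (PDCR), a framework that solves this problem by aligning the reward structure with the task's heterogeneous nature.
Reference score (independently computed attention score): 80.94559742826083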
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
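Abstract (reference translation): Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that a fine-grained, model-intrinsic signal, one that rewards growth in confidence in the ground-truth answer, effectively improves language reasoning training by giving step-level guidance without costly external models. Although this works for unimodal text, the authors find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, because the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. Global normalization causes mixture-induced signal degradation: the training signal for visual steps is statistically distorted by the dominant textual steps. The paper proposes Perception-Decomposed Confidence Reward (PDCR), a framework that solves this problem by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception steps from reasoning steps. On that basis, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly scaled signal for both perception and reasoning. The authors show that PDCR outperforms the naive global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.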
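The contrast the abstract draws between global and intra-cluster normalization can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the abstract only says "a clustering algorithm", so a one-dimensional two-means split stands in for it; "normalizing confidence gains" is read as a z-score; and the function names (`two_means_1d`, `global_advantage`, `decomposed_advantage`) and all input numbers are hypothetical.

```python
from statistics import mean, pstdev


def two_means_1d(scores, iters=20):
    """Split 1-D scores into two clusters: 0 = low, 1 = high.

    Assumption: a plain two-means split stands in for the unspecified
    clustering algorithm applied to the Visual Dependence Score.
    """
    lo, hi = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iters):
        # Assign each score to the nearer of the two centers.
        labels = [1 if abs(s - hi) < abs(s - lo) else 0 for s in scores]
        low = [s for s, lab in zip(scores, labels) if lab == 0]
        high = [s for s, lab in zip(scores, labels) if lab == 1]
        if not low or not high:
            break  # degenerate case: everything fell into one cluster
        new_lo, new_hi = mean(low), mean(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return labels


def zscore(values, eps=1e-8):
    """Standardize values; eps avoids division by zero for constant input."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def global_advantage(gains):
    """Naive baseline: normalize all step-level confidence gains together."""
    return zscore(gains)


def decomposed_advantage(gains, labels):
    """Normalize confidence gains separately within each skill cluster."""
    adv = [0.0] * len(gains)
    for cluster in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == cluster]
        for i, a in zip(idx, zscore([gains[i] for i in idx])):
            adv[i] = a
    return adv


if __name__ == "__main__":
    # Hypothetical per-step values: two perception-like steps (high visual
    # dependence, small gains) among six reasoning-like steps (large gains).
    vds = [0.9, 0.1, 0.05, 0.15, 0.85, 0.1, 0.2, 0.1]
    gains = [0.02, 0.30, -0.25, 0.10, -0.01, 0.40, -0.35, 0.20]
    labels = two_means_1d(vds)
    print("labels:    ", labels)
    print("global:    ", [round(a, 2) for a in global_advantage(gains)])
    print("decomposed:", [round(a, 2) for a in decomposed_advantage(gains, labels)])
```

With these made-up inputs, the two high-dependence steps both receive small negative values under global normalization, because their gains are tiny relative to the spread of the reasoning steps. Normalizing within the cluster instead gives them opposite signs at roughly unit magnitude, which is the kind of rescaling the abstract attributes to intra-cluster normalization.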

List of related papers