Fugu-MT Paper Translation (Abstract): Reward Hacking in Rubric-Based Reinforcement Learning

Paper overview: Reward Hacking in Rubric-Based Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.12474v1
Date: Tue, 12 May 2026 17:54:25 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:57.073031
Title: Reward Hacking in Rubric-Based Reinforcement Learning
Authors: Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He
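Abstract summary: A policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges. The framework separates two sources of divergence: verifier failure and rubric-design limitations. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers.
Reference score (site-computed attention score): 23.418394508756464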
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.
