Fugu-MT Paper Translation (Abstract): Discretizing Reward Models

Paper overview: Discretizing Reward Models

arxiv url: http://arxiv.org/abs/2606.21795v1
Date: Fri, 19 Jun 2026 23:13:59 GMT
Status: Translation complete
Last updated in system: 2026-06-26 03:08:27.79691
Title: Discretizing Reward Models
Title (reference translation): Discretizing Reward Models
Authors: Vijay Viswanathan, Shiqi Wang, Devamanyu Hazarika, Chirag Nagpal, Tongshuang Wu, Graham Neubig, Yuning Mao
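Abstract summary: We show that many popular reward models are oversensitive, assigning different scores to equally good responses. We propose evaluating reward models using measures of "discriminative ability" and "specificity."
Reference score (independently computed attention score): 70.71071807050916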
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite their widespread use, the role of reward models in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality in the absence of verifiers or human judges. Unlike "verifiable rewards" which typically produce binary scores, reward models typically produce continuous scores, allowing them to be sensitive to fine-grained differences in responses. However, we show this apparent strength is a serious weakness: many popular reward models are oversensitive, assigning different scores to equally good responses. Theoretically, we show that seemingly perfect reward models can be highly oversensitive; empirically, this oversensitivity can lead to bad policies. In place of existing notions of "reward model accuracy," we propose evaluating reward models using distinct measures of "discriminative ability" and "specificity" (the complement of oversensitivity). As a solution, we describe a training-free algorithm that uses Monte Carlo dropout on any neural reward model to produce discrete reward clusters. Theoretically, we prove there exist discretizations that reduce oversensitivity at minimal expense of discriminative ability; empirically we show, in both controlled and natural RL settings, that discretizing rewards leads to less reward hacking and better policies than training on the original rewards.
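Abstract (reference translation): Although reward models are widely used, their role in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality when no verifiers or human judges are available. Unlike "verifiable rewards," which usually produce binary scores, reward models usually produce continuous scores, which lets them be sensitive to fine-grained differences between responses. However, this apparent strength is a serious weakness: many popular reward models are oversensitive and assign different scores to equally good responses. Theoretically, even seemingly perfect reward models can be highly oversensitive; empirically, this oversensitivity can lead to bad policies. In place of existing notions of "reward model accuracy," we propose evaluating reward models with distinct measures of "discriminative ability" and "specificity" (the complement of oversensitivity). As a solution, we describe a training-free algorithm that applies Monte Carlo dropout to any neural reward model to produce discrete reward clusters. Theoretically, we prove that there exist discretizations that reduce oversensitivity at minimal cost to discriminative ability; empirically, in both controlled and natural RL settings, discretized rewards lead to less reward hacking and better policies than training on the original rewards.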
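
The abstract only names the ingredients of the proposed method (Monte Carlo dropout on a reward model, followed by grouping into discrete reward clusters) and does not spell out the procedure. The following is a minimal sketch of that general idea, not the authors' algorithm. Everything in it is an assumption made for illustration: the toy linear scorer standing in for a neural reward model, the function names `mc_dropout_scores` and `discretize`, the parameters `drop_prob` and `z`, and the rule that merges neighbouring responses whose mean scores differ by less than `z` times their dropout-induced standard deviation.

```python
import random
import statistics


def mc_dropout_scores(features, weights, n_samples=200, drop_prob=0.1, seed=0):
    """Score one response n_samples times, each time with a fresh dropout mask.

    Hypothetical stand-in for a neural reward model: a linear scorer whose
    inputs are randomly zeroed and rescaled by 1 / (1 - drop_prob).
    """
    rng = random.Random(seed)
    keep = 1.0 - drop_prob
    scores = []
    for _ in range(n_samples):
        total = 0.0
        for x, w in zip(features, weights):
            if rng.random() < keep:  # this input survives dropout
                total += w * x / keep
        scores.append(total)
    return scores


def discretize(sample_sets, z=2.0):
    """Map each response to an integer cluster id (higher means better).

    Assumed rule: sort responses by mean score and start a new cluster only
    when the gap to the previous mean exceeds z times the larger of the two
    standard deviations. Smaller gaps are treated as dropout noise.
    """
    stats = [(statistics.fmean(s), statistics.pstdev(s)) for s in sample_sets]
    order = sorted(range(len(stats)), key=lambda i: stats[i][0])
    labels = [0] * len(stats)
    cluster = 0
    for prev, cur in zip(order, order[1:]):
        gap = stats[cur][0] - stats[prev][0]
        noise = max(stats[prev][1], stats[cur][1])
        if gap > z * noise:
            cluster += 1
        labels[cur] = cluster
    return labels


# Two near-identical responses and one clearly different response.
weights = [1.0, 0.5, -0.3, 0.8]
responses = [
    [1.0, 1.0, 1.0, 1.0],
    [1.02, 1.0, 1.0, 1.0],
    [5.0, 4.0, 0.0, 5.0],
]
samples = [mc_dropout_scores(r, weights, seed=i) for i, r in enumerate(responses)]
labels = discretize(samples)
print(labels)
```

In this sketch the first two responses would receive slightly different continuous scores from a deterministic scorer, which is the kind of oversensitivity the abstract describes. Because their gap is small relative to the dropout noise, the assumed rule places them in the same cluster, while the third response lands in a higher one. The threshold `z` trades specificity against discriminative ability: larger values merge more responses.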


Related papers