Fugu-MT paper translation (abstract): Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking

Paper overview: Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking

arxiv url: http://arxiv.org/abs/2510.13694v1
Date: Wed, 15 Oct 2025 15:51:59 GMT
Status: translation complete
Last updated in system: 2025-10-16 20:13:28.749163
Title: Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
Authors: Yuchun Miao, Liang Ding, Sen Zhang, Rong Bao, Lefei Zhang, Dacheng Tao
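Abstract summary: This paper proposes InfoRM, an information-theoretic reward modeling framework based on the Information Bottleneck (IB) principle. InfoRM filters out preference-irrelevant information to alleviate reward misgeneralization. IBL is a distribution-level regularization that penalizes deviations from the SFT-induced distribution in the IB latent space, effectively expanding the optimization landscape.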
Reference score (independently computed attention score): 78.69179041551014
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the success of Reinforcement Learning from Human Feedback (RLHF) in aligning language models with human values, reward hacking (also called reward over-optimization) remains a major challenge. We identify two key obstacles to its mitigation: (1) reward misgeneralization in reward modeling, where reward models overfit to spurious, preference-irrelevant features; and (2) the lack of suitable regularization during RL optimization, as existing token-level constraints often over-restrict the policy space. To address these issues, we propose InfoRM, an information-theoretic reward modeling framework based on the Information Bottleneck (IB) principle, which filters out preference-irrelevant information to alleviate reward misgeneralization. We further observe that reward-hacked responses manifest as pronounced outliers in InfoRM's IB latent space, measured by Mahalanobis distance from the SFT-induced distribution. Motivated by this, we introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape while maintaining alignment. We prove that IBL is theoretically equivalent to the pessimistic RL objective within the IB latent space. Finally, we present Mahalanobis Outlier Probability (MOP), a statistical metric for quantifying reward hacking severity, enabling principled hyperparameter tuning and online mitigation such as early stopping. Extensive experiments across diverse LLMs and datasets confirm the generality of our findings, the effectiveness of InfoRM and IBL, and the reliability of MOP as a diagnostic tool, collectively advancing the state of RLHF.
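The outlier idea in the abstract can be illustrated with a minimal sketch. It assumes latent vectors are already available as NumPy arrays, fits a single Gaussian to the SFT latents, and flags RLHF latents whose Mahalanobis distance exceeds an empirical quantile of the SFT distances. The function names, the ridge term, and the quantile threshold are all hypothetical choices made for illustration; the abstract does not give the exact definition of MOP, so this is not the authors' formula.

```python
import numpy as np

def fit_gaussian(sft_latents, ridge=1e-6):
    """Estimate the mean and inverse covariance of the SFT latent vectors."""
    mean = sft_latents.mean(axis=0)
    cov = np.cov(sft_latents, rowvar=False)
    cov = cov + ridge * np.eye(cov.shape[0])  # assumed ridge, keeps cov invertible
    return mean, np.linalg.inv(cov)

def mahalanobis(latents, mean, cov_inv):
    """Mahalanobis distance of each row of `latents` from the fitted Gaussian."""
    diff = latents - mean
    squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(squared, 0.0))

def outlier_fraction(rl_latents, sft_latents, quantile=0.99):
    """Fraction of RLHF latents farther out than the given quantile of SFT distances.

    The empirical-quantile threshold is an assumption of this sketch.
    """
    mean, cov_inv = fit_gaussian(sft_latents)
    threshold = np.quantile(mahalanobis(sft_latents, mean, cov_inv), quantile)
    return float(np.mean(mahalanobis(rl_latents, mean, cov_inv) > threshold))

# Synthetic stand-ins for latent vectors (no real model is involved).
rng = np.random.default_rng(0)
sft = rng.normal(size=(2000, 8))
in_dist = rng.normal(size=(500, 8))           # resembles the SFT distribution
shifted = rng.normal(loc=4.0, size=(500, 8))  # drifted far from the SFT distribution
frac_in = outlier_fraction(in_dist, sft)
frac_shifted = outlier_fraction(shifted, sft)
```

In this toy setup the drifted batch yields a much larger outlier fraction than the in-distribution batch, which is the kind of signal the abstract describes using for hyperparameter tuning and early stopping.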


Related papers