Fugu-MT Paper Translation (Abstract): HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

Paper abstract: HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

arxiv url: http://arxiv.org/abs/2606.03131v1
Date: Tue, 02 Jun 2026 04:18:08 GMT
Status: Translation complete
Last updated in system: 2026-06-03 22:00:04.755146
Title: HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models
Authors: Shuang Liu, Yuxuan Bo, Qiuyang Zhao, Caiyue Huang, Xiaorong Chen, Yanguang Liu, Mengnan Du
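Abstract summary: Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. HARVE is a training-free reward-head editing method for scalar reward models. Experiments show that HARVE improves hacking robustness, outperforms fine-tuning baselines, and preserves the reward models' general capability.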
Reference score (independently computed attention score): 21.09987641039239
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. To evaluate reward-model robustness, we introduce RewardHackBench, which contains 13 reward-hacking patterns covering real-life high-stakes domains and general settings, and we find severe failures on specific subcategories across eight reward models. To mitigate these failures, we propose HARVE, a training-free reward-head editing method for scalar reward models. Instead of fine-tuning the reward model, HARVE identifies a multi-directional hacking subspace from residual stream directions associated with selected hacking subcategories, and removes the component of the reward-head vector aligned with that subspace. This directly reduces the reward head's sensitivity to hacking-related features using only a small set of contrastive gold-hacked examples, without gradient updates or fine-tuning. Comprehensive experiments across eight reward models indicate that HARVE improves hacking robustness, outperforms fine-tuning baselines, and preserves the reward models' general capability. Further analyses suggest that reward hacking is better captured as a multidimensional residual-space structure than by isolated surface cues.
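
The editing step described in the abstract (remove the component of the reward-head vector that lies in a hacking subspace) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the residual stream directions are obtained, so the sketch assumes one difference-of-means direction per subcategory (hacked minus gold activations) and uses an SVD to get an orthonormal basis. The function names, the subcategory names, and the synthetic activations are all hypothetical.

```python
import numpy as np

def hacking_subspace(gold_acts, hacked_acts, tol=1e-8):
    """Return an orthonormal basis (d_model, k) for the assumed hacking subspace.

    gold_acts / hacked_acts map a subcategory name to an array of
    residual-stream activations with shape (n_examples, d_model).
    ASSUMPTION: each direction is the difference of class means.
    """
    directions = [
        hacked_acts[name].mean(axis=0) - gold_acts[name].mean(axis=0)
        for name in gold_acts
    ]
    D = np.stack(directions, axis=1)              # (d_model, n_subcategories)
    U, s, _ = np.linalg.svd(D, full_matrices=False)
    keep = s > tol * s[0]                         # drop numerically dependent directions
    return U[:, keep]

def edit_reward_head(w, Q):
    """Project the reward-head vector w onto the orthogonal complement of span(Q)."""
    return w - Q @ (Q.T @ w)

# Synthetic demo data (hypothetical subcategories, random activations).
rng = np.random.default_rng(0)
d_model = 16
gold = {c: rng.normal(size=(8, d_model)) for c in ("length_padding", "sycophancy")}
# Hacked activations = gold activations shifted along one random direction each.
hacked = {c: g + rng.normal(size=d_model) for c, g in gold.items()}

w = rng.normal(size=d_model)                      # stand-in for a scalar reward head
Q = hacking_subspace(gold, hacked)
w_edit = edit_reward_head(w, Q)

# The edited head should be (numerically) orthogonal to every basis vector.
print(np.abs(Q.T @ w_edit).max())
```

Because the edit is an orthogonal projection, applying it twice gives the same vector as applying it once, and no gradient step is involved, which matches the training-free framing of the abstract. How many directions to keep and which layer to read activations from are choices the abstract leaves open.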


Related papers