Fugu-MT Paper Translation (Summary): IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking

Paper summary: IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking

arxiv url: http://arxiv.org/abs/2602.19416v1
Date: Mon, 23 Feb 2026 01:14:53 GMT
Status: Translation complete
Last updated in system: 2026-02-24 17:42:02.633437
Title: IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
Title (reference translation): IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
Authors: Mohammad Beigi, Ming Jin, Junshan Zhang, Jiaxin Zhang, Qifan Wang, Lifu Huang
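Abstract summary: Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking. IR3 (Interpretable Reward Reconstruction and Rectification) is a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. The authors show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.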
Reference score (attention score computed independently by this site): 67.20568716300272
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking - models exploit spurious correlations in proxy rewards without genuine alignment. Compounding this, the objectives internalized during RLHF remain opaque, making hacking behaviors difficult to detect or correct. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We propose Contrastive Inverse Reinforcement Learning (C-IRL), which reconstructs the implicit reward function by contrasting paired responses from post-alignment and baseline policies to explain behavioral shifts during RLHF. We then decompose the reconstructed reward via sparse autoencoders into interpretable features, enabling identification of hacking signatures through contribution analysis. Finally, we propose mitigation strategies - clean reward optimization, adversarial shaping, constrained optimization, and feature-guided distillation - that target problematic features while preserving beneficial alignment. Experiments across multiple reward model configurations show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
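Abstract (reference translation): Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking. Compounding this, the objectives internalized during RLHF remain opaque, which makes hacking behaviors difficult to detect or correct. IR3 (Interpretable Reward Reconstruction and Rectification) is a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. The paper proposes Contrastive Inverse Reinforcement Learning (C-IRL), which reconstructs the implicit reward function by contrasting paired responses from the post-alignment and baseline policies in order to explain the behavioral shifts that occur during RLHF. The reconstructed reward is then decomposed via sparse autoencoders into interpretable features, which enables identification of hacking signatures through contribution analysis. Finally, the paper proposes mitigation strategies (clean reward optimization, adversarial shaping, constrained optimization, and feature-guided distillation) that target problematic features while preserving beneficial alignment. Experiments across multiple reward model configurations show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.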
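The abstract describes two steps that a toy example can make concrete: reconstructing a reward by contrasting paired responses from the post-alignment and baseline policies, and then ranking features by their contribution to that reward. The sketch below is not the paper's C-IRL objective, which the abstract does not spell out. It assumes a linear reward over precomputed features (a stand-in for sparse-autoencoder features), a Bradley-Terry style logistic loss, and synthetic data with a planted "hacking" feature. All function names, feature indices, and hyperparameters are hypothetical.

```python
import numpy as np


def fit_contrastive_reward(phi_aligned, phi_base, l2=0.1, lr=0.5, steps=500):
    """Fit linear reward weights w so post-alignment responses outscore baseline ones.

    Assumed loss (not taken from the paper):
        mean(log(1 + exp(-w . (phi_aligned - phi_base)))) + (l2 / 2) * ||w||^2
    """
    diff = phi_aligned - phi_base              # shape: (n_pairs, n_features)
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        margin = diff @ w
        p = 1.0 / (1.0 + np.exp(-margin))      # modelled P(aligned response preferred)
        grad = -((1.0 - p)[:, None] * diff).mean(axis=0) + l2 * w
        w -= lr * grad                         # plain gradient descent step
    return w


def feature_contributions(w, phi_aligned, phi_base):
    """Per-feature contribution: weight times mean feature shift across pairs."""
    return w * (phi_aligned - phi_base).mean(axis=0)


def make_toy_pairs(n_pairs=400, n_features=5, seed=0):
    """Synthetic paired features; the shifts below are invented for illustration."""
    rng = np.random.default_rng(seed)
    phi_base = rng.normal(size=(n_pairs, n_features))
    shift = np.zeros(n_features)
    shift[0] = 0.5   # hypothetical benign feature with a mild shift
    shift[3] = 2.0   # hypothetical hacked feature with a strong shift
    noise = 0.5 * rng.normal(size=(n_pairs, n_features))
    return phi_base + shift + noise, phi_base


phi_aligned, phi_base = make_toy_pairs()
w = fit_contrastive_reward(phi_aligned, phi_base)
contrib = feature_contributions(w, phi_aligned, phi_base)
top_feature = int(np.argmax(contrib))
print("weights:", np.round(w, 3))
print("contributions:", np.round(contrib, 3))
print("largest contributor:", top_feature)
```

On this toy data the planted feature (index 3) should receive the largest contribution, which is the kind of signal a contribution analysis would flag for inspection. Deciding whether a flagged feature reflects hacking or genuine alignment still needs a separate judgment that this sketch does not attempt.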

Related papers

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking [69.06218054848803]
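The paper proposes Adversarial Reward Auditing (ARA), a framework that reframes reward hacking as a dynamic, competitive game. First, a hacker policy discovers vulnerabilities in the reward model while an auditor learns to detect exploits from latent representations. ARA achieves the best alignment-utility trade-off among all baselines.
Paper reference translation (metadata) (2026-02-02T07:34:57Z)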
From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
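The paper proposes reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking only the final answer, RLVRR extracts an ordered linguistic signal (i.e., a reward chain) from high-quality references. In this way, RLVRR decomposes the reward into two dimensions.
Paper reference translation (metadata) (2026-01-26T14:39:58Z)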
Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking [13.417125511014447]
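The paper proposes an automated framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or that modify the proxy reward function using other approaches.
Paper reference translation (metadata) (2025-10-14T23:18:24Z)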
IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards [22.802937805177773]
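Instruction Following Decorator (IFDecorator) is a framework that wraps RLVR training into a robust and sample-efficient pipeline. The authors' Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Their trip wires show a significant reduction in reward hacking rates.
Paper reference translation (metadata) (2025-08-06T17:00:54Z)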
Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction [5.518813485456855]
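External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks. These systems are prone to reward hacking, in which the PRM assigns high scores to logically incorrect paths and thereby leads to incorrect answers. The paper proposes Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path.
Paper reference translation (metadata) (2025-08-06T08:48:55Z)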
Reward Shaping to Mitigate Reward Hacking in RLHF [47.71454266800376]
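Preference As Reward (PAR) is a novel approach that leverages the latent preferences embedded in the reward model as the signal for reinforcement learning. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches.
Paper reference translation (metadata) (2025-02-26T02:57:59Z)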
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
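Reward hacking (also called reward overoptimization) remains a critical challenge. The paper proposes InfoRM, a framework for reward modeling. It shows that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
Paper reference translation (metadata) (2024-02-14T17:49:07Z)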
REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
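Misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate this misalignment by learning reward functions from human preferences. The paper proposes a novel concept of reward regularization within the robotic RLHF framework.
Paper reference translation (metadata) (2023-12-22T04:56:37Z)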

The related papers list is generated automatically from the titles and abstracts of papers on this site.

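This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from its use.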