Fugu-MT Paper Translation (Abstract): Reward Hacking as Equilibrium under Finite Evaluation

Paper overview: Reward Hacking as Equilibrium under Finite Evaluation

arxiv url: http://arxiv.org/abs/2603.28063v1
Date: Mon, 30 Mar 2026 06:06:40 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.249093
Title: Reward Hacking as Equilibrium under Finite Evaluation
Title (reference translation): Reward Hacking as Equilibrium under Finite Evaluation
Authors: Jiacheng Wang, Jinbin Huang
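Abstract summary: Under five minimal axioms, any optimized AI agent systematically under-invests effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium rather than a correctable bug. With partial formal analysis, we conjecture the existence of a capability threshold beyond which agents transition from gaming within the evaluation system to actively degrading the evaluation system itself.
Reference score (independently computed attention score): 4.0834639890017295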
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We prove that under five minimal axioms -- multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction -- any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems -- the known, differentiable architecture of reward models -- to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows -- because quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool -- so that hacking severity increases structurally and without bound. Our results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure and yield an actionable vulnerability assessment procedure. We further conjecture -- with partial formal analysis -- the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom's (2014) "treacherous turn."
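Abstract (reference translation): We prove that under five minimal axioms (multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction), any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium rather than a correctable bug, and it holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or the evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems (the known, differentiable architecture of reward models) to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further show that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows, because quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool. These results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure and yield an actionable vulnerability assessment procedure. Finally, with partial formal analysis, we conjecture the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom's (2014) "treacherous turn."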
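
The coverage argument in the abstract (quality dimensions growing combinatorially while evaluation cost grows at most linearly per tool) can be illustrated with a toy calculation. The sketch below is not the paper's model: it assumes, purely for illustration, one quality dimension per non-empty subset of tools and a fixed number of evaluated dimensions per tool. The function name `coverage` and the parameter `evals_per_tool` are hypothetical.

```python
def coverage(num_tools: int, evals_per_tool: int = 10) -> float:
    """Toy estimate of the fraction of quality dimensions that are evaluated.

    Assumptions (illustrative only, not taken from the paper):
      * each non-empty subset of tools adds one quality dimension,
        so there are 2**num_tools - 1 dimensions in total;
      * the evaluator can afford `evals_per_tool` dimensions per tool,
        so the evaluated count grows linearly with the tool count.
    """
    if num_tools < 1:
        raise ValueError("num_tools must be at least 1")
    dimensions = 2 ** num_tools - 1          # combinatorial growth
    evaluated = evals_per_tool * num_tools   # linear growth
    return min(1.0, evaluated / dimensions)  # coverage cannot exceed 100%


if __name__ == "__main__":
    # Print how coverage changes as more tools are added.
    for n in (1, 2, 4, 8, 16, 32):
        print(f"tools={n:2d}  coverage={coverage(n):.8f}")
```

Under these assumptions, coverage stays at 1.0 for a handful of tools and then falls toward zero as the exponential term dominates, which matches the qualitative direction claimed in the abstract. The exact rate depends entirely on the assumed counting rule.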

Related papers