Fugu-MT Paper Translation (Abstract): Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Paper overview: Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

arxiv url: http://arxiv.org/abs/2604.13602v1
Date: Wed, 15 Apr 2026 08:11:34 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.447206
Title: Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Title (reference translation): Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Authors: Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng, Xuanjing Huang
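Abstract summary: This paper proposes the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator-policy co-adaptation. This perspective unifies empirical phenomena across the RLHF, RLAIF, and RLVR regimes and explains how local shortcut learning can generalize into broader forms of misalignment.
Reference score (independently computed attention score): 87.04241991512386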
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception-reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator-policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.
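Abstract (reference translation): Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward behaviors that humans prefer. These approaches, however, introduce a systemic vulnerability, reward hacking, in which models exploit flaws in learned reward signals to maximize proxy objectives without satisfying the true intent of the task. As models scale and optimization intensifies, this shows up as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception-reasoning decoupling and evaluator manipulation. Recent evidence suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. This survey proposes the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator-policy co-adaptation. This perspective unifies empirical phenomena across the RLHF, RLAIF, and RLVR regimes and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment at scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.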
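Illustrative sketch (not from the paper): the abstract's claim that optimization pressure amplifies the gap between a compressed proxy and the true objective can be seen in a toy best-of-n selection. Everything here is an assumption made for illustration: the two-feature responses (quality, length), the functions true_utility and proxy_reward, and their coefficients are hypothetical and are not the paper's formalization of PCH.

```python
import random


def true_utility(quality, length):
    # Hypothetical "true" objective: rewards quality, penalizes padding past 5 units.
    return quality - 0.3 * max(0.0, length - 5.0)


def proxy_reward(quality, length):
    # Hypothetical compressed proxy: weights quality weakly and mistakes length for quality.
    return 0.3 * quality + 0.2 * length


def sample_candidates(n, seed=0):
    # Each candidate response is a (quality, length) pair drawn from a seeded generator.
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1.0), rng.uniform(0.0, 20.0)) for _ in range(n)]


def best_of_n(candidates, score):
    # Select the candidate that maximizes the given scoring function.
    return max(candidates, key=lambda c: score(*c))


def sweep(ns=(1, 4, 16, 64, 256), seed=0):
    # Nested pools: the first n candidates of one shared pool, so larger n
    # means strictly more optimization pressure on the same samples.
    pool = sample_candidates(max(ns), seed)
    rows = []
    for n in ns:
        subset = pool[:n]
        picked = best_of_n(subset, proxy_reward)   # what proxy optimization selects
        ideal = best_of_n(subset, true_utility)    # what the true objective would select
        rows.append((n, proxy_reward(*picked), true_utility(*picked), true_utility(*ideal)))
    return rows


if __name__ == "__main__":
    for n, proxy, true_picked, true_ideal in sweep():
        print(f"n={n:4d}  proxy={proxy:6.2f}  true(picked)={true_picked:6.2f}  true(best)={true_ideal:6.2f}")
```

By construction, the proxy score of the selected response never decreases as n grows, while its true utility is bounded above by that of the response the true objective would have chosen. How large the gap becomes depends entirely on the assumed coefficients and sampling distributions.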


Related papers