Fugu-MT Paper Translation (Abstract): Murphys Laws of AI Alignment: Why the Gap Always Wins

Paper overview: Murphys Laws of AI Alignment: Why the Gap Always Wins

arxiv url: http://arxiv.org/abs/2509.05381v1
Date: Thu, 04 Sep 2025 23:03:25 GMT
Status: Translation complete
Last updated in system: 2025-09-09 14:07:03.483078
Title: Murphys Laws of AI Alignment: Why the Gap Always Wins
Title (reference translation): Murphys Laws of AI Alignment: Why the Gap Always Wins
Authors: Madhava Gaikwad
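Abstract summary: Large language models are aligned to human preferences through reinforcement learning from human feedback. While effective, these methods exhibit recurring failure patterns, namely reward hacking, sycophancy, annotator drift, and misgeneralization. This paper introduces the concept of the Alignment Gap, a unifying lens for understanding recurring failures in feedback-based alignment.
Reference score (independently computed attention score): 0.0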
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are increasingly aligned to human preferences through reinforcement learning from human feedback (RLHF) and related methods such as Direct Preference Optimization (DPO), Constitutional AI, and RLAIF. While effective, these methods exhibit recurring failure patterns, i.e., reward hacking, sycophancy, annotator drift, and misgeneralization. We introduce the concept of the Alignment Gap, a unifying lens for understanding recurring failures in feedback-based alignment. Using a KL-tilting formalism, we illustrate why optimization pressure tends to amplify divergence between proxy rewards and true human intent. We organize these failures into a catalogue of Murphys Laws of AI Alignment, and propose the Alignment Trilemma as a way to frame trade-offs among optimization strength, value capture, and generalization. Small-scale empirical studies serve as illustrative support. Finally, we propose the MAPS framework (Misspecification, Annotation, Pressure, Shift) as practical design levers. Our contribution is not a definitive impossibility theorem but a perspective that reframes alignment debates around structural limits and trade-offs, offering clearer guidance for future design.
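Abstract (reference translation): Large language models are increasingly being aligned to human preferences through reinforcement learning from human feedback (RLHF) and related methods such as Direct Preference Optimization (DPO), Constitutional AI, and RLAIF. While effective, these methods exhibit recurring failure patterns, namely reward hacking, sycophancy, annotator drift, and misgeneralization. This paper introduces the concept of the Alignment Gap, a unifying lens for understanding recurring failures in feedback-based alignment. Using a KL-tilting formalism, it illustrates why optimization pressure tends to amplify the divergence between proxy rewards and true human intent. The failures are organized into a catalogue of Murphys Laws of AI Alignment, and the Alignment Trilemma is proposed as a way to frame trade-offs among optimization strength, value capture, and generalization. Small-scale empirical studies serve as illustrative support. Finally, the MAPS framework (Misspecification, Annotation, Pressure, Shift) is proposed as a set of practical design levers. The contribution is not a definitive impossibility theorem but a perspective that reframes alignment debates around structural limits and trade-offs, offering clearer guidance for future design.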
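
The abstract's claim that optimization pressure amplifies the divergence between proxy reward and true intent can be illustrated with a toy calculation. The sketch below is not the paper's formalism, which this page does not reproduce. It assumes the standard exponential (KL) tilting of a base policy, pi_beta(y) proportional to pi_0(y) * exp(beta * r(y)), and a hypothetical gap defined as expected proxy reward minus expected true utility. The three outcomes, their rewards, and the names `kl_tilt`, `expected`, and `true_utility` are all made up for illustration.

```python
import math

def kl_tilt(base, proxy, beta):
    """Tilt `base` toward high proxy reward: pi(y) ∝ base(y) * exp(beta * proxy(y))."""
    weights = [p * math.exp(beta * r) for p, r in zip(base, proxy)]
    total = sum(weights)
    return [w / total for w in weights]

def expected(policy, values):
    """Expectation of `values` under the discrete distribution `policy`."""
    return sum(p * v for p, v in zip(policy, values))

# Hypothetical toy setup: three outcomes under a uniform base policy.
base = [1 / 3, 1 / 3, 1 / 3]
proxy = [0.0, 1.0, 2.0]          # what the reward model scores
true_utility = [0.0, 1.0, -1.0]  # what people actually want (assumed)

betas = [0.0, 0.5, 1.0, 2.0, 4.0]  # increasing optimization pressure
gaps = []
for beta in betas:
    policy = kl_tilt(base, proxy, beta)
    # Assumed gap definition: E[proxy] - E[true utility] under the tilted policy.
    gap = expected(policy, proxy) - expected(policy, true_utility)
    gaps.append(gap)
    print(f"beta={beta:.1f}  gap={gap:.3f}")
```

In this toy setup the proxy and the true utility disagree only on the third outcome, so the gap equals three times the probability the tilted policy assigns to it, and that probability grows as beta increases.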

