Fugu-MT Paper Translation (Abstract): Murphys Laws of AI Alignment: Why the Gap Always Wins

Paper overview: Murphys Laws of AI Alignment: Why the Gap Always Wins

arxiv url: http://arxiv.org/abs/2509.05381v3
Date: Mon, 15 Sep 2025 06:39:32 GMT
Status: Translation complete
Last updated in system: 2025-09-16 13:19:47.943402
Title: Murphys Laws of AI Alignment: Why the Gap Always Wins
Title (reference translation): Murphy's Laws of AI Alignment: Why the Gap Always Wins
Authors: Madhava Gaikwad
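Abstract summary: We study reinforcement learning from human feedback under misspecification. When feedback is biased on a fraction alpha of contexts with bias strength epsilon, any learning algorithm needs exponentially many samples, exp(n*alpha*epsilon^2), to distinguish between two possible "true" reward functions.
Reference score (independently computed attention score): 0.0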
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study reinforcement learning from human feedback under misspecification. Sometimes human feedback is systematically wrong on certain types of inputs, like a broken compass that points the wrong way in specific regions. We prove that when feedback is biased on a fraction alpha of contexts with bias strength epsilon, any learning algorithm needs exponentially many samples exp(n*alpha*epsilon^2) to distinguish between two possible "true" reward functions that differ only on these problematic contexts. However, if you can identify where feedback is unreliable (a "calibration oracle"), you can focus your limited questions there and overcome the exponential barrier with just O(1/(alpha*epsilon^2)) queries. This quantifies why alignment is hard: rare edge cases with subtly biased feedback create an exponentially hard learning problem unless you know where to look. The gap between what we optimize (proxy from human feedback) and what we want (true objective) is fundamentally limited by how common the problematic contexts are (alpha), how wrong the feedback is there (epsilon), and how much the true objectives disagree there (gamma). Murphy's Law for AI alignment: the gap always wins unless you actively route around misspecification.
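Abstract (reference translation): We study reinforcement learning from human feedback under misspecification. Human feedback is sometimes systematically wrong on certain types of inputs, like a broken compass that points the wrong way in specific regions. When feedback is biased on a fraction alpha of contexts with bias strength epsilon, any learning algorithm needs exponentially many samples, exp(n*alpha*epsilon^2), to distinguish between two possible "true" reward functions that differ only on these problematic contexts. However, if the places where feedback is unreliable can be identified (a "calibration oracle"), the limited queries can be concentrated there, and the exponential barrier is overcome with only O(1/(alpha*epsilon^2)) queries. Rare edge cases with subtly biased feedback create an exponentially hard learning problem unless one knows where to look. The gap between what is optimized (a proxy from human feedback) and what is wanted (the true objective) is fundamentally limited by how common the problematic contexts are (alpha), how wrong the feedback is there (epsilon), and how much the true objectives disagree there (gamma). Murphy's Law for AI alignment: the gap always wins unless misspecification is actively routed around.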
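To get a feel for the two scalings quoted in the abstract, the following minimal Python sketch evaluates exp(n*alpha*epsilon^2) next to 1/(alpha*epsilon^2) for a few illustrative values. Several things here are assumptions rather than content from the paper: the abstract does not define n, so it is treated as a free size parameter; the constant hidden in the O(.) is replaced by a hypothetical factor c = 1; and the function names and the chosen values of alpha and epsilon are purely illustrative.

```python
import math


def passive_sample_scaling(n, alpha, epsilon):
    """exp(n * alpha * epsilon^2): the exponential scaling quoted in the abstract.

    n is not defined in the abstract; it is treated here as a free parameter.
    Constant factors are omitted.
    """
    return math.exp(n * alpha * epsilon ** 2)


def oracle_query_scaling(alpha, epsilon, c=1.0):
    """c / (alpha * epsilon^2): the query scaling with a calibration oracle.

    c is a hypothetical stand-in for the constant hidden in the O(.) notation.
    """
    return c / (alpha * epsilon ** 2)


if __name__ == "__main__":
    alpha, epsilon = 0.01, 0.1  # illustrative values, not taken from the paper
    print(f"with oracle: ~{oracle_query_scaling(alpha, epsilon):.3g} queries")
    for n in (10_000, 100_000, 1_000_000):
        samples = passive_sample_scaling(n, alpha, epsilon)
        print(f"n={n}: without oracle ~{samples:.3g} samples")
```

The oracle-side figure does not depend on n at all, while the passive figure grows exponentially in n once n*alpha*epsilon^2 exceeds a few units; that contrast is the point of the comparison, not the specific numbers.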
