Fugu-MT Paper Translation (Abstract): Theoretical Limits of Language Model Alignment

Paper summary: Theoretical Limits of Language Model Alignment

arxiv url: http://arxiv.org/abs/2605.07105v1
Date: Fri, 08 May 2026 01:32:22 GMT
Status: Translation complete
System update date: 2026-05-11 19:43:38.720627
Title: Theoretical Limits of Language Model Alignment
Title (reference translation): Theoretical Limits of Language Model Alignment
Authors: Lucas Monteiro Paes, Natalie Mackraz, Barry-John Theobald, Federico Danieli,
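Abstract summary: Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. One of the most common alignment approaches is reinforcement learning, which maximizes the expected reward under a KL-divergence constraint. The paper characterizes the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget.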
Reference score (independently computed attention score): 9.45142272392467
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of reward error and when the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two tasks for LMs, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
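Abstract (reference translation): Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Although both are widely used, the fundamental limits of reward improvement under a KL budget are not well understood. The paper characterizes the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. The first result gives a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. This expression is then reformulated as a covariance under the base model, which yields a practical estimator that predicts achievable alignment gains from base model samples alone. The analysis is extended to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows as the reward error increases and as the KL penalty factor decreases. The paper then proves that reward ensembling mitigates reward hacking, giving a theoretical justification for a technique already used in practice. Empirically, the KL-reward Pareto frontier is computed for two LM tasks, safety and summarization; best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. The theoretical results shed light on several phenomena observed empirically in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.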
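
The link between KL-regularized alignment and a Jeffreys divergence can be illustrated with a standard identity for exponentially tilted distributions. The sketch below is not taken from the paper: it assumes a toy discrete output space, made-up base probabilities and rewards, and hypothetical function names. It uses the well-known fact that the maximizer of expected reward minus `beta` times the KL divergence to the base model is `pi*(y) ∝ pi0(y) * exp(r(y) / beta)`, and checks numerically that the resulting reward gain equals `beta` times the Jeffreys (symmetrized KL) divergence between `pi*` and `pi0`. Whether this identity coincides with the paper's exact closed-form expression is an assumption.

```python
import math


def tilted_policy(base_probs, rewards, beta):
    """Exponentially tilted policy pi*(y) proportional to pi0(y) * exp(r(y) / beta)."""
    # Subtract the maximum reward before exponentiating for numerical stability.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected(probs, values):
    """Expectation of `values` under `probs`."""
    return sum(p * v for p, v in zip(probs, values))


def reward_gain_and_jeffreys(base_probs, rewards, beta):
    """Return (reward gain, Jeffreys divergence, KL(pi* || pi0)) for the tilted policy."""
    pi_star = tilted_policy(base_probs, rewards, beta)
    gain = expected(pi_star, rewards) - expected(base_probs, rewards)
    kl_forward = kl(pi_star, base_probs)
    jeffreys = kl_forward + kl(base_probs, pi_star)
    return gain, jeffreys, kl_forward


# Toy example with made-up numbers (illustrative only, not from the paper).
base = [0.5, 0.3, 0.15, 0.05]
rew = [0.0, 1.0, 2.0, 3.0]
for beta in (2.0, 1.0, 0.5):
    gain, jeff, kl_fwd = reward_gain_and_jeffreys(base, rew, beta)
    print(f"beta={beta}: KL={kl_fwd:.4f}  gain={gain:.4f}  beta*Jeffreys={beta * jeff:.4f}")
```

In this toy setting, the `gain` and `beta*Jeffreys` columns agree up to floating-point error for every `beta`, and lowering `beta` spends more KL budget in exchange for a larger gain.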

Related papers