Fugu-MT Paper Translation (Abstract): A Markov Chain Approach to Preference Alignment

Paper abstract: A Markov Chain Approach to Preference Alignment

arxiv url: http://arxiv.org/abs/2606.22652v1
Date: Sun, 21 Jun 2026 19:56:56 GMT
Status: Translation complete
Last updated in system: 2026-06-25 17:10:55.784833
Title: A Markov Chain Approach to Preference Alignment
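Title (reference translation): A Markov Chain Approach to Preference Alignment
Authors: Takuya Koriyama, Tengyuan Liang
Abstract summary: MCHF uses pairwise preferences directly to define a transition mechanism over model outputs. The paper shows that MCHF converges geometrically to the stationary distribution, and that a mirror-descent algorithm for NLHF satisfies an analogous structure-adaptive convergence guarantee.
Reference score (independently computed attention measure): 5.822529963339041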
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose Markov Chain from Human Feedback (MCHF), an elementary approach for aligning generative models from pairwise human preferences. Unlike Reinforcement Learning from Human Feedback (RLHF), which reduces comparisons to a scalar reward, and Nash Learning from Human Feedback (NLHF), which preserves pairwise utilities through a KL-regularized minimax optimization, MCHF uses pairwise preferences directly to define a transition mechanism over model outputs. Given a pairwise utility $U(x,y)$, which quantifies human preference for $y$ over $x$, and a reference probability distribution $μ_{\mathsf{ref}}$, we define a Markov kernel $\mathsf{P}(x, dy)\propto \exp(U(x,y))μ_{\mathsf{ref}}(dy)$, and take the Markov chain starting from $μ_{\mathsf{ref}}$ as an iterative alignment procedure. We show that MCHF converges geometrically fast to the stationary distribution, with a convergence rate governed by the seminorm $\|U\|_\oplus=\inf_{g,f\in L^\infty(μ_{\mathsf{ref}})}\|U-g\oplus f\|_\infty$, which quantifies the non-transitive structure of the pairwise utility. We further show that a mirror-descent algorithm for NLHF satisfies an analogous structure-adaptive convergence guarantee. Finally, through a perturbation analysis, we prove that when $\|U\|_\oplus$ is small, MCHF and NLHF agree up to first order around an RLHF solution, which yields a unified view of reward-based, game-theoretic, and Markovian approaches to alignment. In particular, for two natural algorithms that converge to the MCHF/NLHF equilibria, we show that the first step of MCHF and NLHF recovers the RLHF solution based on the column-sum reward $\hat{f}(y)=\int μ_{\mathsf{ref}}(dx) U(x, y)$, and starting from the second iteration, both algorithms incorporate the same linear functional of the residual $U-(-\hat f)\oplus \hat f$, which captures the non-transitive structure of the pairwise utility $U$.
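Abstract (reference translation): We propose Markov Chain from Human Feedback (MCHF). Whereas RLHF (Reinforcement Learning from Human Feedback) reduces comparisons to a scalar reward, and NLHF (Nash Learning from Human Feedback) preserves pairwise utilities through a KL-regularized minimax optimization, MCHF uses pairwise preferences directly to define a transition mechanism over model outputs. Given a pairwise utility $U(x,y)$, which quantifies human preference for $y$ over $x$, and a reference probability distribution $μ_{\mathsf{ref}}$, we define a Markov kernel $\mathsf{P}(x, dy)\propto \exp(U(x,y))μ_{\mathsf{ref}}(dy)$ and take the Markov chain starting from $μ_{\mathsf{ref}}$. We show that MCHF converges geometrically fast to the stationary distribution, with a convergence rate governed by the seminorm $\|U\|_\oplus=\inf_{g,f\in L^\infty(μ_{\mathsf{ref}})}\|U-g\oplus f\|_\infty$. We further show that a mirror-descent algorithm for NLHF satisfies an analogous structure-adaptive convergence guarantee. Finally, a perturbation analysis shows that when $\|U\|_\oplus$ is small, MCHF and NLHF agree up to first order around an RLHF solution, which yields a unified view of reward-based, game-theoretic, and Markovian approaches. In particular, for two natural algorithms that converge to the MCHF/NLHF equilibria, the first step of MCHF and NLHF recovers the RLHF solution based on the column-sum reward $\hat{f}(y)=\int μ_{\mathsf{ref}}(dx) U(x, y)$, and from the second iteration onward both algorithms incorporate the same linear functional of the residual $U-(-\hat f)\oplus \hat f$, which captures the non-transitive structure of the pairwise utility $U$.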
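The kernel and iteration described in the abstract can be illustrated on a finite output space. The following is a minimal sketch, not the authors' implementation. It assumes a finite set of n outputs, a strictly positive `mu_ref` vector, and a utility matrix `U[x, y]`; the function names `mchf_kernel`, `mchf_iterate`, and `column_sum_reward` are hypothetical and introduced only for illustration.

```python
import numpy as np

def mchf_kernel(U, mu_ref):
    """Row-stochastic matrix with P[x, y] proportional to exp(U[x, y]) * mu_ref[y].

    Assumes mu_ref is strictly positive (finite-space illustration only).
    """
    logits = U + np.log(mu_ref)[None, :]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)  # normalise each row x

def mchf_iterate(U, mu_ref, n_steps):
    """Run the chain from mu_ref; returns [mu_0, mu_1, ..., mu_n_steps]."""
    P = mchf_kernel(U, mu_ref)
    mu = np.asarray(mu_ref, dtype=float)
    history = [mu]
    for _ in range(n_steps):
        mu = mu @ P  # one step of the chain on distributions
        history.append(mu)
    return history

def column_sum_reward(U, mu_ref):
    """Discrete analogue of f_hat(y) = sum_x mu_ref[x] * U[x, y]."""
    return mu_ref @ U

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5
    mu_ref = np.full(n, 1.0 / n)
    U = rng.uniform(-1.0, 1.0, size=(n, n))  # arbitrary illustrative utility
    history = mchf_iterate(U, mu_ref, 50)
    # Total-variation gap between successive iterates, to watch convergence.
    gaps = [0.5 * np.abs(b - a).sum() for a, b in zip(history, history[1:])]
    print("first gaps:", gaps[:5])
    print("f_hat:", column_sum_reward(U, mu_ref))
```

One sanity check that follows directly from the kernel definition: if the utility is purely additive, `U[x, y] = g[x] + f[y]`, every row of the kernel equals the distribution proportional to `mu_ref[y] * exp(f[y])`, so the chain reaches its stationary distribution after a single step.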
