Fugu-MT paper translation (abstract): Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences

Paper overview: Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences

arxiv url: http://arxiv.org/abs/2602.06788v1
Date: Fri, 06 Feb 2026 15:45:37 GMT
Status: translation complete
Last updated in system: 2026-02-09 22:18:26.453913
Title: Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences
Authors: Idan Pipano, Shoham Sabach, Kavosh Asadi, Mohammad Ghavamzadeh
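Abstract summary: DPO and related algorithms align language models by directly optimizing the RLHF objective. This paper characterizes the DPO-inducing property for the RLHF problem. It then focuses on a specific DPO-inducing and displacement-resistant $f$, which leads to the novel SquaredPO loss.
Reference score (attention measure computed independently by the site): 23.894803166231792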
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach could be further generalized: the original problem remains tractable even if the KL divergence is replaced by an $f$-divergence with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of the winner and loser responses approach zero. We refer to any $f$ that satisfies this condition as displacement-resistant. We finally focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
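
As a rough illustration of the probability displacement described in the abstract, the sketch below evaluates the standard DPO loss (the widely used KL-based form, not the SquaredPO loss, whose exact form is not given on this page) for a single preference pair. The function name `dpo_loss`, the value `beta=0.1`, and all log-probability values are assumptions chosen for illustration. It shows that the loss depends only on the margin between the winner's and loser's log-ratios, so the loss alone does not stop both probabilities from shrinking.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair.

    All arguments are log-probabilities; `beta` is the regularization
    strength (0.1 is an illustrative value, not taken from the paper).
    """
    # Scaled difference of log-ratios against the reference policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical numbers: the reference assigns log-prob -2.0 to both responses.
ref_w, ref_l = -2.0, -2.0

# Policy A raises the winner and lowers the loser.
loss_a = dpo_loss(-1.0, -3.0, ref_w, ref_l)

# Policy B lowers BOTH responses by 5 nats but keeps the same margin.
loss_b = dpo_loss(-6.0, -8.0, ref_w, ref_l)

# Policy C lowers the winner and lowers the loser even more.
loss_c = dpo_loss(-6.0, -12.0, ref_w, ref_l)

print(loss_a, loss_b, loss_c)
```

Under these assumed values, policies A and B receive the same loss even though B assigns far less probability to the winner, and policy C receives a lower loss than A. This is the kind of behavior that a displacement-resistant $f$ is meant to rule out.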

Related papers