Fugu-MT paper translation (abstract): AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

Paper abstract: AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

arxiv url: http://arxiv.org/abs/2605.28440v1
Date: Wed, 27 May 2026 13:05:49 GMT
Status: translation complete
Last updated in system: 2026-05-28 17:38:56.064726
Title: AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates
Authors: Shaolong Chen, Madalina Ciobanu, Qingqing Mao, Ritankar Das
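Abstract summary: This paper proposes a self-adaptive variant of the DPO algorithm. AdaDPO is constructed to enforce equality of gradient magnitudes between preferred and dispreferred probabilities. Because it operates purely at the loss level, AdaDPO can be dropped into existing preference-based alignment pipelines.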
Reference score (attention score computed by this site): 0.03999851878220877
License: http://creativecommons.org/licenses/by/4.0/
Abstract: DPO has become a widely adopted alternative to RLHF for aligning LLMs with human preferences, eliminating the need for a separate reward model or RL loop. Recent theoretical analysis uncovers an asymmetric gradient behavior in DPO: the loss suppresses dispreferred responses substantially faster than it promotes preferred ones, causing the model to learn to avoid bad answers rather than to generate good ones. We propose AdaDPO, a self-adaptive variant of the DPO algorithm that introduces per-preference-pair, stop-gradient-based coefficients derived directly from the policy model's generation probabilities, with the reference model's probabilities as an optional component. AdaDPO is constructed to enforce equality of gradient magnitudes between preferred and dispreferred probabilities; the practical implementation balances per-token gradients and applies a numerical clipping bound for stability, while retaining DPO's original hyperparameter structure. On Llama-3-8B-Instruct trained on UltraFeedback under a SimPO-like setup, AdaDPO consistently outperforms DPO on AlpacaEval 2: it achieves higher length-controlled win rates (LC) in 81% of hyperparameter combinations, attains the global best LC (48.3%) and raw win rate (WR, 46.1%), and enlarges the LC-over-WR margin in 88% of combinations, indicating effective mitigation of length bias. Additional analyses of KL divergence, reward margin, and reward accuracy confirm that AdaDPO rectifies the gradient imbalance and yields more efficient optimization. Because it operates purely at the loss level, AdaDPO can be dropped into existing preference-based alignment pipelines without changing data collection or model architectures. The method requires only a few lines of code, and the same self-adaptive principle generalizes to a broad family of pairwise contrastive preference losses including SimPO, R-DPO, IPO, CPO, and ORPO.
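
The gradient asymmetry mentioned in the abstract can be illustrated on the standard DPO loss for a single preference pair. The sketch below is not the AdaDPO method (the abstract does not give its coefficient formula); it only computes the well-known DPO loss and its analytic gradients with respect to the preferred and dispreferred probabilities. The function names and the toy probabilities are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, given sequence probabilities.

    pi_w / pi_l: policy probability of the preferred / dispreferred response.
    ref_w / ref_l: reference-model probabilities of the same responses.
    """
    u = beta * (math.log(pi_w / ref_w) - math.log(pi_l / ref_l))
    return -math.log(sigmoid(u))


def dpo_grads_wrt_probs(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Analytic gradients of the DPO loss w.r.t. pi_w and pi_l.

    With u as above, dL/du = -sigmoid(-u), du/dpi_w = beta / pi_w and
    du/dpi_l = -beta / pi_l, so the two gradients share a common factor
    and differ only by 1/pi_w versus 1/pi_l (and by sign).
    """
    u = beta * (math.log(pi_w / ref_w) - math.log(pi_l / ref_l))
    common = beta * sigmoid(-u)
    grad_w = -common / pi_w  # negative: gradient descent raises pi_w
    grad_l = common / pi_l   # positive: gradient descent lowers pi_l
    return grad_w, grad_l


if __name__ == "__main__":
    # Toy values (illustrative only; real sequence probabilities are far smaller).
    pi_w, pi_l, ref_w, ref_l = 0.4, 0.1, 0.25, 0.25
    g_w, g_l = dpo_grads_wrt_probs(pi_w, pi_l, ref_w, ref_l)
    print("loss:", dpo_loss(pi_w, pi_l, ref_w, ref_l))
    print("grad wrt preferred prob:", g_w)
    print("grad wrt dispreferred prob:", g_l)
    # The magnitude ratio equals pi_w / pi_l, independent of beta and the reference.
    print("|grad_l| / |grad_w|:", abs(g_l) / abs(g_w))
```

In this toy setting the magnitude ratio equals pi_w / pi_l, so whenever the dispreferred response is less likely than the preferred one, its probability receives the larger gradient. This is the kind of imbalance that, according to the abstract, AdaDPO's per-pair stop-gradient coefficients are designed to equalize.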

Related papers