Fugu-MT Paper Translation (Abstract): Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Paper abstract: Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

arxiv url: http://arxiv.org/abs/2605.20834v1
Date: Wed, 20 May 2026 07:26:22 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.552657
Title: Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
Title (reference translation): Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
Authors: Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo
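Abstract summary: Direct Preference Optimization (DPO) has emerged as an alternative to Reinforcement Learning from Human Feedback (RLHF). When the implicit assumption behind their equivalence fails, DPO and RLHF optimize fundamentally different objectives. The paper introduces Constrained Preference Optimization (CPO), which augments RLHF with constraints. Its theoretical analysis establishes when DPO's guarantees hold and provides solutions that preserve simplicity with provable alignment.
Reference score (independently computed attention score): 51.18269946911088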
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.
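Abstract (reference translation): Direct Preference Optimization (DPO) has emerged as an alternative to Reinforcement Learning from Human Feedback (RLHF), offering a simpler implementation with theoretical equivalence. We prove that this equivalence is conditional rather than universal and depends on an implicit assumption that is often violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences. We characterize when the assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), which augments RLHF with constraints. We further give a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions that preserve simplicity with provable alignment. Comprehensive experiments on standard benchmarks show that CPO achieves state-of-the-art performance. Code is available at https://github.com/visitworld123/CPO.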
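
The failure mode described in the abstract (DPO loss decreasing while the policy still prefers the dispreferred response) can be illustrated with a small numeric sketch. This is not the authors' code: it uses the standard single-pair DPO loss, and the probabilities, the `beta` value, and the function name `dpo_loss` are hypothetical choices made only for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (y_w preferred over y_l)."""
    policy_margin = logp_w - logp_l          # log pi(y_w) - log pi(y_l)
    ref_margin = ref_logp_w - ref_logp_l     # same margin under the reference
    z = beta * (policy_margin - ref_margin)  # relative advantage over the reference
    return math.log1p(math.exp(-z))          # equals -log(sigmoid(z))

# Hypothetical reference policy that strongly prefers the dispreferred response y_l.
ref_logp_w, ref_logp_l = math.log(0.1), math.log(0.9)

# At initialization the policy equals the reference, so the loss is log 2.
start = dpo_loss(ref_logp_w, ref_logp_l, ref_logp_w, ref_logp_l)

# Hypothetical trained policy: better than the reference, yet still prefers y_l.
pol_logp_w, pol_logp_l = math.log(0.3), math.log(0.7)
after = dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l)

print(f"loss at init: {start:.4f}, loss after: {after:.4f}")
print("policy still prefers y_l:", pol_logp_l > pol_logp_w)
```

In this sketch the loss depends only on how far the policy margin exceeds the reference margin. One way to read the abstract's margin-ranking remark is that the reference margin acts as the ranking target, and here that target is negative, so the loss can fall without the policy ever ranking y_w above y_l.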

Related papers