Fugu-MT Paper Translation (Abstract): Mind the Gap: Structure-Aware Consistency in Preference Learning

Paper abstract: Mind the Gap: Structure-Aware Consistency in Preference Learning

arxiv url: http://arxiv.org/abs/2604.27733v1
Date: Thu, 30 Apr 2026 11:24:04 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:54.064174
Title: Mind the Gap: Structure-Aware Consistency in Preference Learning
Authors: Mehryar Mohri, Yutao Zhong
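Abstract summary: Preference learning has become the foundation for aligning large language models with human intent. The authors show that, for the equicontinuous hypothesis sets typical of neural networks, standard surrogates are theoretically inconsistent. They derive rigorous $H$-consistency bounds that depend on enforcing a separation margin. They extend this to Structure-Aware $H$-consistency, introducing a novel objective (SA-DPO) that adapts the margin based on the semantic distance between responses to handle synonyms and hard pairs.
Reference score (independently computed attention score): 42.67092904252001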
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Preference learning has become the foundation of aligning Large Language Models (LLMs) with human intent. Popular methods, such as Direct Preference Optimization (DPO), minimize surrogate losses as proxies for the intractable pairwise ranking loss. However, we demonstrate that for the equicontinuous hypothesis sets typical of neural networks, these standard surrogates are theoretically inconsistent, yielding vacuous generalization guarantees. To resolve this, we formulate LLM alignment within a margin-shifted ranking framework. We derive rigorous $H$-consistency bounds that depend on enforcing a separation margin $γ$. Crucially, we extend this to Structure-Aware $H$-consistency, introducing a novel objective (SA-DPO) that adapts the margin based on the semantic distance between responses to handle synonyms and hard pairs. Finally, we analyze the trade-off between consistency and model limitations via the Margin-Capacity Profile, proving that heavy-tailed surrogates (such as the Polynomial Hinge family) offer superior consistency guarantees for capacity-bounded models compared to the standard logistic loss used in DPO.
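The margin-shifted idea described in the abstract can be illustrated with a small sketch. The abstract does not give the exact SA-DPO objective, so everything below is an assumption made for illustration: the function names, the linear rule mapping semantic distance to a margin, and the parameter `gamma_max` are hypothetical and are not taken from the paper. Only the DPO reward gap and the logistic surrogate follow the standard DPO formulation.

```python
import math


def dpo_gap(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO implicit reward gap between the preferred (w) and
    rejected (l) responses, using policy and reference log-probabilities."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def margin_shifted_logistic(gap, gamma=0.0):
    """Logistic surrogate shifted by a margin gamma:
    log(1 + exp(-(gap - gamma))). With gamma = 0 this is the DPO loss."""
    return softplus(gamma - gap)


def structure_aware_margin(distance, gamma_max=1.0):
    """ASSUMPTION (not from the paper): the margin grows linearly with a
    semantic distance clipped to [0, 1], so near-synonymous responses
    (distance close to 0) are not forced apart."""
    d = min(max(distance, 0.0), 1.0)
    return gamma_max * d


def sa_dpo_style_loss(gap, distance, gamma_max=1.0):
    """Hypothetical structure-aware loss: a logistic surrogate whose
    margin depends on the semantic distance between the two responses."""
    return margin_shifted_logistic(gap, structure_aware_margin(distance, gamma_max))
```

Under these assumptions, a pair of near-synonyms (distance 0) is scored exactly as in plain DPO, while a semantically distant pair with the same reward gap incurs a larger loss until the gap exceeds the larger margin.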


Related papers