Fugu-MT paper translation (abstract): Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

Paper overview: Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

arxiv url: http://arxiv.org/abs/2605.08037v1
Date: Fri, 08 May 2026 17:26:14 GMT
Status: translation complete
Last updated in system: 2026-05-11 19:43:39.24101
Title: Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
Title (reference translation): Beyond pairs: your language model is secretly optimizing a preference graph
Authors: Ning Liu, Chuanneng Sun, Kristina Klinkner, Shervin Malmasi
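Abstract summary: The paper proposes a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured, Plackett-Luce-inspired objective. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.
Reference score (independently computed attention measure): 17.030746750590758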
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett-Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.
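Abstract (reference translation): Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons and offers a simple, effective alternative to reinforcement learning (RL) from human feedback. In many practical settings, however, the training data consists of multiple rollouts per prompt, which induces a rich preference structure that pairwise DPO cannot exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. The authors propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured, Plackett-Luce-inspired objective. To handle discrete or sparse signals, it introduces an equivalence-class construction in which responses with identical preferences form graph layers and intra-layer edges contribute zero loss, which prevents spurious gradients. Although it uses the full graph structure, GraphDPO keeps per-prompt complexity linear through efficient log-sum-exp aggregation. It also supports optional ground-truth anchoring: verified solutions are inserted as dominant nodes, and an annealed schedule stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.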
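
The abstract does not state the objective itself, so the following is only a minimal sketch of one plausible reading of the layered, Plackett-Luce-style loss it describes, not the paper's actual formulation. The function name `layered_pl_loss`, the `scores` input (standing in for DPO-style implicit rewards such as a scaled log-ratio between policy and reference) and the integer `layers` encoding (0 = best, ties share a layer) are all assumptions made for illustration. Each response is compared against the log-sum-exp of every response in a strictly worse layer, and that running aggregate is updated only after a layer has been scored, so intra-layer pairs add no loss.

```python
import math
from collections import defaultdict


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)), accepting -inf as 'empty'."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def layered_pl_loss(scores, layers):
    """Hypothetical layered Plackett-Luce-style loss for one prompt.

    scores[i]: implicit reward of response i (assumed input).
    layers[i]: integer layer of response i, 0 = best; ties share a layer.
    """
    by_layer = defaultdict(list)
    for score, layer in zip(scores, layers):
        by_layer[layer].append(score)

    loss = 0.0
    below = -math.inf  # log-sum-exp of all scores in strictly worse layers
    for layer in sorted(by_layer, reverse=True):  # walk from worst to best
        members = by_layer[layer]
        if below != -math.inf:
            for s in members:
                # -log( exp(s) / (exp(s) + sum of exp over dominated responses) )
                loss += logaddexp(s, below) - s
        # Fold this layer in only after scoring it, so ties contribute zero loss.
        for s in members:
            below = logaddexp(below, s)
    return loss
```

Under these assumptions, a prompt with one preferred and one dispreferred response reduces to the familiar pairwise form `-log(sigmoid(s_w - s_l))`, and a prompt whose responses all share a layer yields zero loss. Each response is touched a constant number of times; the only extra cost is sorting the distinct layer indices.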

Related papers