Fugu-MT Paper Translation (Abstract): DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

Paper overview: DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

arxiv url: http://arxiv.org/abs/2605.10863v1
Date: Mon, 11 May 2026 17:10:44 GMT
Status: Translation complete
Last updated in system: 2026-05-13 02:24:05.584059
Title: DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization
Authors: Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu, Yulan Yuan, Zhijiang Guo, Wei Wang
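Abstract summary: This paper proposes DGPO, a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment. The constructed reverse data yields a 3.2% average improvement across five benchmarks, and DGPO delivers consistent gains across multiple datasets and model families, with average accuracy improvements of up to 3.6%.
Reference score (independently computed attention score): 17.28534525169732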
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although Large Language Models (LLMs) have made remarkable progress, current preference optimization methods still struggle to align directional consistency while preserving reasoning diversity. To address this limitation, we propose Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment through multi-candidate comparisons. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives. This group-wise formulation captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning pathways. Empirical results show that our constructed reverse data yields a 3.2% average improvement across five benchmarks, while DGPO further delivers consistent gains across multiple datasets and model families, achieving average accuracy improvements of up to 3.6%.
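The abstract describes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives within a group, but it does not give the formula. The sketch below is one plausible shape for such an objective, not the paper's actual loss: the function name `groupwise_margin_loss`, the parameters `margin` and `beta`, and the log-sum-exp form are all assumptions made for illustration. Each group is assumed to hold sequence log-likelihoods for candidates drawn from the same forward and reverse question-answer set.

```python
import math
from typing import Sequence


def groupwise_margin_loss(
    coherent_logps: Sequence[float],
    inconsistent_logps: Sequence[float],
    margin: float = 1.0,
    beta: float = 1.0,
) -> float:
    """Hypothetical group-wise margin loss (illustrative; not the paper's formula).

    For each coherent candidate, every inconsistent candidate in the group
    contributes to a single log-sum-exp term:
        log(1 + sum_neg exp(beta * (margin + logp_neg - logp_pos)))
    The term approaches 0 once the coherent candidate's log-likelihood exceeds
    all inconsistent ones by more than `margin`.
    """
    if not coherent_logps or not inconsistent_logps:
        raise ValueError("each group needs at least one candidate of each kind")

    total = 0.0
    for lp_pos in coherent_logps:
        # The leading 0.0 plays the role of the "1 +" inside the logarithm.
        terms = [0.0] + [beta * (margin + lp_neg - lp_pos) for lp_neg in inconsistent_logps]
        peak = max(terms)  # subtract the max for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(coherent_logps)


# Usage: a well-separated group scores lower than a badly separated one.
well_separated = groupwise_margin_loss([-1.0, -1.5], [-9.0, -8.0])
badly_separated = groupwise_margin_loss([-9.0, -8.0], [-1.0, -1.5])
print(well_separated < badly_separated)
```

Unlike a pairwise objective, which scores one preferred and one rejected answer at a time, this form lets all inconsistent candidates in the group compete in the same term, so the hardest alternative dominates the penalty. Whether DGPO uses this aggregation is not stated in the abstract.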

Related papers