Fugu-MT Paper Translation (Abstract): What Matters in Data for DPO?

Paper overview: What Matters in Data for DPO?

arxiv url: http://arxiv.org/abs/2508.18312v1
Date: Sat, 23 Aug 2025 16:00:30 GMT
Status: Translation complete
Last updated in system: 2025-08-27 17:42:38.515173
Title: What Matters in Data for DPO?
Title (Japanese reference translation): DPOのデータで何が重要か?
Authors: Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, Chonghuan Wang,
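Abstract summary: Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models with human preferences. This work systematically examines how the preference data distribution influences DPO, from both theoretical and empirical perspectives. It shows that the quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses may have relatively limited impact.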
Reference score (site-computed attention metric): 6.208229499655634
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental question remains open: what characteristics of preference data are most critical for DPO performance? In this work, we provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives. We show that the quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses may have relatively limited impact. Our theoretical analysis characterizes the optimal response distribution under DPO and reveals how contrastiveness between responses helps primarily by improving the chosen samples. We further study an online DPO setting and show it effectively reduces to supervised fine-tuning on the chosen responses. Extensive experiments across diverse tasks confirm our findings: improving the quality of chosen responses consistently boosts performance regardless of the quality of the rejected responses. We also investigate the benefit of mixing the on-policy data. Our results interpret the mechanism behind some widely adopted strategies and offer practical insights for constructing high-impact preference datasets for LLM alignment.
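For context, the "DPO objective" mentioned in the abstract is commonly written, for a single preference pair, as the negative log-sigmoid of a scaled difference between the chosen and rejected log-probability ratios (policy versus reference model). The sketch below assumes that standard formulation and is not taken from this paper's analysis; the function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss (illustrative sketch, hypothetical names).

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model. beta=0.1 is an assumed default.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the chosen and rejected log-ratios.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both responses, the margin is zero and the loss equals log 2; raising the policy's log-probability of the chosen response (or lowering that of the rejected one) reduces the loss.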
