Fugu-MT Paper Translation (Summary): DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Paper summary: DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

arxiv url: http://arxiv.org/abs/2510.02341v1
Date: Sat, 27 Sep 2025 03:06:27 GMT
Status: Translation complete
Last updated in system: 2025-10-06 16:35:52.018109
Title: DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning
Title (reference translation): DRIFT: Learning from User Dissatisfaction in Real-World Preference Learning
Authors: Yifan Wang, Bolian Li, Junlin Wu, Zhaoxuan Tan, Zheli Liu, Ruqi Zhang, Ananth Grama, Qingkai Zeng
Abstract summary: We introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training). DRIFT models trained on the real-world WildFeedback dataset achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +…
Reference score (independently computed attention measure): 43.698788115019376
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on real-world WildFeedback datasets and synthetic UltraFeedback datasets achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.
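Abstract (reference translation): Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile. In this paper, we introduce Dissatisfaction-Refined Iterative preFerence Training (DRIFT), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on the real-world WildFeedback dataset and the synthetic UltraFeedback dataset reach up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2, outperforming strong baseline methods such as iterative DPO and SPIN. 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results suggest that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.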
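
Illustrative sketch (not from the paper): the abstract describes anchoring training on logged DSAT responses and sampling positives from the evolving policy. The abstract does not state the training objective. The snippet below assumes a DPO-style pairwise loss, chosen only because iterative DPO is named as a baseline. All names here (`PreferencePair`, `build_pairs`, `dpo_loss`, `sample_fn`, `beta`) are hypothetical and are not taken from the DRIFT repository.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # freshly sampled from the current policy (assumed positive)
    rejected: str  # logged response that drew user dissatisfaction (DSAT)


def build_pairs(
    dsat_log: List[Tuple[str, str]],
    sample_fn: Callable[[str], str],
) -> List[PreferencePair]:
    """Pair each logged DSAT response with a new sample from the policy.

    dsat_log holds (prompt, dissatisfying_response) tuples; sample_fn is a
    stand-in for generation with the current policy. Re-running this at each
    iteration is what makes the positives track the evolving policy.
    """
    return [PreferencePair(p, sample_fn(p), bad) for p, bad in dsat_log]


def dpo_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,
) -> float:
    """Assumed DPO-style loss on one pair, from sequence log-probabilities.

    Computes -log(sigmoid(margin)) in a numerically stable form, where
    margin = beta * ((chosen log-ratio) - (rejected log-ratio)).
    """
    margin = beta * (
        (policy_chosen_lp - ref_chosen_lp)
        - (policy_rejected_lp - ref_rejected_lp)
    )
    # softplus(-margin) without overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch the rejected side is fixed by real user feedback while the chosen side is regenerated every round; whether DRIFT filters or scores the sampled positives is not specified in the abstract and is left out here.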
