Fugu-MT Paper Translation (Abstract): wDPO: Winsorized Direct Preference Optimization for Robust LLM Alignment

Paper overview: wDPO: Winsorized Direct Preference Optimization for Robust LLM Alignment

arxiv url: http://arxiv.org/abs/2603.07211v1
Date: Sat, 07 Mar 2026 13:30:53 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:14.091584
Title: wDPO: Winsorized Direct Preference Optimization for Robust LLM Alignment
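Title (reference translation): wDPO: Winsorized Direct Preference Optimization for Robust LLM Alignment
Authors: Jilong Liu, Yonghui Yang, Pengyang Shao, Haokai Ma, Wei Qin, Richang Hong
Abstract summary: In practice, preference data are often noisy. Existing robust variants of DPO mainly rely on uniform objective modifications or global reweighting. The paper shows that robust preference alignment benefits from addressing different noise types with targeted interventions.
Reference score (attention score computed independently by the site): 48.487557157323664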
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Direct Preference Optimization (DPO) aligns large language models by optimizing pairwise preferences and has shown remarkable effectiveness as a simple and scalable alternative to RLHF. However, in practice, preference data are often noisy. Existing robust variants of DPO mainly rely on uniform objective modifications or global reweighting. While partially effective, these methods treat noisy samples as a homogeneous source of uncertainty and fail to distinguish between different noise types, leading to sub-optimal alignment robustness. In this work, we show that robust preference alignment benefits from addressing different noise types with targeted interventions rather than uniform regularization. We propose winsorized Direct Preference Optimization (wDPO), a robust LLM alignment approach with hierarchical winsorization. Specifically, wDPO adopts a reward-free hierarchical intervention strategy that leverages only signals already available during DPO training. It first uses the implicit margin from DPO log-ratio to identify heterogeneous noise patterns without relying on external reward models. For hard noise, wDPO performs a data-level intervention by sparsely correcting strongly inconsistent preference pairs. For ambiguous comparisons, it applies a gradient-level intervention through soft winsorization, capping extreme losses in the high-loss tail to prevent weakly informative samples from dominating gradient updates. Extensive experiments on PKU-SafeRLHF and multiple external safety benchmarks demonstrate that wDPO consistently improves preference alignment quality and robustness over vanilla DPO and strong DPO-family baselines, with particularly pronounced gains under controlled label-flip noise.
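Abstract (reference translation): Direct Preference Optimization (DPO) aligns large language models by optimizing pairwise preferences and has proven remarkably effective as a simple, scalable alternative to RLHF. In practice, however, preference data are often noisy. Existing robust variants of DPO rely mainly on uniform objective modifications or global reweighting. Although partially effective, these methods treat noisy samples as a homogeneous source of uncertainty and cannot distinguish between noise types, which leads to sub-optimal alignment robustness. This work shows that robust preference alignment benefits from addressing different noise types with targeted interventions rather than uniform regularization. The authors propose winsorized Direct Preference Optimization (wDPO), a robust LLM alignment approach with hierarchical winsorization. Specifically, wDPO adopts a reward-free hierarchical intervention strategy that uses only signals already available during DPO training. It first uses the implicit margin of the DPO log-ratio to identify heterogeneous noise patterns without relying on an external reward model. For hard noise, wDPO intervenes at the data level by sparsely correcting strongly inconsistent preference pairs. For ambiguous comparisons, it intervenes at the gradient level through soft winsorization, capping extreme losses in the high-loss tail so that weakly informative samples cannot dominate gradient updates. Extensive experiments on PKU-SafeRLHF and multiple external safety benchmarks show that wDPO consistently improves preference alignment quality and robustness over vanilla DPO and strong DPO-family baselines, with especially pronounced gains under controlled label-flip noise.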
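
The two interventions described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract gives no formulas or hyperparameters, so `flip_threshold`, `tail_quantile`, `beta`, and both function names are hypothetical. The margin is the standard DPO implicit reward margin. The sketch also uses plain (hard) winsorization, clipping losses at a batch quantile, as a stand-in for the paper's soft winsorization, whose exact form is not given here.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def implicit_margin(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO implicit margin from sequence log-probabilities under
    # the policy (pi_*) and the frozen reference model (ref_*).
    return beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))

def winsorized_dpo_loss(margin, flip_threshold=-2.0, tail_quantile=0.9):
    # ASSUMPTION: thresholds and the hard-clipping rule are illustrative only.
    margin = np.asarray(margin, dtype=float)

    # Data-level intervention: treat strongly negative margins as hard noise
    # and flip the label, which is equivalent to negating the margin.
    flipped = margin < flip_threshold
    corrected = np.where(flipped, -margin, margin)

    # Per-pair DPO loss: -log(sigmoid(margin)).
    losses = -log_sigmoid(corrected)

    # Gradient-level intervention: cap the high-loss tail at a batch quantile
    # so a few extreme pairs cannot dominate the mean loss.
    cap = np.quantile(losses, tail_quantile)
    capped = np.minimum(losses, cap)
    return capped.mean(), losses.mean(), flipped, cap

# Usage with made-up margins for a batch of five preference pairs.
margins = np.array([-5.0, -0.5, 0.0, 0.5, 3.0])
capped_mean, raw_mean, flipped, cap = winsorized_dpo_loss(margins)
print(flipped, cap, capped_mean, raw_mean)
```

Clipping with `np.minimum` zeroes the gradient of any pair above the cap, which is harsher than a soft variant would be. In an actual training loop the cap would typically be computed without gradient tracking.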
