Fugu-MT Paper Translation (Abstract): Robust Preference Optimization: Aligning Language Models with Noisy Preference Feedback

Paper overview: Robust Preference Optimization: Aligning Language Models with Noisy Preference Feedback

arxiv url: http://arxiv.org/abs/2509.24159v1
Date: Mon, 29 Sep 2025 01:17:49 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.670841
Title: Robust Preference Optimization: Aligning Language Models with Noisy Preference Feedback
Title (reference translation): Robust preference optimization: aligning language models with noisy preference feedback
Authors: Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu
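Abstract summary: This paper introduces Robust Preference Optimization (RPO) to improve alignment methods. RPO uses an Expectation-Maximization (EM) algorithm to infer the posterior probability that each label is correct. The experiments show that RPO is effective as a meta-framework, consistently enhancing four state-of-the-art alignment algorithms.
Reference score (independently computed attention score): 7.1259212876994695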
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone technology for aligning Large Language Models (LLMs) with human values. However, these methods are all underpinned by a critical, yet flawed assumption: human preferences are homogeneous (representing a single, unified preference) and the collected data is noiseless (free from error). In reality, neither is true since human preference is pluralistic and annotators can make mistakes. This creates a discrepancy between the recorded data and the ground-truth preferences, which can misguide the model and degrade its performance. To address this challenge, we introduce Robust Preference Optimization (RPO). RPO employs an Expectation-Maximization (EM) algorithm to infer the posterior probability of each label's correctness, which is used to adaptively re-weigh each data point in the training loss to mitigate noise. We further generalize this approach by establishing a theoretical link between arbitrary preference losses and their corresponding probabilistic models. This generalization enables the systematic transformation of existing alignment algorithms into their robust counterparts, elevating RPO from a specific algorithm to a meta-framework for robust preference alignment. Theoretically, we prove that under the condition of a perfectly calibrated model, RPO is guaranteed to converge to the true noise level of the dataset. Our experiments demonstrate RPO's effectiveness as a meta-framework, consistently enhancing four state-of-the-art alignment algorithms (DPO, IPO, SimPO, and CPO). When applied to Mistral and Llama 3 models, the RPO-enhanced methods achieve substantial win rate gains on AlpacaEval 2 and Arena-Hard, with improvements of up to 7.0% and 5.4%, respectively.
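Abstract (reference translation): Alignment methods based on human preferences, such as Reinforcement Learning from Human Feedback (RLHF), are a foundational technology for aligning Large Language Models (LLMs) with human values. However, these methods all rest on a critical but flawed assumption: that human preferences are homogeneous (representing a single, unified preference) and that the collected data is noiseless (free from error). In reality, neither holds, because human preferences are pluralistic and annotators can make mistakes. This creates a discrepancy between the recorded data and the ground-truth preferences, which can mislead the model and degrade its performance. To address this challenge, the authors introduce Robust Preference Optimization (RPO). RPO uses an Expectation-Maximization (EM) algorithm to infer the posterior probability that each label is correct, and uses that probability to adaptively re-weight each data point in the training loss to mitigate noise. The approach is further generalized by establishing a theoretical link between arbitrary preference losses and their corresponding probabilistic models. This generalization allows existing alignment algorithms to be systematically transformed into robust counterparts, elevating RPO from a specific algorithm to a meta-framework for robust preference alignment. Theoretically, the authors prove that, under a perfectly calibrated model, RPO is guaranteed to converge to the true noise level of the dataset. The experiments show that RPO is effective as a meta-framework, consistently enhancing four state-of-the-art alignment algorithms (DPO, IPO, SimPO, and CPO). When applied to Mistral and Llama 3 models, the RPO-enhanced methods achieve win rate improvements of up to 7.0% on AlpacaEval 2 and up to 5.4% on Arena-Hard.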
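The EM re-weighting idea in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration rather than the paper's actual formulation: a single symmetric label-flip rate `eps`, a fixed model that supplies the probability `p` of each recorded preference, and the hypothetical helper names `posterior_correct`, `em_noise_rate`, `reweighted_loss`, and `simulate_model_probs`. In the paper's setting the model would also be updated using the re-weighted loss; here it is held fixed so that only the noise-rate estimate moves.

```python
import math
import random


def posterior_correct(p, eps):
    """E-step: posterior probability that a recorded label is correct.

    p   -- model probability of the recorded preference (assumed given)
    eps -- current estimate of the label-flip rate (assumed symmetric)
    """
    clean = (1.0 - eps) * p        # label is correct and the model agrees with it
    flipped = eps * (1.0 - p)      # label was flipped from the opposite preference
    return clean / (clean + flipped)


def em_noise_rate(probs, eps_init=0.1, n_iters=100):
    """Alternate E- and M-steps to estimate the flip rate, model held fixed."""
    eps = eps_init
    for _ in range(n_iters):
        weights = [posterior_correct(p, eps) for p in probs]  # E-step
        eps = 1.0 - sum(weights) / len(weights)                # M-step
        eps = min(max(eps, 1e-6), 1.0 - 1e-6)                  # keep eps inside (0, 1)
    return eps


def reweighted_loss(p, w):
    """Posterior-weighted negative log-likelihood for one preference pair."""
    return -(w * math.log(p) + (1.0 - w) * math.log(1.0 - p))


def simulate_model_probs(n, true_eps, rng):
    """Synthetic data: calibrated model probabilities of noisily recorded labels."""
    probs = []
    for _ in range(n):
        # q = true probability that response A is preferred over response B
        q = 1.0 / (1.0 + math.exp(-rng.gauss(0.0, 2.0)))
        a_preferred = rng.random() < q       # clean label
        if rng.random() < true_eps:          # annotator error flips the label
            a_preferred = not a_preferred
        probs.append(q if a_preferred else 1.0 - q)
    return probs


rng = random.Random(0)
probs = simulate_model_probs(10_000, true_eps=0.2, rng=rng)
eps_hat = em_noise_rate(probs)
print(f"estimated flip rate: {eps_hat:.3f}")
```

On this synthetic data the model is calibrated by construction, so the estimate should land near the simulated flip rate of 0.2, which loosely mirrors the convergence claim in the abstract. Labels the model finds implausible receive a low weight `w`, so `reweighted_loss` pulls the model less strongly toward them.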

Related papers