Fugu-MT Paper Translation (Abstract): Latent Collective Preference Optimization: A General Framework for Robust LLM Alignment

Paper overview: Latent Collective Preference Optimization: A General Framework for Robust LLM Alignment

arxiv url: http://arxiv.org/abs/2509.24159v2
Date: Wed, 01 Oct 2025 03:46:49 GMT
Status: Translation complete
System update date: 2025-10-02 12:11:26.790299
Title: Latent Collective Preference Optimization: A General Framework for Robust LLM Alignment
Title (reference translation): Latent Collective Preference Optimization: A General-Purpose Framework for Robust LLM Alignment
Authors: Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu
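Abstract summary: The paper introduces Latent Collective Preference Optimization (LCPO) to learn the latent collective consensus from noisy data. The experiments demonstrate LCPO's effectiveness as a general framework, consistently enhancing four state-of-the-art alignment algorithms. When applied to Mistral and Llama 3 models, the LCPO-enhanced methods achieve substantial gains on AlpacaEval 2 and Arena-Hard, with improvements of up to 7.0% on both benchmarks.
Reference score (independently computed attention score): 7.1259212876994695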
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone technology for aligning Large Language Models (LLMs) with human values. However, these methods are all underpinned by a critical, yet flawed assumption: human preferences are homogeneous (representing a single, unified preference) and the collected data is noiseless (free from error). In reality, neither is true since human preference is pluralistic and annotators can make mistakes. This creates a discrepancy between the recorded data and the ground-truth preferences, which can misguide the model and degrade its performance. To address this challenge, we introduce Latent Collective Preference Optimization (LCPO). LCPO leverages an Expectation-Maximization (EM) algorithm to learn the latent collective consensus from noisy data. It operates by inferring the correctness of each preference label and using this probability as an adaptive weight to re-calibrate each data point's contribution to the training loss, thereby mitigating noise. We generalize this approach by establishing a theoretical link between arbitrary preference losses and their corresponding probabilistic models, elevating LCPO from a specific algorithm to a general framework for robust preference alignment. Theoretically, we prove that under the condition of a perfectly calibrated model, LCPO is guaranteed to converge to the true noise level of the dataset. Our experiments demonstrate LCPO's effectiveness as a general framework, consistently enhancing four state-of-the-art alignment algorithms (DPO, IPO, SimPO, and CPO). When applied to Mistral and Llama 3 models, the LCPO-enhanced methods achieve substantial win rate gains on AlpacaEval 2 and Arena-Hard, with improvements of up to 7.0% on both benchmarks.
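Abstract (reference translation): Human preference-based alignment methods such as Reinforcement Learning from Human Feedback (RLHF) are a foundational technology for aligning Large Language Models (LLMs) with human values. However, these methods all rest on a critical but flawed assumption: that human preferences are homogeneous (representing a single, unified preference) and that the collected data is noiseless (free from error). In practice neither holds, because human preferences are pluralistic and annotators can make mistakes. This produces a discrepancy between the recorded data and the ground-truth preferences, which can mislead the model and degrade its performance. To address this challenge, we introduce Latent Collective Preference Optimization (LCPO). LCPO uses an Expectation-Maximization (EM) algorithm to learn the latent collective consensus from noisy data. It infers the correctness of each preference label and uses this probability as an adaptive weight to re-calibrate each data point's contribution to the training loss, thereby mitigating noise. We generalize the approach by establishing a theoretical link between arbitrary preference losses and their corresponding probabilistic models, which elevates LCPO from a specific algorithm to a general framework for robust preference alignment. Theoretically, under a perfectly calibrated model, LCPO is guaranteed to converge to the true noise level of the dataset. Our experiments demonstrate LCPO's effectiveness as a general framework, consistently enhancing four state-of-the-art alignment algorithms (DPO, IPO, SimPO, and CPO). Applied to Mistral and Llama 3 models, the LCPO-enhanced methods achieve substantial gains on AlpacaEval 2 and Arena-Hard, with improvements of up to 7.0% on both benchmarks.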
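
Illustrative sketch (not from the paper): the abstract describes an EM loop that infers the probability that each preference label is correct and uses it as an adaptive weight in the training loss. The Python below is a minimal sketch of that idea under stated assumptions, not the authors' implementation. It assumes a simple two-component label-noise model with a single global flip rate `eps`, a Bradley-Terry style preference probability `sigmoid(margin)` (for DPO, the margin would be the scaled log-ratio difference between the chosen and rejected responses), and a loss in which the weight `1 - w` is applied to the flipped label. The function names `e_step`, `m_step_noise`, `weighted_loss`, and `run_em` are hypothetical. To keep the example self-contained the margins are held fixed; in actual training the policy would be updated on the weighted loss between E-steps.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used as a Bradley-Terry style preference probability."""
    return 1.0 / (1.0 + np.exp(-x))


def e_step(margins, eps):
    """Posterior probability that each recorded label is correct.

    Assumption: a label is flipped with one global probability eps, and
    sigmoid(margin) is the model's probability that the recorded winner
    is truly preferred.
    """
    p = sigmoid(margins)
    correct = (1.0 - eps) * p        # label correct and consistent with the model
    flipped = eps * (1.0 - p)        # label flipped by annotation noise
    return correct / (correct + flipped)


def m_step_noise(w):
    """Update the noise estimate as the average probability of a wrong label."""
    return 1.0 - float(np.mean(w))


def weighted_loss(margins, w):
    """Preference loss with each pair re-weighted by its correctness probability.

    Assumption: weight w goes on the recorded label and 1 - w on the flipped one.
    """
    log_p = -np.logaddexp(0.0, -margins)      # log sigmoid(margin), numerically stable
    log_not_p = -np.logaddexp(0.0, margins)   # log (1 - sigmoid(margin))
    return float(np.mean(-(w * log_p + (1.0 - w) * log_not_p)))


def run_em(margins, eps=0.1, n_iters=50):
    """Alternate E-step and noise update with the model margins held fixed."""
    margins = np.asarray(margins, dtype=float)
    w = e_step(margins, eps)
    for _ in range(n_iters):
        w = e_step(margins, eps)
        eps = m_step_noise(w)
    return w, eps


if __name__ == "__main__":
    # Three pairs the model agrees with and one it strongly disagrees with.
    example_margins = np.array([2.0, 2.0, 2.0, -2.0])
    weights, noise = run_em(example_margins)
    print("per-pair weights:", weights)
    print("estimated noise rate:", noise)
    print("weighted loss:", weighted_loss(example_margins, weights))
```

In this sketch, a pair whose recorded label the model disagrees with receives a lower weight than pairs it agrees with, so a likely mislabeled example contributes less to the loss on its recorded label. Whether the paper uses a global or per-example noise parameter, and how it treats the flipped-label term, is not stated in the abstract.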


Related papers