Fugu-MT Paper Translation (Abstract): Adaptive Margin RLHF via Preference over Preferences

Paper summary: Adaptive Margin RLHF via Preference over Preferences

arxiv url: http://arxiv.org/abs/2509.22851v1
Date: Fri, 26 Sep 2025 19:03:24 GMT
Status: Translation complete
System update date: 2025-09-30 22:32:18.910448
Title: Adaptive Margin RLHF via Preference over Preferences
Authors: Yaswanth Chittepu, Prasann Singhal, Greg Durrett, Scott Niekum
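Abstract summary: We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. We introduce DPO-PoP, an extension to Direct Preference Optimization (DPO).
Reference score (attention metric computed independently by the site): 44.328333474444214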
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Margin-based optimization is fundamental to improving generalization and robustness in classification tasks. In the context of reward model learning from preferences within Reinforcement Learning from Human Feedback (RLHF), existing methods typically rely on no margins, fixed margins, or margins that are simplistic functions of preference ratings. However, such formulations often fail to account for the varying strengths of different preferences (for example, some preferences are associated with larger margins between responses), or they rely on noisy margin information derived from ratings. We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. Furthermore, many existing methods that use adaptive margins assume access to accurate preference scores, which can be difficult for humans to provide reliably. We propose an approach that leverages preferences over preferences, that is, annotations indicating which of two preferences reflects a stronger distinction. We use this ordinal signal to infer adaptive margins on a per-datapoint basis. We introduce an extension to Direct Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from preference-over-preference supervision, enabling improved discriminative and generative performance. Empirically, our method outperforms vanilla DPO, DPO with fixed margins, and DPO with ground-truth margins on the UltraFeedback dataset. Additionally, we show that there is a tradeoff between discriminative and generative performance: improving test classification accuracy, particularly by correctly labeling weaker preferences at the expense of stronger ones, can lead to a decline in generative quality. To navigate this tradeoff, we propose two sampling strategies to gather preference-over-preference labels: one favoring discriminative performance and one favoring generative performance.
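
To make the idea of a per-datapoint margin concrete, the following is a minimal sketch of a DPO-style loss with a margin subtracted inside the sigmoid. It is not taken from the paper: the abstract does not state how DPO-PoP infers margins or exactly where the margin enters the objective, so the function name `dpo_margin_loss`, the margin placement, and the default `beta` are assumptions made for illustration. The margins are treated as given inputs here.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign of z so that exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_margin_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp,
                    margins, beta=0.1):
    """Mean DPO-style loss with one margin per preference pair.

    All arguments except beta are equal-length sequences of floats.
    Assumption (hypothetical, not from the paper): the margin is subtracted
    from the scaled implicit-reward gap before the log-sigmoid.
    """
    losses = []
    for pc, pr, rc, rr, m in zip(policy_chosen_logp, policy_rejected_logp,
                                 ref_chosen_logp, ref_rejected_logp, margins):
        # Implicit reward gap between chosen and rejected responses.
        gap = beta * ((pc - rc) - (pr - rr))
        # A larger margin demands a larger gap before the loss becomes small.
        losses.append(-log_sigmoid(gap - m))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Policy equals reference, so the gap is zero for this single pair.
    no_margin = dpo_margin_loss([-1.0], [-2.0], [-1.0], [-2.0], [0.0])
    with_margin = dpo_margin_loss([-1.0], [-2.0], [-1.0], [-2.0], [1.0])
    print(no_margin, with_margin)
```

Setting every margin to zero recovers the usual DPO loss, and a constant margin corresponds to the fixed-margin baseline mentioned in the abstract. An adaptive scheme would instead supply a different value in `margins` for each pair.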
