Fugu-MT Paper Translation (Summary): Variance-aware Reward Modeling with Anchor Guidance

Paper summary: Variance-aware Reward Modeling with Anchor Guidance

arxiv url: http://arxiv.org/abs/2605.11865v1
Date: Tue, 12 May 2026 09:46:26 GMT
Status: Translation complete
System update date: 2026-05-13 21:48:56.768577
Title: Variance-aware Reward Modeling with Anchor Guidance
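Title (reference translation): Variance-aware Reward Modeling with Anchor Guidance
Authors: Shuxing Fang, Ruijian Han, Liangyu Zhang, Fan Zhou
Abstract summary: The paper proposes Anchor-guided Variance-aware Reward Modeling, a framework that resolves the non-identifiability of Gaussian reward models. Across simulation studies and four real-world diverging-preference datasets, the proposed method consistently improves reward modeling performance.
Reference score (independently computed attention measure): 10.561814492691534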
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Standard Bradley--Terry (BT) reward models are limited when human preferences are pluralistic. Although soft preference labels preserve disagreement information, BT can only express it by shrinking reward margins. Gaussian reward models provide an alternative by jointly predicting a reward mean and a reward variance, but suffer from a fundamental non-identifiability from pairwise preferences alone. We propose Anchor-guided Variance-aware Reward Modeling, a framework that resolves this non-identifiability by augmenting preference data with two coarse response-level anchor labels. Building on this, we prove that two anchors are sufficient for identification, develop a joint training objective and establish a non-asymptotic convergence rate for both the estimated reward mean and variance functions. Across simulation studies and four real-world diverging-preference datasets, our method consistently improves reward modeling performance and downstream RLHF, including PPO training and best-of-$N$ selection.
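Abstract (reference translation): Standard Bradley-Terry (BT) reward models are limited when human preferences are pluralistic. Soft preference labels retain disagreement information, but BT can express it only by shrinking reward margins. Gaussian reward models offer an alternative by jointly predicting a reward mean and a reward variance, but they suffer from a fundamental non-identifiability when trained on pairwise preferences alone. Anchor-guided Variance-aware Reward Modeling is a framework that resolves this non-identifiability by augmenting the preference data with two coarse response-level anchor labels. Building on this, the authors prove that two anchors suffice for identification, develop a joint training objective, and establish a non-asymptotic convergence rate for both the estimated reward mean and variance functions. Across simulation studies and four real-world diverging-preference datasets, the method consistently improves reward modeling performance and downstream RLHF, including PPO training and best-of-$N$ selection.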
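
The two limitations named in the abstract can be illustrated numerically. The sketch below is not the paper's formulation: it assumes a common probit-style Gaussian preference model, P(a preferred over b) = Phi((mu_a - mu_b) / sqrt(var_a + var_b)), and every function name and number in it is hypothetical. Under that assumption, shifting all reward means by a constant and rescaling means and standard deviations by the same positive factor leaves every pairwise preference probability unchanged, so pairwise data alone cannot pin down the mean and variance. It also shows that a BT model can only encode a soft label through the size of the reward margin.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gaussian_pref_prob(mu_a: float, var_a: float, mu_b: float, var_b: float) -> float:
    """P(a preferred over b) when rewards are independent Gaussians.

    Assumed model for illustration only; the paper's exact likelihood
    is not given in the abstract.
    """
    return normal_cdf((mu_a - mu_b) / math.sqrt(var_a + var_b))


def bt_pref_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry preference probability: sigmoid of the reward margin."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def bt_margin_for_soft_label(p: float) -> float:
    """Reward margin at which BT reproduces the soft label p (the logit of p)."""
    return math.log(p / (1.0 - p))


# Hypothetical reward means and variances for two responses.
mu_a, var_a = 1.0, 0.5
mu_b, var_b = 0.2, 0.3
p_original = gaussian_pref_prob(mu_a, var_a, mu_b, var_b)

# Shift every mean by `shift`, scale means by c and variances by c**2.
c, shift = 3.0, 7.0
p_rescaled = gaussian_pref_prob(
    c * mu_a + shift, c**2 * var_a,
    c * mu_b + shift, c**2 * var_b,
)

# Same preference probability from different (mean, variance) pairs.
print(math.isclose(p_original, p_rescaled, rel_tol=1e-9))

# BT encodes more disagreement (label closer to 0.5) as a smaller margin.
print(bt_margin_for_soft_label(0.6) < bt_margin_for_soft_label(0.9))
```

Under this assumed model, fixing the shift and the scale takes two independent pieces of extra information, which is at least consistent with the abstract's claim that two anchors suffice. How the paper actually defines and uses its anchor labels is not stated on this page.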
