Fugu-MT Paper Translation (Abstract): Scaling Reward Modeling without Human Supervision

Paper overview: Scaling Reward Modeling without Human Supervision

arxiv url: http://arxiv.org/abs/2603.02225v1
Date: Wed, 11 Feb 2026 04:41:53 GMT
Status: Translation complete
System update date: 2026-03-09 01:20:08.089086
Title: Scaling Reward Modeling without Human Supervision
Title (reference translation): Scaling Reward Modeling Without Human Supervision
Authors: Jingxuan Fan, Yueying Li, Zhenting Qi, Dinghuai Zhang, Kianté Brantley, Sham M. Kakade, Hanlin Zhang
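Abstract summary: Reward-based scaling (RBS) is operationalized as preference learning over document prefixes and suffixes drawn from large-scale web corpora. Despite using no human annotations, training on 11M tokens of math-focused web data yields steady gains on RewardBench v1 and v2. Across models, RewardBench v2 accuracy improves by up to +7.7 points on average, with gains of up to +16.1 on in-domain math subsets.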
Reference score (independently computed attention score): 52.10639750993359
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Learning from feedback is an instrumental process for advancing the capabilities and safety of frontier models, yet its effectiveness is often constrained by cost and scalability. We present a pilot study that explores scaling reward models through unsupervised approaches. We operationalize reward-based scaling (RBS), in its simplest form, as preference learning over document prefixes and suffixes drawn from large-scale web corpora. Its advantage is demonstrated in various aspects: despite using no human annotations, training on 11M tokens of math-focused web data yields steady gains on RewardBench v1 and v2, and these improvements consistently transfer across diverse initialization backbones spanning model families and scales. Across models, our method improves RewardBench v2 accuracy by up to +7.7 points on average, with gains of up to +16.1 on in-domain math subsets and consistent improvements on out-of-domain safety and general subsets. When applied to best-of-N selection and policy optimization, these reward models substantially improve downstream math performance and match or exceed strong supervised reward model baselines of similar size. Overall, we demonstrate the feasibility and promise of training reward models without costly and potentially unreliable human annotations.
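Abstract (reference translation): Learning from feedback is instrumental to advancing the capabilities and safety of frontier models, but its effectiveness is often constrained by cost and scalability. This paper presents a pilot study on scaling reward models through unsupervised approaches. Reward-based scaling (RBS) is operationalized, in its simplest form, as preference learning over document prefixes and suffixes drawn from large-scale web corpora. Despite using no human annotations, training on 11M tokens of math-focused web data yields steady gains on RewardBench v1 and v2, and these improvements transfer consistently across diverse initialization backbones spanning model families and scales. Across models, the method improves RewardBench v2 accuracy by up to +7.7 points on average, with gains of up to +16.1 on in-domain math subsets and consistent improvements on out-of-domain safety and general subsets. When applied to best-of-N selection and policy optimization, these reward models substantially improve downstream math performance and match or exceed strong supervised reward model baselines of similar size. Overall, the work demonstrates the feasibility and promise of training reward models without costly and potentially unreliable human annotations.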
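
The abstract describes RBS as preference learning over document prefixes and suffixes, and mentions best-of-N selection as one downstream use. The following is a minimal sketch of how such a pipeline could look, not the authors' implementation. Several details are assumptions that the abstract does not state: the pairing rule (the document's own suffix is preferred over a suffix taken from a different document), the 50% split point, and the Bradley-Terry pairwise loss, which is a common objective for reward models. The names `split_document`, `build_pairs`, `bradley_terry_loss`, and `best_of_n` are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple

# (prefix, chosen_suffix, rejected_suffix)
Pair = Tuple[str, str, str]


def split_document(text: str, ratio: float = 0.5) -> Tuple[str, str]:
    """Split a document into a prefix and a suffix at a word boundary."""
    words = text.split()
    cut = max(1, min(len(words) - 1, int(len(words) * ratio)))
    return " ".join(words[:cut]), " ".join(words[cut:])


def build_pairs(docs: List[str], rng: random.Random) -> List[Pair]:
    """Build preference pairs without human labels.

    ASSUMPTION: the true suffix of a document is 'chosen' and the suffix
    of a different, randomly drawn document is 'rejected'.
    """
    if len(docs) < 2:
        raise ValueError("need at least two documents to form a pair")
    splits = [split_document(d) for d in docs]
    pairs: List[Pair] = []
    for i, (prefix, suffix) in enumerate(splits):
        j = rng.choice([k for k in range(len(splits)) if k != i])
        pairs.append((prefix, suffix, splits[j][1]))
    return pairs


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def best_of_n(
    prompt: str,
    candidates: List[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate that the reward function scores highest."""
    return max(candidates, key=lambda c: reward_fn(prompt, c))
```

In a real setup, `reward_fn` would be a trained reward model that scores a (prompt, response) pair, and the loss would be minimized over many pairs with a gradient-based optimizer. Here both are left abstract so that the sketch stays self-contained.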

Related papers