Fugu-MT Paper Translation (Abstract): Long-form RewardBench: Evaluating Reward Models for Long-form Generation

Paper overview: Long-form RewardBench: Evaluating Reward Models for Long-form Generation

arxiv url: http://arxiv.org/abs/2603.12963v1
Date: Fri, 13 Mar 2026 13:05:17 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:12.091694
Title: Long-form RewardBench: Evaluating Reward Models for Long-form Generation
Authors: Hui Huang, Yancheng He, Wei Liu, Muyun Yang, Jiaheng Liu, Kehai Chen, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao
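Abstract summary: Long-form RewardBench is the first reward modeling testbed specifically designed for long-form generation. The benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. The findings reveal that current models still lack long-form reward modeling capabilities.
Reference score (independently computed attention score): 61.60385107031075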
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error's position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.

Related papers