Fugu-MT Paper Translation (Abstract): LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

Paper overview: LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

arxiv url: http://arxiv.org/abs/2510.06915v1
Date: Wed, 08 Oct 2025 11:48:16 GMT
Status: Translation complete
System update date: 2025-10-09 16:41:20.466792
Title: LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
Title (reference translation): LongRM: Elucidating and Unlocking the Context Boundary of Reward Modeling
Authors: Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang
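Abstract summary: We introduce Long-RewardBench, a benchmark designed specifically for long-context RM evaluation. A preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios. We propose a general multi-stage training strategy that effectively scales arbitrary models into robust long-context RMs.
Reference score (independently computed attention level): 45.520815757751194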
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., in LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet, current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by the analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust Long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
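Abstract (reference translation): Reward models (RMs) play an important role in aligning large language models (LLMs) with human preferences. Because real-world applications increasingly involve long history trajectories, as with LLM agents, it is essential to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. However, current RMs are confined to short-context settings and focus mainly on response-level attributes (e.g., safety or helpfulness). In this work, we introduce Long-RewardBench, a benchmark for long-context RM evaluation featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios and fail to maintain context-aware preference judgments. Based on an analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust long-context RMs (LongRMs). Experiments show that the approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.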
