Fugu-MT Paper Translation (Abstract): Rethinking Reward Models for Multi-Domain Test-Time Scaling

Paper summary: Rethinking Reward Models for Multi-Domain Test-Time Scaling

arxiv url: http://arxiv.org/abs/2510.00492v2
Date: Thu, 02 Oct 2025 02:37:21 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.371818
Title: Rethinking Reward Models for Multi-Domain Test-Time Scaling
Title (reference translation): Rethinking Reward Models for Multi-Domain Test-Time Scaling
Authors: Dong Bok Lee, Seanie Lee, Sangwoo Park, Minki Kang, Jinheon Baek, Dongki Kim, Dominik Wagner, Jiongdao Jin, Heejun Lee, Tobias Bocklet, Jinyu Wang, Jingjing Fu, Sung Ju Hwang, Jiang Bian, Lei Song
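Abstract summary: Prior work assumes that process reward models (PRMs) outperform outcome reward models (ORMs), which assess only the final answer. The paper presents a unified evaluation of four reward model variants across 14 diverse domains. It attributes the weaker PRM results to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories.
Reference score (independently computed attention score): 91.76069784586149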
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs) that assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at https://github.com/db-Lee/Multi-RM to facilitate future research in multi-domain settings.
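Abstract (reference translation): The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work holds that process reward models (PRMs) outperform outcome reward models (ORMs), which assess only the final answer. This view rests mainly on evidence from narrow, math-adjacent domains. We compare discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM) across 14 domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains in every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and empirical observations confirm this effect. These findings challenge the assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. The code, datasets, and checkpoints are available at https://github.com/db-Lee/Multi-RM.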
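The claim that step-wise aggregation compounds errors as reasoning length grows can be illustrated with a toy calculation. The sketch below is not the paper's analysis: it assumes, purely for illustration, that a step-level verifier misjudges each correct step independently at a fixed rate, and that step scores are combined by taking the minimum. The function names `prob_false_rejection` and `min_aggregate` are hypothetical.

```python
from typing import Sequence


def prob_false_rejection(step_error_rate: float, num_steps: int) -> float:
    """Probability that at least one of `num_steps` correct steps is misjudged.

    Assumption (not from the paper): errors are independent across steps and
    occur at the same fixed rate for every step.
    """
    if not 0.0 <= step_error_rate <= 1.0:
        raise ValueError("step_error_rate must lie in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    # All steps are judged correctly with probability (1 - rate) ** num_steps.
    return 1.0 - (1.0 - step_error_rate) ** num_steps


def min_aggregate(step_scores: Sequence[float]) -> float:
    """Combine per-step scores by taking the minimum.

    Assumption: min-aggregation is one common choice; the abstract does not
    say which aggregation rule the paper uses.
    """
    if not step_scores:
        raise ValueError("step_scores must not be empty")
    return min(step_scores)


if __name__ == "__main__":
    # With a 5% per-step error rate, longer solutions are rejected more often.
    for n in (1, 5, 20, 50):
        print(n, round(prob_false_rejection(0.05, n), 3))
```

Under these assumptions a single low step score is enough to sink an otherwise correct solution, so the false-rejection probability rises with the number of steps, whereas an outcome-level verifier makes only one judgment per solution.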

Related papers