Rethinking Reward Models for Multi-Domain Test-Time Scaling
- URL: http://arxiv.org/abs/2510.00492v2
- Date: Thu, 02 Oct 2025 02:37:21 GMT
- Title: Rethinking Reward Models for Multi-Domain Test-Time Scaling
- Authors: Dong Bok Lee, Seanie Lee, Sangwoo Park, Minki Kang, Jinheon Baek, Dongki Kim, Dominik Wagner, Jiongdao Jin, Heejun Lee, Tobias Bocklet, Jinyu Wang, Jingjing Fu, Sung Ju Hwang, Jiang Bian, Lei Song
- Abstract summary: Prior work generally assumes that process reward models (PRMs) outperform outcome reward models (ORMs) that assess only the final answer. We present the first unified evaluation of four reward model variants across 14 diverse domains. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories.
- Score: 91.76069784586149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers, or reward models, that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs) that assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at https://github.com/db-Lee/Multi-RM to facilitate future research in multi-domain settings.
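The compounding effect described in the abstract is easy to reproduce in a toy model. The sketch below is ours, not from the paper; the per-judgment error rate `eps`, the independence assumption, and the all-steps-must-pass aggregation rule are illustrative simplifications. It scores a fully correct T-step trajectory with a noisy step verifier: under AND-style stepwise aggregation, the probability of accepting the correct trajectory decays as (1 - eps)^T, whereas a single outcome-level judgment stays flat at 1 - eps.

```python
import random

def prm_accepts(num_steps: int, eps: float) -> bool:
    """AND-style stepwise verdict: the trajectory passes only if every
    step is judged correct; each judgment independently flips with
    probability eps (a toy stand-in for auto-labeling noise)."""
    return all(random.random() > eps for _ in range(num_steps))

def orm_accepts(eps: float) -> bool:
    """Outcome-style verdict: one judgment of the final answer,
    subject to the same per-judgment noise."""
    return random.random() > eps

def acceptance_rate(verifier, trials: int = 100_000, **kwargs) -> float:
    """Monte Carlo estimate of how often a verifier accepts a
    trajectory that is in fact fully correct."""
    return sum(verifier(**kwargs) for _ in range(trials)) / trials

if __name__ == "__main__":
    eps = 0.05  # assumed per-judgment error rate; illustrative, not from the paper
    for T in (1, 5, 20, 80):
        prm = acceptance_rate(prm_accepts, num_steps=T, eps=eps)
        orm = acceptance_rate(orm_accepts, eps=eps)
        # Closed form: PRM accepts with prob (1 - eps)**T, ORM with prob 1 - eps
        print(f"T={T:3d}  PRM ~ {prm:.3f} (theory {(1 - eps) ** T:.3f})  ORM ~ {orm:.3f}")
```

Even at a modest 5% per-judgment error rate, the stepwise verdict rejects roughly two thirds of correct 20-step trajectories and nearly all correct 80-step ones, which matches the intuition behind the paper's claim that fine-grained supervision can hurt on long reasoning traces.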
Related papers
- How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains [7.845652284569666]
Miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains. We introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs.
arXiv Detail & Related papers (2026-01-13T01:55:48Z) - Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering [14.119525003137356]
Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. This work presents the first systematic study of PRMs for table question answering (TQA). We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives.
arXiv Detail & Related papers (2025-10-23T07:49:39Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning [33.574626079343936]
We introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset.
arXiv Detail & Related papers (2025-05-26T17:20:17Z) - MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision [27.571090189791303]
We propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. We then generate over 700k step-level annotations without human labeling.
arXiv Detail & Related papers (2025-05-19T17:55:08Z) - Process Reward Models That Think [85.06022494911811]
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs.
arXiv Detail & Related papers (2025-04-23T15:44:54Z) - VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data [21.460891616139534]
We introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
arXiv Detail & Related papers (2025-02-10T18:03:36Z) - SMaRt: Improving GANs with Score Matching Regularity [114.43433222721025]
Generative adversarial networks (GANs) usually struggle in learning from highly diverse data, whose underlying manifold is complex. We find that score matching serves as a promising solution to this issue thanks to its capability of persistently pushing the generated data points towards the real data manifold. We show that our approach can consistently boost the performance of various state-of-the-art GANs on real-world datasets, with pre-trained diffusion models acting as the approximate score function.
arXiv Detail & Related papers (2023-11-30T03:05:14Z) - Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction [67.54420015049732]
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments.
Existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains.
We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2023-05-23T18:01:49Z) - META: Mimicking Embedding via oThers' Aggregation for Generalizable Person Re-identification [68.39849081353704]
Domain generalizable (DG) person re-identification (ReID) aims to generalize to unseen domains at test time, without access to target-domain data during training.
This paper presents a new approach called Mimicking Embedding via oThers' Aggregation (META) for DG ReID.
arXiv Detail & Related papers (2021-12-16T08:06:50Z)