Adversarial Training for Process Reward Models
- URL: http://arxiv.org/abs/2511.22888v1
- Date: Fri, 28 Nov 2025 05:32:01 GMT
- Title: Adversarial Training for Process Reward Models
- Authors: Gurusha Juneja, Deepak Nathani, William Yang Wang
- Abstract summary: We introduce Adversarially Trained PRMs (\texttt{APRM}), where a Generator ($G$) learns to produce reasoning errors to deceive a PRM ($R$). This interaction yields progressively harder negatives for $R$, improving its robustness and generalization to novel errors without requiring manual step-level labels.
- Score: 47.92183495904245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Process Reward Models (PRMs) enhance the reasoning ability of LLMs by providing step-level supervision. However, their widespread adoption is limited by expensive manual step-level annotation and the poor generalization of static training data to novel errors. We introduce Adversarially Trained PRMs (\texttt{APRM}), where a Generator ($G$) learns to produce reasoning errors to deceive a PRM ($R$), while $R$ concurrently learns to detect them. This interaction yields progressively harder negatives for $R$, improving its robustness and generalization to novel errors without requiring manual step-level labels. Averaged across diverse mathematical reasoning benchmarks, \texttt{APRM} improves solver accuracy by $+3.4$ percentage points (pp) over the strongest PRM baseline, and achieves gains of $+5.3$ pp on out-of-distribution tasks.
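The abstract describes a minimax-style loop between $G$ and $R$. As a hedged illustration of that dynamic (not the authors' implementation), the toy sketch below corrupts one step per solution and keeps only the corruptions the current PRM fails to flag; `generator_corrupt` and `prm_score` are stand-ins for the real models.

```python
# Illustrative sketch of an APRM-style adversarial round. The stand-in
# functions and toy data are assumptions, not the paper's code.
import random

def generator_corrupt(solution):
    """Toy stand-in for G: perturb one step to inject an error."""
    steps = list(solution)
    i = random.randrange(len(steps))
    steps[i] = steps[i] + " [error injected]"
    return steps, i

def prm_score(steps):
    """Toy stand-in for R: detects the injected marker 80% of the time."""
    return [0.1 if ("[error injected]" in s and random.random() < 0.8) else 0.9
            for s in steps]

def adversarial_round(clean_solutions, threshold=0.5):
    """One G-vs-R round: keep the corruptions that R failed to flag."""
    hard_negatives = []
    for sol in clean_solutions:
        corrupted, err_idx = generator_corrupt(sol)
        if prm_score(corrupted)[err_idx] >= threshold:  # R missed the error
            hard_negatives.append((corrupted, err_idx))
    return hard_negatives  # would be added to R's next training set

solutions = [["x = 2", "x + x = 4", "answer: 4"]] * 5
print(len(adversarial_round(solutions)), "hard negatives this round")
```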
Related papers
- Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning [59.76691952347156]
Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs). Most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions. We propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL.
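As a minimal sketch of the first-error localization this abstract describes (the threshold and reward values are illustrative assumptions, not VPPO's actual objective), one can credit the prefix before the first flagged step and penalize from that step onward:

```python
# Hedged sketch: use PRM step scores only to find the first error, so
# the correct prefix still receives credit. Values are toy assumptions.
from typing import List, Optional

def first_error_index(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    for i, s in enumerate(step_scores):
        if s < threshold:
            return i
    return None

def prefix_rewards(step_scores: List[float], outcome_reward: float) -> List[float]:
    """Credit steps before the first error; penalize from the error onward."""
    k = first_error_index(step_scores)
    if k is None:
        return [outcome_reward] * len(step_scores)
    return [1.0] * k + [-1.0] * (len(step_scores) - k)

print(prefix_rewards([0.9, 0.8, 0.3, 0.7], outcome_reward=1.0))  # [1.0, 1.0, -1.0, -1.0]
```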
arXiv Detail & Related papers (2026-01-26T21:38:20Z)
- GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning [34.42899160708635]
We introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. It achieves up to a 26% relative improvement in average performance on ProcessBench.
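GroundedPRM's tree-guided labeling is more involved than a short snippet allows, but rollout-based auto-labeling of a step prefix conveys the flavor of automatic process supervision; `continue_solution` and `check_answer` below are hypothetical stand-ins for a sampler and an external verifier, and the 0.5 cutoff is an assumption.

```python
# Loose sketch of rollout-based automatic step labeling, in the spirit
# of tree-guided supervision. All names here are illustrative stand-ins.
import random

def continue_solution(prefix_steps):
    """Hypothetical sampler: finish the solution from a step prefix."""
    return random.choice([42, 41])  # toy final answers

def check_answer(answer, gold):
    """Fidelity check against a trusted source (here, exact match)."""
    return answer == gold

def auto_label_step(prefix_steps, gold, n_rollouts=8):
    """Label a step by how often rollouts through it reach a verified answer."""
    hits = sum(check_answer(continue_solution(prefix_steps), gold)
               for _ in range(n_rollouts))
    return 1 if hits / n_rollouts > 0.5 else 0

print(auto_label_step(["x = 6*7"], gold=42))
```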
arXiv Detail & Related papers (2025-10-16T17:54:07Z)
- When Agents go Astray: Course-Correcting SWE Agents with PRMs [7.017285839527226]
Large Language Model (LLM) agents are increasingly deployed for complex, multi-step software engineering (SWE) tasks. Their trajectories often contain costly inefficiencies, such as redundant exploration, looping, and failure to terminate once a solution is reached. In this paper, we introduce SWE-PRM, an inference-time Process Reward Model (PRM) that intervenes during execution to detect and course-correct trajectory-level errors.
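An inference-time PRM of this kind sits inside the agent loop. The sketch below shows one plausible shape of that intervention, assuming a trajectory-level scorer and an injected feedback message; the agent, scorer, and feedback text are all toy assumptions, not SWE-PRM's interface.

```python
# Illustrative agent loop with an inference-time PRM intervention.
def run_agent(task, agent_step, prm_score_trajectory, max_steps=20, threshold=0.5):
    trajectory = []
    for _ in range(max_steps):
        action = agent_step(task, trajectory)
        trajectory.append(action)
        if action == "DONE":
            break
        if prm_score_trajectory(trajectory) < threshold:
            # Course-correct: surface the PRM's concern to the agent.
            trajectory.append("FEEDBACK: possible loop detected; "
                              "try a different file or re-run the tests.")
    return trajectory

demo = run_agent(
    "fix the failing test",
    agent_step=lambda task, tr: "DONE" if len(tr) > 2 else "ls src/",
    prm_score_trajectory=lambda tr: 0.2 if tr.count("ls src/") > 1 else 0.9,
)
print(demo)  # the repeated action triggers a FEEDBACK injection
```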
arXiv Detail & Related papers (2025-09-02T14:23:15Z)
- Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs). Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
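Q-RM's exact objective is not reproduced here; as a loose illustration of the shared premise (token-level rewards learned from preference pairs alone), the sketch below trains a generic per-token reward head with a Bradley-Terry loss on sequence-summed scores. The architecture and dimensions are assumptions.

```python
# Generic per-token reward head trained from pairwise preferences.
# This is NOT Q-RM's objective; it only illustrates getting token-level
# rewards without step labels. Requires: pip install torch
import torch
import torch.nn as nn

class TokenRewardHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states):                 # (batch, seq, hidden)
        return self.head(hidden_states).squeeze(-1)   # (batch, seq) token rewards

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss on sequence-summed token rewards."""
    margin = r_chosen.sum(-1) - r_rejected.sum(-1)
    return -torch.nn.functional.logsigmoid(margin).mean()

head = TokenRewardHead(hidden_dim=16)
h_good, h_bad = torch.randn(2, 8, 16), torch.randn(2, 8, 16)  # toy hidden states
loss = preference_loss(head(h_good), head(h_bad))
loss.backward()
```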
arXiv Detail & Related papers (2025-05-29T11:40:34Z) - Generalizable Process Reward Models via Formally Verified Training Data [13.781401358802462]
FoVer is an approach to synthesize PRM training data with accurate step-level error labels automatically annotated by formal verification tools. Experiments show that PRMs trained with FoVer exhibit cross-task generalization, enabling a single PRM to effectively perform verification across diverse reasoning tasks.
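To make "labels from a formal verification tool" concrete, here is a small sketch using Z3 as one such tool: a step claiming an algebraic identity is labeled correct iff the negation of the identity is unsatisfiable. The specific tool choice and example step are assumptions for illustration, not FoVer's pipeline.

```python
# Sketch: auto-label a reasoning step with a formal verifier.
# Requires: pip install z3-solver
from z3 import Int, Solver, Not, unsat

def label_step_with_z3(claim) -> int:
    """Return 1 if the claimed identity holds for all integers, else 0."""
    s = Solver()
    s.add(Not(claim))  # valid iff the negation is unsatisfiable
    return 1 if s.check() == unsat else 0

x = Int("x")
print(label_step_with_z3(2 * (x + 3) == 2 * x + 6))  # 1 (correct step)
print(label_step_with_z3(2 * (x + 3) == 2 * x + 3))  # 0 (flawed step)
```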
arXiv Detail & Related papers (2025-05-21T19:23:45Z) - Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning [49.21525229904197]
We propose a novel data annotation method for PRMs specifically designed to score the long CoT reasoning process. We introduce the concepts of Error Propagation and Error Cessation, enhancing PRMs' ability to identify both effective self-correction behaviors and reasoning based on erroneous steps. Our PRM achieves superior performance across various metrics, including search guidance, BoN, and F1 scores.
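One way to read the Error Propagation / Error Cessation distinction is as a labeling rule: steps after the first error stay negative until the model visibly self-corrects, after which steps count as positive again. The toy rule below encodes that reading; the index conventions are assumptions, not the paper's annotation protocol.

```python
# Toy labeling rule illustrating error propagation vs. cessation.
from typing import List, Optional

def label_steps(n_steps: int, first_error: Optional[int],
                correction: Optional[int]) -> List[int]:
    labels = []
    for i in range(n_steps):
        if first_error is None or i < first_error:
            labels.append(1)                      # before any error
        elif correction is not None and i >= correction:
            labels.append(1)                      # error cessation: corrected
        else:
            labels.append(0)                      # error propagation
    return labels

print(label_steps(6, first_error=2, correction=4))  # [1, 1, 0, 0, 1, 1]
```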
arXiv Detail & Related papers (2025-05-20T14:12:05Z) - Process Reward Models That Think [85.06022494911811]
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs.
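A verbalized verifier of this kind writes a verification CoT and ends with a parsable verdict. The sketch below shows that interface shape; `verifier_generate` is a hypothetical stand-in for an LLM call, and the prompt and verdict format are assumptions, not ThinkPRM's actual templates.

```python
# Sketch of a generative (verbalized) step verifier with a parsable verdict.
import re

def verifier_generate(prompt: str) -> str:
    """Hypothetical LLM call; a canned response stands in for the demo."""
    return "The step doubles x correctly since 2*2 = 4. Verdict: correct"

def verify_step(problem: str, steps_so_far: list, step: str) -> bool:
    prompt = (f"Problem: {problem}\n"
              f"Previous steps: {steps_so_far}\n"
              f"Check this step and end with 'Verdict: correct' or "
              f"'Verdict: incorrect'.\nStep: {step}")
    cot = verifier_generate(prompt)  # verification chain-of-thought
    verdict = re.search(r"Verdict:\s*(correct|incorrect)", cot, re.I)
    return bool(verdict) and verdict.group(1).lower() == "correct"

print(verify_step("Compute 2x for x=2", ["x = 2"], "2x = 4"))  # True
```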
arXiv Detail & Related papers (2025-04-23T15:44:54Z) - AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [29.551802573731305]
We propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks.
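The core idea is directly sketchable: open a new step wherever the model generated a token with low confidence, since uncertainty marks a decision point. The probabilities below are toy numbers; in practice they would come from the generating model's logits, and the threshold is an assumption.

```python
# Sketch of confidence-based step division: low-confidence tokens
# mark decision points where a new reasoning step begins.
def split_into_steps(tokens, gen_probs, threshold=0.4):
    """Open a new step before any token generated with low confidence."""
    steps, current = [], []
    for tok, p in zip(tokens, gen_probs):
        if p < threshold and current:   # uncertainty marks a decision point
            steps.append(current)
            current = []
        current.append(tok)
    if current:
        steps.append(current)
    return steps

tokens = ["x", "=", "2", ",", "so", "2x", "=", "4"]
probs  = [0.9, 0.8, 0.3, 0.9, 0.9, 0.9, 0.8, 0.35]
print(split_into_steps(tokens, probs))
# [['x', '='], ['2', ',', 'so', '2x', '='], ['4']]
```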
arXiv Detail & Related papers (2025-02-19T18:35:55Z) - Free Process Rewards without Process Labels [55.14044050782222]
We show that an \textit{implicit} PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. We show that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction.
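The "free" step rewards come from reading per-step rewards off prefix log-ratios, assuming the standard log-ratio reward parameterization $\beta \log \frac{\pi(y)}{\pi_\text{ref}(y)}$ for the ORM. The sketch below computes them from toy token log-probabilities; the step boundaries and numbers are assumptions.

```python
# Sketch: implicit per-step rewards from prefix log-ratios, assuming an
# ORM parameterized as beta * log(pi / pi_ref). Toy values throughout.
# Requires: pip install torch
import torch

def implicit_step_rewards(logp_policy, logp_ref, step_ends, beta=0.5):
    """Step reward = beta * (prefix log-ratio) minus the previous prefix's."""
    ratios = (logp_policy - logp_ref).cumsum(-1)  # prefix log pi/pi_ref
    prefix = beta * ratios[step_ends]             # value at each step boundary
    return prefix - torch.cat([torch.zeros(1), prefix[:-1]])

logp_policy = torch.tensor([-1.0, -0.5, -2.0, -0.4, -1.2])
logp_ref    = torch.tensor([-1.2, -0.9, -1.0, -0.8, -1.1])
step_ends   = torch.tensor([1, 4])                # token index closing each step
print(implicit_step_rewards(logp_policy, logp_ref, step_ends))  # tensor([0.30, -0.35])
```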
arXiv Detail & Related papers (2024-12-02T21:20:02Z)