Efficient Process Reward Model Training via Active Learning
- URL: http://arxiv.org/abs/2504.10559v1
- Date: Mon, 14 Apr 2025 14:53:56 GMT
- Title: Efficient Process Reward Model Training via Active Learning
- Authors: Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, Longxu Dou,
- Abstract summary: Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs). We propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training. A capable yet costly reasoning model then labels this data. Subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with same-sized models.
- Score: 27.846449143217704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass, retaining only highly uncertain data. A capable yet costly reasoning model then labels this data. We then compute the loss with respect to the labels and update the PRM's weights. We compare ActPRM with vanilla fine-tuning in a pool-based active learning setting, demonstrating that ActPRM reduces annotation by 50% while achieving comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering over 1M+ math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with same-sized models.
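The abstract describes a single training-time loop: score a batch with the current PRM, keep only the most uncertain trajectories, send those to a costly reasoning model for labeling, and update on the resulting labels. The sketch below is an illustrative reconstruction of that loop, not the authors' released code: `prm`, `expensive_labeler`, the entropy-based `step_uncertainty`, and the threshold `tau` are hypothetical stand-ins for whatever model interface and uncertainty estimate ActPRM actually uses.

```python
import torch
import torch.nn.functional as F

def step_uncertainty(step_logits: torch.Tensor) -> torch.Tensor:
    """Mean entropy of the per-step correct/incorrect predictions.

    step_logits: (num_steps, 2) logits, one row per reasoning step.
    Higher entropy means the PRM is less sure about this trajectory.
    """
    probs = F.softmax(step_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy.mean()

def active_training_step(prm, optimizer, batch, expensive_labeler, tau=0.5):
    """One ActPRM-style update: forward pass, keep uncertain samples,
    label only those, then train on them (hypothetical interfaces)."""
    # 1. Forward pass: the PRM scores every step of every trajectory.
    with torch.no_grad():
        step_logits = [prm(traj) for traj in batch]  # each (num_steps, 2)

    # 2. Retain only the highly uncertain trajectories.
    uncertain = [traj for traj, logits in zip(batch, step_logits)
                 if step_uncertainty(logits) > tau]
    if not uncertain:
        return 0.0  # nothing in this batch is worth annotating

    # 3. A capable but costly reasoning model labels the retained data.
    step_labels = [expensive_labeler(traj) for traj in uncertain]  # (num_steps,) of 0/1

    # 4. Compute the loss w.r.t. the new labels and update the PRM.
    optimizer.zero_grad()
    loss = sum(F.cross_entropy(prm(traj), labels)
               for traj, labels in zip(uncertain, step_labels)) / len(uncertain)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the pool-based setting the abstract compares against, the same uncertainty gate would be applied to an unlabeled pool rather than a streaming batch, which is where the reported roughly 50% annotation savings come from.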
Related papers
- Process Reward Models That Think [86.88809596842428]
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling.
This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT).
We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs.
arXiv Detail & Related papers (2025-04-23T15:44:54Z) - Entropy-Based Adaptive Weighting for Self-Training [15.089334734753677]
We propose Entropy-Based Adaptive Weighting for Self-Training (EAST), an adaptive weighting strategy designed to prioritize uncertain data during self-training. We evaluate our approach on GSM8K and MATH benchmarks.
arXiv Detail & Related papers (2025-03-31T10:04:35Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Free Process Rewards without Process Labels [55.14044050782222]
We show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. We show that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction.
arXiv Detail & Related papers (2024-12-02T21:20:02Z) - Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [90.23629291067763]
A promising approach for improving reasoning in large language models is to use process reward models (PRMs).
PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs).
To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?"
We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL.
arXiv Detail & Related papers (2024-10-10T17:31:23Z) - Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data.
We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z) - Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.