FreePRM: Training Process Reward Models Without Ground Truth Process Labels
- URL: http://arxiv.org/abs/2506.03570v1
- Date: Wed, 04 Jun 2025 04:33:53 GMT
- Title: FreePRM: Training Process Reward Models Without Ground Truth Process Labels
- Authors: Lin Sun, Chuang Liu, Xiaofeng Ma, Tao Yang, Weijia Lu, Ning Wu
- Abstract summary: FreePRM is a weakly supervised framework for training PRMs without access to ground-truth step-level labels. Experimental results show that FreePRM achieves an average F1 score of 53.0% on ProcessBench, outperforming a fully supervised PRM trained on Math-Shepherd by +24.1%.
- Score: 15.154544065092628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated that Process Reward Models (PRMs) play a crucial role in enhancing model performance. However, training PRMs typically requires step-level labels, either manually annotated or automatically generated, which can be costly and difficult to obtain at scale. To address this challenge, we introduce FreePRM, a weakly supervised framework for training PRMs without access to ground-truth step-level labels. FreePRM first generates pseudo step-level labels based on the correctness of the final outcome, and then employs a Buffer Probability to eliminate the impact of noise inherent in pseudo labeling. Experimental results show that FreePRM achieves an average F1 score of 53.0% on ProcessBench, outperforming a fully supervised PRM trained on Math-Shepherd by +24.1%. Compared to other open-source PRMs, FreePRM outperforms RLHFlow-PRM-Mistral-8B (28.4%) by +24.6%, EurusPRM (31.3%) by +21.7%, and Skywork-PRM-7B (42.1%) by +10.9%. This work introduces a new paradigm in PRM training, significantly reducing reliance on costly step-level annotations while maintaining strong performance.
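The abstract's two ingredients, outcome-derived pseudo step labels and a buffer term that absorbs their noise, can be illustrated with a short sketch. The three-way head over {correct, incorrect, buffer} and the pooled-probability loss below are assumptions made for illustration; the paper's exact parameterization and loss may differ.

```python
import torch

def freeprm_style_loss(step_logits: torch.Tensor, outcome_correct: bool) -> torch.Tensor:
    """Illustrative loss for outcome-derived pseudo step labels with a buffer class.

    step_logits: (num_steps, 3) logits over {correct, incorrect, buffer} for the
                 steps of one sampled solution (the three-way head is an assumption
                 of this sketch, not necessarily FreePRM's exact design).
    outcome_correct: whether the solution's final answer matched the reference.
    """
    probs = step_logits.softmax(dim=-1)                  # (num_steps, 3)
    p_correct, p_incorrect, p_buffer = probs.unbind(-1)

    # Pseudo step labels come only from the final outcome: a correct answer pushes
    # every step toward "correct or buffer", a wrong answer toward "incorrect or
    # buffer". The buffer class absorbs steps whose true quality disagrees with
    # the outcome, which is the labeling noise the abstract refers to.
    pooled = p_correct + p_buffer if outcome_correct else p_incorrect + p_buffer
    return -pooled.clamp_min(1e-8).log().mean()

# Example: five steps of one solution whose final answer was correct.
logits = torch.randn(5, 3, requires_grad=True)
loss = freeprm_style_loss(logits, outcome_correct=True)
loss.backward()
```

At inference one would presumably read off only the correct-class probability for each step, so under this sketch the buffer mass serves purely as training-time slack.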
Related papers
- VRPRM: Process Reward Modeling via Visual Reasoning [1.4076905229310113]
We propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL training data, VRPRM can surpass a non-thinking PRM trained on a total data volume of 400K.
arXiv Detail & Related papers (2025-08-05T15:25:24Z) - Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model [56.92219181993453]
We propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix) to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) a KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady improvement.
arXiv Detail & Related papers (2025-07-09T14:29:45Z) - ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [56.32212611983997]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM for evaluating trajectory-response style reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z) - Training Step-Level Reasoning Verifiers with Formal Verification Tools [10.625896243556578]
We propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools. Although this label synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks.
arXiv Detail & Related papers (2025-05-21T19:23:45Z) - Process Reward Models That Think [86.88809596842428]
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long-CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs.
arXiv Detail & Related papers (2025-04-23T15:44:54Z) - Efficient Process Reward Model Training via Active Learning [27.846449143217704]
Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs). We propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training; a capable yet costly reasoning model then labels this data (a generic sketch of this uncertainty-driven selection appears after this list). Subsequent training on the selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with same-sized models.
arXiv Detail & Related papers (2025-04-14T14:53:56Z) - The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in reasoning processes. We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs), and we release a new state-of-the-art PRM that outperforms existing open-source alternatives.
arXiv Detail & Related papers (2025-01-13T13:10:16Z) - Free Process Rewards without Process Labels [55.14044050782222]
We show that an implicit PRM can be obtained at no additional cost by simply training an ORM on the cheaper response-level labels (a sketch of reading per-step rewards off such an implicit PRM appears after this list). Our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction.
arXiv Detail & Related papers (2024-12-02T21:20:02Z) - RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences. We introduce a causal framework that learns preferences independent of artifacts unrelated to response quality. Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
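As referenced in the ActPRM entry above, uncertainty-driven sample selection for PRM training can be sketched generically: score each unlabeled solution by the current PRM's per-step predictive uncertainty and forward only the most uncertain ones to the expensive labeler. The `prm_step_probs` interface and the mean-entropy criterion below are illustrative assumptions, not ActPRM's actual selection rule.

```python
import math
from typing import Callable, List, Sequence

def select_uncertain(
    samples: Sequence[dict],
    prm_step_probs: Callable[[dict], List[float]],  # hypothetical: P(step correct) per step
    budget: int,
) -> List[dict]:
    """Return the `budget` samples whose step predictions are least confident."""

    def binary_entropy(p: float) -> float:
        p = min(max(p, 1e-8), 1.0 - 1e-8)
        return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

    scored = []
    for sample in samples:
        probs = prm_step_probs(sample)
        # Mean per-step entropy: high when the PRM is unsure about many steps.
        score = sum(binary_entropy(p) for p in probs) / max(len(probs), 1)
        scored.append((score, sample))

    scored.sort(key=lambda pair: pair[0], reverse=True)  # most uncertain first
    return [sample for _, sample in scored[:budget]]
```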
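The "Free Process Rewards without Process Labels" entry states that an ORM trained on response-level labels yields an implicit PRM for free. One common way to realize this is to parameterize the outcome reward as a scaled log-likelihood ratio against a reference model, so per-step rewards fall out as increments of that ratio over each step's tokens. The sketch below assumes that parameterization; the tensor layout and `beta` are illustrative choices, not that paper's exact interface.

```python
import torch

def implicit_step_rewards(
    logp_policy: torch.Tensor,  # (T,) token log-probs of the response under the trained model
    logp_ref: torch.Tensor,     # (T,) token log-probs under the frozen reference model
    step_ends: list,            # index of the last token of each reasoning step
    beta: float = 1.0,
) -> torch.Tensor:
    """Per-step rewards as increments of beta * log(pi / pi_ref) at step boundaries."""
    # Cumulative log-likelihood ratio after each token of the response.
    cum_ratio = (logp_policy - logp_ref).cumsum(dim=0)
    # Implicit reward of each prefix ending at a step boundary.
    prefix_reward = beta * cum_ratio[torch.tensor(step_ends)]
    # Credit each step with the change it contributed over the previous prefix.
    previous = torch.cat([prefix_reward.new_zeros(1), prefix_reward[:-1]])
    return prefix_reward - previous

# Example: a 12-token response split into steps ending at tokens 3, 7, and 11.
rewards = implicit_step_rewards(torch.randn(12), torch.randn(12), [3, 7, 11])
```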