More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
- URL: http://arxiv.org/abs/2503.22233v3
- Date: Thu, 09 Oct 2025 02:37:01 GMT
- Title: More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
- Authors: Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Huacong Xu, Yuxian Wang, Wu Ning, Qian Chen, Mofan Peng, Zijie Chen, Peishuo Su, Sirui Han, Yitong Li,
- Abstract summary: We introduce EDU-PRM, a novel entropy-driven training framework for process reward modeling.<n>It enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations.<n>On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines.<n>By leveraging our proposed EDU sampling strategy, we observe accuracy boosts from 64.7% to 67.3% for generative reasoning tasks, accompanied by a reduction of 32% in token usage.
- Score: 14.360157606494232
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel entropy-driven training framework for process reward modeling that enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines, such as Math-Shepherd PRM and Omega PRM, and EDU-PRM achieves comparable results with SOTA models while only using 1.5% training data. Furthermore, by leveraging our proposed EDU sampling strategy, we observe accuracy boosts from 64.7% to 67.3% for generative reasoning tasks, accompanied by a reduction of 32% in token usage. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
Related papers
- TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning [77.01182934427095]
TaTToo is a novel table-grounded PRM framework that integrates tool-based verification to provide precise reward supervision.<n>We train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning to align our model with table-based verification.
arXiv Detail & Related papers (2025-10-07T17:59:41Z) - Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning [10.227089771963943]
We propose an uncertainty-driven framework for automated process reward data construction.<n>We introduce two generic uncertainty-aware output aggregation methods: Hybrid Majority Reward Vote and weighted Reward Frequency Vote.<n>Experiments on ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the proposed PRM data construction framework.
arXiv Detail & Related papers (2025-08-03T14:14:13Z) - ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [56.32212611983997]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM to evaluate trajectory-response type of reasoning traces.<n>ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data.<n>Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z) - Reinforce LLM Reasoning through Multi-Agent Reflection [8.088795955922656]
We introduce DPSDP, a reinforcement learning algorithm that trains an actor-critic LLM system to iteratively refine answers via direct preference learning on self-generated data.<n>Theoretically, DPSDP can match the performance of any policy within the training distribution.<n>For example, on benchmark MATH 500, majority voting over five refinement steps increases first-turn accuracy from 58.2% to 63.2% with Ministral-based models.
arXiv Detail & Related papers (2025-06-10T02:43:47Z) - Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs)<n>Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations.<n>Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z) - Entropy-Based Adaptive Weighting for Self-Training [15.089334734753677]
We propose Entropy-Based Adaptive Weighting for Self-Training (EAST)
EAST is an adaptive weighting strategy designed to prioritize uncertain data during self-training.
We evaluate our approach on GSM8K and MATH benchmarks.
arXiv Detail & Related papers (2025-03-31T10:04:35Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.<n>Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.<n>We propose Reasoning-Driven Process Reward Modeling (R-PRM)<n>R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - Better Process Supervision with Bi-directional Rewarding Signals [41.929678717412266]
We introduce BiRM, a novel process supervision model that evaluates correctness of previous steps and also models the probability of future success.<n>We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps.<n>In search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
arXiv Detail & Related papers (2025-03-06T17:03:17Z) - An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning [11.691011429576243]
This paper introduces a framework called EpicPRM, which annotates each intermediate reasoning step based on its quantified contribution.<n>We efficiently construct a high-quality process supervision training dataset named Epic50k, consisting of 50k annotated intermediate steps.
arXiv Detail & Related papers (2025-03-04T08:18:46Z) - AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [29.551802573731305]
We propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word.<n>We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks.
arXiv Detail & Related papers (2025-02-19T18:35:55Z) - Process Reinforcement through Implicit Rewards [95.7442934212076]
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs)
Dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards.
This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive.
We propose PRIME, which enables online PRM updates using only policy rollouts and outcome labels through implict process rewards
arXiv Detail & Related papers (2025-02-03T15:43:48Z) - The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning processes.<n>We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs)<n>We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
arXiv Detail & Related papers (2025-01-13T13:10:16Z) - Entropy-Regularized Process Reward Model [30.279394036823092]
Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning.<n>We propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDP)<n>Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models.
arXiv Detail & Related papers (2024-12-15T01:09:23Z) - The Surprising Effectiveness of Test-Time Training for Few-Shot Learning [59.309477460893916]
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks.<n>We investigate the effectiveness of test-time training (TTT) as a mechanism for improving LMs' reasoning and few-shot learning capabilities.<n>Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
arXiv Detail & Related papers (2024-11-11T18:59:45Z) - Continuous Approximations for Improving Quantization Aware Training of LLMs [4.435218424434634]
Quantization Aware Training (QAT), an effective model compression method, is proposed to reduce performance degradation after quantization.
We introduce two continuous approximations to the QAT process on the rounding function, traditionally approximated by the Straight-Through Estimator (STE) and the clamping function.
By applying both methods, the perplexity (PPL) on the WikiText-v2 dataset of the quantized model reaches 9.0815, outperforming 9.9621 by the baseline.
arXiv Detail & Related papers (2024-10-06T04:33:06Z) - Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Fast-ELECTRA for Efficient Pre-training [83.29484808667532]
ELECTRA pre-trains language models by detecting tokens in a sequence that have been replaced by an auxiliary model.
We propose Fast-ELECTRA, which leverages an existing language model as the auxiliary model.
Our approach rivals the performance of state-of-the-art ELECTRA-style pre-training methods, while significantly eliminating the computation and memory cost brought by the joint training of the auxiliary model.
arXiv Detail & Related papers (2023-10-11T09:55:46Z) - No MCMC for me: Amortized sampling for fast and stable training of
energy-based models [62.1234885852552]
Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty.
We present a simple method for training EBMs at scale using an entropy-regularized generator to amortize the MCMC sampling.
Next, we apply our estimator to the recently proposed Joint Energy Model (JEM), where we match the original performance with faster and stable training.
arXiv Detail & Related papers (2020-10-08T19:17:20Z) - How to Train Your Energy-Based Model for Regression [107.54411649704194]
Energy-based models (EBMs) have become increasingly popular within computer vision in recent years.
Recent work has applied EBMs also for regression tasks, achieving state-of-the-art performance on object detection and visual tracking.
How EBMs should be trained for best possible regression performance is not a well-studied problem.
arXiv Detail & Related papers (2020-05-04T17:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.