Related papers: The Lessons of Developing Process Reward Models in Mathematical Reasoning

URL: http://arxiv.org/abs/2501.07301v1
Date: Mon, 13 Jan 2025 13:10:16 GMT
Title: The Lessons of Developing Process Reward Models in Mathematical Reasoning
Authors: Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
Abstract summary: Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning processes. We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with LLM-as-a-judge. We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
Score: 62.165534879284735
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provide practical guidelines for future research in building process supervision models.
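
A minimal sketch of what a consensus filter of this kind might look like. It assumes each training sample carries per-step MC scores and per-step LLM-judge verdicts, and that a sample is kept only when both annotators agree on the position of the first erroneous step (or agree that there is none). The abstract does not spell out the agreement rule, so the data layout, the names `AnnotatedSolution`, `first_error_mc`, `first_error_judge`, and `consensus_filter`, and the threshold are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedSolution:
    """Hypothetical container for one solution annotated by both methods."""
    steps: List[str]
    mc_scores: List[float]     # assumed: fraction of rollouts from each step that reach the correct answer
    judge_labels: List[bool]   # assumed: True if the LLM judge considers the step correct


def first_error_mc(scores: List[float], threshold: float = 0.0) -> Optional[int]:
    """Index of the first step whose MC score is at or below the threshold, else None."""
    for i, score in enumerate(scores):
        if score <= threshold:
            return i
    return None


def first_error_judge(labels: List[bool]) -> Optional[int]:
    """Index of the first step the judge marks as incorrect, else None."""
    for i, ok in enumerate(labels):
        if not ok:
            return i
    return None


def consensus_filter(samples: List[AnnotatedSolution],
                     threshold: float = 0.0) -> List[AnnotatedSolution]:
    """Keep only samples where both annotators locate the first error at the same step."""
    return [
        s for s in samples
        if first_error_mc(s.mc_scores, threshold) == first_error_judge(s.judge_labels)
    ]
```

Under these assumptions, a sample whose MC scores first drop to zero at step 2 is retained only if the judge also flags step 2 as the first incorrect step. Disagreements are discarded rather than resolved, which trades data volume for label quality.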

Related papers

Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns [79.42805969325036]
Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks. PRMs are required to identify errors under various reasoning patterns during the reasoning process. Existing benchmarks mainly focus on evaluating PRMs with stepwise correctness. We introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns.
arXiv Detail & Related papers (2025-05-29T14:26:53Z)
From Mathematical Reasoning to Code: Generalization of Process Reward Models in Test-Time Scaling [32.72867198629561]
We investigate the interplay between pre-training and reward model training FLOPs to assess their influence on PRM efficiency and accuracy. Our findings indicate that PRMs trained on mathematical datasets exhibit performance comparable to those tailored for code generation.
arXiv Detail & Related papers (2025-05-24T12:44:15Z)
Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning [49.21525229904197]
We propose a novel data annotation method for PRMs specifically designed to score the long CoT reasoning process. We introduce the concepts of Error Propagation and Error Cessation, enhancing PRMs' ability to identify both effective self-correction behaviors and reasoning based on erroneous steps. Our PRM achieves superior performance across various metrics, including search guidance, BoN, and F1 scores.
arXiv Detail & Related papers (2025-05-20T14:12:05Z)
R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. We propose Reasoning-Driven Process Reward Modeling (R-PRM). R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z)
Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning [32.850036320802474]
We introduce Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework designed to tackle OOD issues. By utilizing a two-stage retrieval-enhanced mechanism, RetrievalPRM retrieves semantically similar questions and steps as a warmup. Our experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets.
arXiv Detail & Related papers (2025-02-20T08:40:09Z)
BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model. We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z)
ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [25.329712997545794]
We propose Retrieval-Augmented Reasoning through Trustworthy Process Rewarding (ReARTeR). ReARTeR enhances RAG systems' reasoning capabilities through post-training and test-time scaling. Experimental results on multi-step reasoning benchmarks demonstrate significant improvements.
arXiv Detail & Related papers (2025-01-14T05:56:26Z)
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [28.74956741932006]
We introduce PRMBench, a process-level benchmark to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions.
arXiv Detail & Related papers (2025-01-06T16:31:45Z)
Entropy-Regularized Process Reward Model [30.279394036823092]
Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning. We propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDP). Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models.
arXiv Detail & Related papers (2024-12-15T01:09:23Z)
Process Reward Model with Q-Value Rankings [18.907163177605607]
Process Reward Modeling (PRM) is critical for complex reasoning and decision-making tasks. We introduce the Process Q-value Model (PQM), a novel framework that redefines PRM in the context of a Markov Decision Process. PQM optimizes Q-value rankings based on a novel comparative loss function, enhancing the model's ability to capture the intricate dynamics among sequential decisions.
arXiv Detail & Related papers (2024-10-15T05:10:34Z)
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment [44.84304822376291]
Reward models (RMs) guide the alignment of large language models (LLMs). We propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios. Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs.
arXiv Detail & Related papers (2024-10-13T16:06:54Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (namely PCRM) training method to mitigate this problem. PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins. Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z)
Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase. We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs. To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similarly improved performance in code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.