Related papers: Free Process Rewards without Process Labels

Free Process Rewards without Process Labels

URL: http://arxiv.org/abs/2412.01981v1
Date: Mon, 02 Dec 2024 21:20:02 GMT
Title: Free Process Rewards without Process Labels
Authors: Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng,
Abstract summary: We show that an textitimplicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels.<n>We show that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction.
Score: 55.14044050782222
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Different from its counterpart outcome reward models (ORMs), which evaluate the entire responses, a process reward model (PRM) scores a reasoning trajectory step by step, providing denser and more fine grained rewards. However, training a PRM requires labels annotated at every intermediate step, presenting significant challenges for both manual and automatic data collection. This paper aims to address this challenge. Both theoretically and empirically, we show that an \textit{implicit PRM} can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. The only assumption is to parameterize the outcome reward as the log-likelihood ratios of the policy and reference models, which can be optimized regardless of the specific choice of loss objectives. In experiments, we instantiate our implicit PRMs with various objectives and evaluate their performance on MATH. We show that our implicit PRM outperforms a strong MCTS-based baseline \textit{\'a la} Math-Shepherd using less than $1/38$ of the training data. Its performance can be further improved with majority voting. We further find that scaling up instructions and responses benefits our implicit PRM, and the latter brings a larger gain. Particularly, we find that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction, the setup that suffers from extreme data scarcity and imbalance. Further, instructions should be relevant to downstream tasks while the diversity of responses does not bring gains. Surprisingly, training on extra Math-Shepherd step labels brings no further improvements to our implicit PRM trained on only outcome data. We hope that our work will encourage a rethinking of PRM training approaches and contribute to making training PRMs more accessible.

Related papers

VRPRM: Process Reward Modeling via Visual Reasoning [1.4076905229310113]
We propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy.<n>Using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL training data, VRPRM can surpass the non-thinking PRM with a total data volume of 400K.
arXiv Detail & Related papers (2025-08-05T15:25:24Z)
Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback [52.1410307583181]
We useReinforcement Learning from Human Feedback to train language models (LMs) to follow complex human preferences.<n>As training progresses, the responses generated by the LM no longer resemble the responses seen by the reward model (RM)<n>We propose Off-Policy Corrected Reward Modeling to correct the RM using importance weighting, without requiring new labels or samples.
arXiv Detail & Related papers (2025-07-21T11:19:04Z)
Training Step-Level Reasoning Verifiers with Formal Verification Tools [10.625896243556578]
We propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools.<n>FoVer is feasible only for tasks compatible with formal verification.<n>We observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks.
arXiv Detail & Related papers (2025-05-21T19:23:45Z)
Reward Reasoning Model [104.39256985858428]
Reward Reasoning Models (RRMs) are designed to execute a deliberate reasoning process before generating final rewards.<n>We implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities.<n> Notably, RRMs can adaptively exploit test-time compute to further improve reward accuracy.
arXiv Detail & Related papers (2025-05-20T17:58:03Z)
Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning [49.21525229904197]
We propose a novel data annotation method for PRMs specifically designed to score the long CoT reasoning process.<n>We introduce the concepts of Error Propagation and Error Cessation, enhancing PRMs' ability to identify both effective self-correction behaviors and reasoning based on erroneous steps.<n>Our PRM achieves superior performance across various metrics, including search guidance, BoN, and F1 scores.
arXiv Detail & Related papers (2025-05-20T14:12:05Z)
Process Reward Models That Think [86.88809596842428]
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT) We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs.
arXiv Detail & Related papers (2025-04-23T15:44:54Z)
Efficient Process Reward Model Training via Active Learning [27.846449143217704]
Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs) We propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training. A capable yet costly reasoning model then labels this data. A subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with same sized models.
arXiv Detail & Related papers (2025-04-14T14:53:56Z)
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [29.551802573731305]
We propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks.
arXiv Detail & Related papers (2025-02-19T18:35:55Z)
Process Reinforcement through Implicit Rewards [95.7442934212076]
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs) Dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive. We propose PRIME, which enables online PRM updates using only policy rollouts and outcome labels through implict process rewards
arXiv Detail & Related papers (2025-02-03T15:43:48Z)
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [90.23629291067763]
A promising approach for improving reasoning in large language models is to use process reward models (PRMs) PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?" We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL.
arXiv Detail & Related papers (2024-10-10T17:31:23Z)
RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences. We introduce a causal framework that learns preferences independent of these artifacts. Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z)
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [50.45155830888697]
We develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget.
arXiv Detail & Related papers (2024-06-06T07:40:00Z)
Higher Replay Ratio Empowers Sample-Efficient Multi-Agent Reinforcement Learning [47.17030172520195]
The sample efficiency for Multi-Agent Reinforcement Learning (MARL) is more challenging because of its inherent partial observability, non-stationary training, and enormous strategy space. We argue that the widely used episodic training mechanism could be a source of poor sample efficiency. To better exploit the data already collected, we propose to increase the frequency of the gradient updates per environment interaction.
arXiv Detail & Related papers (2024-04-15T12:18:09Z)
Reinforcement Learning from Bagged Reward [46.16904382582698]
In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent. In many real-world scenarios, designing immediate reward signals is difficult. We propose a novel reward redistribution method equipped with a bidirectional attention mechanism.
arXiv Detail & Related papers (2024-02-06T07:26:44Z)
Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL) Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training. The superiority of the DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with the SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.