Related papers: Towards Robust Process Reward Modeling via Noise-aware Learning

URL: http://arxiv.org/abs/2601.12748v1
Date: Mon, 19 Jan 2026 06:03:58 GMT
Title: Towards Robust Process Reward Modeling via Noise-aware Learning
Authors: Bin Xie, Bingbing Xu, Xueyun Tian, Yilin Chen, Huawei Shen
Abstract summary: We propose a two-stage framework to mitigate noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge. In the training stage, we propose a Noise-Aware Iterative Training framework that enables the PRM to progressively refine noisy labels.
Score: 33.1289107681179
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Process Reward Models (PRMs) have achieved strong results in complex reasoning, but are bottlenecked by costly process-level supervision. A widely used alternative, Monte Carlo Estimation (MCE), defines process rewards as the probability that a policy model reaches the correct final answer from a given reasoning step. However, step correctness is an intrinsic property of the reasoning trajectory, and should be invariant to policy choice. Our empirical findings show that MCE produces policy-dependent rewards that induce label noise, including false positives that reward incorrect steps and false negatives that penalize correct ones. To address these challenges, we propose a two-stage framework to mitigate noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge to detect reflection and self-correction behaviors related to the current reasoning step, thereby suppressing overestimated rewards. In the training stage, we further propose a Noise-Aware Iterative Training framework that enables the PRM to progressively refine noisy labels based on its own confidence. Extensive experiments show that our method substantially improves step-level correctness discrimination, achieving up to a 27% absolute gain in average F1 over PRMs trained with noisy supervision.
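Illustration: the abstract's description of MCE can be made concrete with a small sketch. The code below is not the authors' implementation; it is a minimal toy under stated assumptions. The `rollout` callable, the `make_toy_policy` helper, the `"ERROR"` marker, and the answers `"42"`/`"41"` are all hypothetical stand-ins for sampling completions from a policy LLM. It estimates a step reward as the fraction of sampled completions that reach the gold answer, and shows how the same flawed prefix can receive a high or low reward depending on the policy used for rollouts.

```python
import random
from typing import Callable, Sequence

# Hypothetical interface: a rollout takes the reasoning prefix and an RNG and
# returns a final answer. It stands in for sampling from a policy LLM.
Rollout = Callable[[Sequence[str], random.Random], str]


def mce_step_reward(prefix: Sequence[str], rollout: Rollout, gold_answer: str,
                    num_rollouts: int = 64, seed: int = 0) -> float:
    """Fraction of sampled completions from `prefix` that reach `gold_answer`."""
    rng = random.Random(seed)
    hits = sum(rollout(prefix, rng) == gold_answer for _ in range(num_rollouts))
    return hits / num_rollouts


def make_toy_policy(recovery_prob: float) -> Rollout:
    """Toy policy (assumption): recovers from a flawed prefix with `recovery_prob`."""
    def rollout(prefix: Sequence[str], rng: random.Random) -> str:
        if "ERROR" not in prefix:          # no flawed step: always correct
            return "42"
        return "42" if rng.random() < recovery_prob else "41"
    return rollout


flawed_prefix = ["step 1", "ERROR"]        # the last step is incorrect
strong = mce_step_reward(flawed_prefix, make_toy_policy(0.9), "42")
weak = mce_step_reward(flawed_prefix, make_toy_policy(0.1), "42")
print(f"strong policy reward: {strong:.2f}, weak policy reward: {weak:.2f}")
```

In this toy setting, a policy that often self-corrects assigns a high estimated reward to the incorrect step, which corresponds to the false-positive case the abstract describes.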

Related papers

P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering [51.04492568024515]
We introduce Probabilistic Process Supervision (P2S), a novel framework for fine-grained process rewards. P2S provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps.
arXiv Detail & Related papers (2026-01-28T14:35:20Z)
Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning [59.76691952347156]
Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs). Most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions. We propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL.
arXiv Detail & Related papers (2026-01-26T21:38:20Z)
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning [32.274434679047395]
Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). Standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces.
arXiv Detail & Related papers (2026-01-20T18:15:38Z)
GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning [34.42899160708635]
We introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. It achieves up to a 26% relative improvement in average performance on ProcessBench.
arXiv Detail & Related papers (2025-10-16T17:54:07Z)
Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking [78.69179041551014]
We propose an information-theoretic reward modeling framework based on the Information Bottleneck principle. We show that InfoRM filters out preference-irrelevant information to alleviate reward misgeneralization. We also introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape.
arXiv Detail & Related papers (2025-10-15T15:51:59Z)
Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards [40.905635870672945]
Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives: solutions that reach the correct final answer through an unsound reasoning process.
arXiv Detail & Related papers (2025-10-09T04:30:45Z)
Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
arXiv Detail & Related papers (2025-10-02T09:55:26Z)
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers [90.50039419576807]
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ during training. This choice carries a cost: it introduces false negatives (rejecting correct answers, FNs) and false positives (accepting incorrect ones, FPs).
arXiv Detail & Related papers (2025-10-01T13:56:44Z)
Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning [30.302863491794543]
Process Reward Models (PRMs) aim to guide step-by-step reasoning toward a final answer. Existing PRMs fail to capture inter-step dependencies, or struggle to align process rewards with the final outcome. We propose Conditional Reward Modeling that frames reasoning as a temporal process leading to a correct answer.
arXiv Detail & Related papers (2025-09-30T17:38:45Z)
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training [26.589591658693962]
Outcome Reward Models (ORMs) in RLVR are too coarse-grained to distinguish flawed reasoning within correct answers. Process Reward Models (PRMs) offer fine-grained guidance for intermediate steps. We introduce PRocess cOnsistency Filter (PROF) to harmonize noisy, fine-grained process rewards with accurate, coarse-grained outcome rewards.
arXiv Detail & Related papers (2025-09-03T15:28:51Z)
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [90.23629291067763]
A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs). To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?" We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL.
arXiv Detail & Related papers (2024-10-10T17:31:23Z)
Two-phase Pseudo Label Densification for Self-training based Domain Adaptation [93.03265290594278]
We propose a novel Two-phase Pseudo Label Densification framework, referred to as TPLD. In the first phase, we use sliding window voting to propagate the confident predictions, utilizing intrinsic spatial-correlations in the images. In the second phase, we perform a confidence-based easy-hard classification. To ease the training process and avoid noisy predictions, we introduce the bootstrapping mechanism to the original self-training loss.
arXiv Detail & Related papers (2020-12-09T02:35:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.