Related papers: Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

URL: http://arxiv.org/abs/2510.00915v2
Date: Fri, 17 Oct 2025 16:20:55 GMT
Title: Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
Authors: Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama
Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ during training. This choice carries a cost: it introduces false negatives (rejecting correct answers, FNs) and false positives (accepting incorrect ones, FPs).
Score: 90.50039419576807
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ during training. This choice carries a cost: it introduces \textit{false negatives} (rejecting correct answers, FNs) and \textit{false positives} (accepting incorrect ones, FPs). For instance, a rule-based checker may mark the correct fraction $\frac{12}{36}$ as wrong when compared against the canonical $\frac{1}{3}$ due to brittle parsing/equivalence rules (FN), while large language model (LLM) judges can be gamed by superficial cues or even a single adversarial token, yielding inflated correctness for wrong solutions (FP). We formalize verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates. From this abstraction, we derive two correction algorithms for verifier errors. The first is a \textit{backward} correction that de-biases the observed binary reward to recover an \textit{unbiased} estimator of the clean policy gradient. The second is a \textit{forward} correction that reweights score-function terms so that the expected update direction aligns with the \textit{clean gradient}; notably, it requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluate them on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward variant converges faster and remains stable under heavier noise. Finally, we show a practical appeal mechanism in which a lightweight LLM verifier estimates the FN rate online by rechecking rule-based negatives, outperforming other state-of-the-art contenders.
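The idea of de-biasing a binary reward observed through a noisy channel can be illustrated with a minimal sketch. It uses the standard noisy-label estimator under an assumed channel with known false-positive and false-negative rates; the function name `debias_reward` and its parameters are hypothetical, and the paper's actual backward correction may differ in form.

```python
def debias_reward(observed, fp_rate, fn_rate):
    """De-bias a binary reward seen through an asymmetric noisy channel.

    Assumed channel (not taken from the paper):
        P(observed = 1 | clean = 0) = fp_rate   (false positive)
        P(observed = 0 | clean = 1) = fn_rate   (false negative)
    Under it, E[observed | clean] = fp_rate + clean * (1 - fp_rate - fn_rate).
    Inverting this affine map yields a value whose expectation over the
    channel noise equals the clean reward.
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0:
        raise ValueError("requires fp_rate + fn_rate < 1")
    return (observed - fp_rate) / denom


def expected_corrected(clean, fp_rate, fn_rate):
    """Exact expectation of the corrected reward for a given clean label."""
    # Probability that the verifier reports 1 for this clean label.
    p_one = (1.0 - fn_rate) if clean == 1 else fp_rate
    return (p_one * debias_reward(1, fp_rate, fn_rate)
            + (1.0 - p_one) * debias_reward(0, fp_rate, fn_rate))
```

In this sketch, `expected_corrected(1, 0.1, 0.3)` and `expected_corrected(0, 0.1, 0.3)` recover the clean rewards 1 and 0 up to floating-point error, which is the unbiasedness property the abstract refers to. The corrected values themselves fall outside $\{0,1\}$, so variance grows as the sum of the two noise rates approaches 1.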

Related papers

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning [17.384089089363382]
We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches treat all incorrect rollouts within a group identically. We propose the Asymmetric Confidence-aware Error Penalty (ACE).
arXiv Detail & Related papers (2026-02-24T22:46:43Z)
PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering [71.15346406323827]
We introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification. We find that current verifiers frequently fail to detect derivation flaws. We propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME.
arXiv Detail & Related papers (2026-02-12T04:45:01Z)
Towards Robust Process Reward Modeling via Noise-aware Learning [33.1289107681179]
We propose a two-stage framework to mitigate noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge. In the training stage, we propose an Iterative Training framework that enables the PRM to progressively refine noisy labels.
arXiv Detail & Related papers (2026-01-19T06:03:58Z)
CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal [84.71254539482369]
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. We present CARE, a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures.
arXiv Detail & Related papers (2025-12-22T16:34:21Z)
AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens [9.127363793428119]
We show that short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments. We show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives.
arXiv Detail & Related papers (2025-12-19T09:22:11Z)
Hard Negative Sample-Augmented DPO Post-Training for Small Language Models [4.425580048633862]
We propose a lightweight and pragmatic post-training pipeline that targets structured errors under realistic compute budgets. We introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. Experiments show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO.
arXiv Detail & Related papers (2025-12-17T06:15:52Z)
LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss.
arXiv Detail & Related papers (2025-10-16T17:55:11Z)
ReSURE: Regularizing Supervision Unreliability for Multi-turn Dialogue Fine-tuning [72.05731026796335]
Multi-turn dialogue systems often suffer from degraded performance when exposed to low-quality data. We propose ReSURE, an adaptive learning method that dynamically down-weights unreliable supervision without explicit filtering. Experiments on both single-source and mixed-quality datasets show improved stability and response quality.
arXiv Detail & Related papers (2025-08-27T15:54:01Z)
Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening [36.81125165911328]
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. We investigate whether current reinforcement learning algorithms merely sharpen the base model's distribution around problems it can already solve. We show that unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem proving settings.
arXiv Detail & Related papers (2025-06-03T01:15:15Z)
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning [43.310209758380886]
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs). We decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR). We show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs.
arXiv Detail & Related papers (2025-06-02T06:10:54Z)
Technical report on label-informed logit redistribution for better domain generalization in low-shot classification with foundation models [3.938980910007962]
Confidence calibration is an emerging challenge in real-world decision systems based on foundation models. We propose a penalty, incorporated into the loss objective, that penalizes incorrect classifications whenever one is made during finetuning. We refer to it as the confidence misalignment penalty (CMP).
arXiv Detail & Related papers (2025-01-29T11:54:37Z)
Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion. Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities. We introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z)
Certified Adversarial Robustness Within Multiple Perturbation Bounds [38.3813286696956]
Randomized smoothing (RS) is a well-known certified defense against adversarial attacks. In this work, we aim to improve the certified adversarial robustness against multiple perturbation bounds simultaneously.
arXiv Detail & Related papers (2023-04-20T16:42:44Z)
Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent [97.64313409741614]
We propose to enforce a consistency property which states that predictions of the model on its own generated data are consistent across time. We show that our novel training objective yields state-of-the-art results for conditional and unconditional generation in CIFAR-10 and baseline improvements in AFHQ and FFHQ.
arXiv Detail & Related papers (2023-02-17T18:45:04Z)
WR-ONE2SET: Towards Well-Calibrated Keyphrase Generation [57.11538133231843]
Keyphrase generation aims to automatically generate short phrases summarizing an input document. The recently emerged ONE2SET paradigm generates keyphrases as a set and has achieved competitive performance. We propose WR-ONE2SET which extends ONE2SET with an adaptive instance-level cost Weighting strategy and a target Re-assignment mechanism.
arXiv Detail & Related papers (2022-11-13T09:56:24Z)
Training $\beta$-VAE by Aggregating a Learned Gaussian Posterior with a Decoupled Decoder [0.553073476964056]
Current practices in VAE training often result in a trade-off between the reconstruction fidelity and the continuity/disentanglement of the latent space. We present intuitions and a careful analysis of the antagonistic mechanism of the two losses, and propose a simple yet effective two-stage method for training a VAE. We evaluate the method using a medical dataset intended for 3D skull reconstruction and shape completion, and the results indicate promising generative capabilities of the VAE trained using the proposed method.
arXiv Detail & Related papers (2022-09-29T13:49:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.