PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
- URL: http://arxiv.org/abs/2506.10406v1
- Date: Thu, 12 Jun 2025 06:59:35 GMT
- Title: PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
- Authors: Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, Lin Yan,
- Abstract summary: Policy as Generative Verifier (PAG) is a framework that empowers Large Language Models to self-correct by alternating between policy and verifier roles.<n>It alleviates model collapse and jointly enhances both reasoning and verification abilities.
- Score: 18.771754895027616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG's dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
Related papers
- Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems.<n>We present FactArena, a fully automated arena-style evaluation framework.<n>Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z) - Fact-Checking with Large Language Models via Probabilistic Certainty and Consistency [7.806516365113592]
Large language models (LLMs) are increasingly used in applications requiring factual accuracy.<n>While fact-checking can mitigate these errors, existing methods typically retrieve external evidence indiscriminately.<n>We introduce Probabilistic Certainty and Consistency (PCC), a framework that estimates factual confidence.
arXiv Detail & Related papers (2026-01-05T21:57:41Z) - When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers [11.937771430269201]
We present a systematic study across 37 large language models (LLMs)<n>We compare self-verification with verification within the same family and across different families.<n>We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability.
arXiv Detail & Related papers (2025-12-02T00:51:14Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning [29.778703252962092]
Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs)<n>We develop a novel test-time reward mechanism that operates without external supervision.
arXiv Detail & Related papers (2025-10-20T07:53:51Z) - Tractable Asymmetric Verification for Large Language Models via Deterministic Replicability [0.6117371161379209]
The landscape of Large Language Models (LLMs) shifts rapidly towards dynamic, multi-agent systems.<n>This paper proposes a verification framework that achieves tractable asymmetric effort.<n>We show that targeted verification can be over 12 times faster than full regeneration.
arXiv Detail & Related papers (2025-09-14T03:30:06Z) - GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning [12.724393910603299]
We introduce the Generative Multimodal Process Reward Model (GM-PRM)<n>Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step.<n>We show that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks.
arXiv Detail & Related papers (2025-08-06T05:10:29Z) - Boosting LLM Reasoning via Spontaneous Self-Correction [43.4980625253775]
One of the approaches for improving math reasoning is self-correction.<n>Existing self-correction approaches treat corrections as standalone post-generation refinements.<n>We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
arXiv Detail & Related papers (2025-06-07T21:23:00Z) - Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning.<n>A prevalent issue is superficial self-reflection'', where models fail to robustly verify their own outputs.<n>We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this.
arXiv Detail & Related papers (2025-05-19T17:59:31Z) - AlignRAG: Leveraging Critique Learning for Evidence-Sensitive Retrieval-Augmented Reasoning [61.28113271728859]
RAG has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>Standard RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>In this work, we reinterpret RAG as Retrieval-Augmented Reasoning and identify a central but underexplored problem: textitReasoning Misalignment.
arXiv Detail & Related papers (2025-04-21T04:56:47Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.<n>We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.<n>Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification [53.80183105328448]
Refine via Intrinsic Self-Verification (ReVISE) is an efficient framework that enables LLMs to self-correct their outputs through self-verification.<n>Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
arXiv Detail & Related papers (2025-02-20T13:50:02Z) - Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models [10.449015816015566]
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference.<n>We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap.<n>We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance.
arXiv Detail & Related papers (2024-12-03T18:47:26Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs)
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z) - Derailer-Rerailer: Adaptive Verification for Efficient and Reliable Language Model Reasoning [11.765298236504155]
Derailer-Rerailer is a novel framework that balances reasoning accuracy and computational efficiency.<n>Our framework achieves significant accuracy improvements (8-11% across various reasoning tasks) while maintaining 2-3 times better efficiency than existing verification methods.
arXiv Detail & Related papers (2024-08-25T21:20:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.