Learning to Self-Verify Makes Language Models Better Reasoners
- URL: http://arxiv.org/abs/2602.07594v1
- Date: Sat, 07 Feb 2026 15:49:06 GMT
- Title: Learning to Self-Verify Makes Language Models Better Reasoners
- Authors: Yuxin Chen, Yu Wang, Yi Zhang, Ziang Ye, Zhengzhou Cai, Yaorui Shi, Qi Gu, Hui Su, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua
- Abstract summary: Large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. Despite powerful generation ability, LLMs remain weak at verifying their own answers. We show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification.
- Score: 65.75109817173315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.
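The abstract describes generation and self-verification as two separately optimized objectives but does not give the reward design. The sketch below is only one plausible reading, under stated assumptions: binary exact-match rewards, a verification task in which the model judges a candidate answer as correct or incorrect, and per-task reward lists that are kept separate. All names (`GenerationSample`, `VerificationSample`, `multitask_rewards`) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class GenerationSample:
    predicted_answer: str   # final answer produced by the model
    reference_answer: str   # ground-truth answer


@dataclass
class VerificationSample:
    candidate_answer: str   # answer the model is asked to judge
    reference_answer: str   # ground-truth answer
    predicted_verdict: bool  # model's judgement: True means "candidate is correct"


def normalize(answer: str) -> str:
    # Assumption: answers are compared as trimmed, lower-cased strings.
    return answer.strip().lower()


def generation_reward(sample: GenerationSample) -> float:
    # 1.0 when the generated answer matches the reference, else 0.0.
    matches = normalize(sample.predicted_answer) == normalize(sample.reference_answer)
    return 1.0 if matches else 0.0


def verification_reward(sample: VerificationSample) -> float:
    # 1.0 when the model's verdict agrees with the candidate's true correctness.
    candidate_is_correct = (
        normalize(sample.candidate_answer) == normalize(sample.reference_answer)
    )
    return 1.0 if sample.predicted_verdict == candidate_is_correct else 0.0


def multitask_rewards(gen_samples, ver_samples):
    # Rewards stay separate per task so each objective can be optimized on its own.
    return {
        "generation": [generation_reward(s) for s in gen_samples],
        "verification": [verification_reward(s) for s in ver_samples],
    }


if __name__ == "__main__":
    gen = [GenerationSample("42", " 42 "), GenerationSample("41", "42")]
    ver = [
        VerificationSample("41", "42", predicted_verdict=False),
        VerificationSample("42", "42", predicted_verdict=False),
    ]
    print(multitask_rewards(gen, ver))
```

Keeping the two reward lists apart, rather than summing them into one scalar, mirrors the abstract's wording of "independent but complementary objectives"; how the paper actually weights or mixes the two tasks is not stated here.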
Related papers
- Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models [23.128973540926552]
Endogenous Reprompting transforms the model's understanding into an explicit generative reasoning step. We show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality.
arXiv Detail & Related papers (2026-01-28T06:54:36Z)
- Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs [46.43090277452948]
We show that unified MLLMs exhibit an internal gap with understanding outperforming generation. This finding motivates us to propose a simple yet effective internal gap-based self-improvement framework. We empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training.
arXiv Detail & Related papers (2025-07-22T14:56:39Z)
- Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement [101.77467538102924]
Large reasoning models (LRMs) exhibit overthinking, which hinders efficiency and inflates inference cost. We propose two lightweight methods to enhance LRM efficiency. First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction. Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity.
arXiv Detail & Related papers (2025-06-18T17:18:12Z)
- Can Large Reasoning Models Self-Train? [51.0277533541394]
We use majority voting as a simple self-feedback mechanism to study whether self-training can be sustained within reinforcement learning. We find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better quality feedback for the next RL iteration. Yet our analysis also reveals a critical limitation of such a self-training paradigm: prolonged RL with self-reward leads to reward hacking, resulting in sudden and complete performance collapse.
arXiv Detail & Related papers (2025-05-27T17:16:00Z)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
- Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution [61.80716438091887]
GenDiE (Generate, Discriminate, Evolve) is a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. By treating each sentence in a response as an independent optimization unit, GenDiE effectively addresses the limitations of previous approaches. Experiments on ASQA (in-domain LFQA) and ConFiQA datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness.
arXiv Detail & Related papers (2025-03-03T16:08:33Z)
- ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification [53.80183105328448]
Refine via Intrinsic Self-Verification (ReVISE) is an efficient framework that enables LLMs to self-correct their outputs through self-verification. Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
arXiv Detail & Related papers (2025-02-20T13:50:02Z)
- Evolving LLMs' Self-Refinement Capability via Synergistic Training-Inference Optimization [53.93621974137829]
Self-Refinement refers to a model's ability to revise its own responses to produce improved outputs. EVOLVE is a framework for eliciting and tracking the evolution of Self-Refinement through iterative training. We demonstrate the potential of leveraging Self-Refinement to achieve broader Self-Improvement of intrinsic model abilities.
arXiv Detail & Related papers (2025-02-08T15:21:55Z)
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models [10.449015816015566]
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance.
arXiv Detail & Related papers (2024-12-03T18:47:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.