Related papers: ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation

ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation

URL: http://arxiv.org/abs/2602.05472v1
Date: Thu, 05 Feb 2026 09:20:23 GMT
Title: ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation
Authors: Yiwen Duan, Jing Ye, Xinpei Zhao,
Abstract summary: We introduce textbfALIVE (emphAdrial Learning with Instructive Verbal Evaluation), a hands-free alignment framework.<n>By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to internalize evaluative criteria directly from raw corpora.<n>With identical data and compute, ALIVE achieves markedly improved cross-domain generalization, and higher self-correction rates.
Score: 4.265094703231012
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The quest for expert-level reasoning in Large Language Models (LLMs) has been hampered by a persistent \textit{reward bottleneck}: traditional reinforcement learning (RL) relies on scalar rewards that are \textbf{costly} to scale, \textbf{brittle} across domains, and \textbf{blind} to the underlying logic of a solution. This reliance on external, impoverished signals prevents models from developing a deep, self-contained understanding of reasoning principles. We introduce \textbf{ALIVE} (\emph{Adversarial Learning with Instructive Verbal Evaluation}), a hands-free alignment framework that moves beyond scalar reward optimization toward intrinsic reasoning acquisition. Grounded in the principle of \emph{Cognitive Synergy}, ALIVE unifies problem posing, solving, and judging within a single policy model to internalize the logic of correctness. By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to internalize evaluative criteria directly from raw corpora, effectively transforming external critiques into an endogenous reasoning faculty. Empirical evaluations across mathematical reasoning, code generation, and general logical inference benchmarks demonstrate that ALIVE consistently mitigates reward signal limitations. With identical data and compute, it achieves accuracy gains, markedly improved cross-domain generalization, and higher self-correction rates. These results indicate that the reasoning trinity fosters a self-sustaining trajectory of capability growth, positioning ALIVE as a scalable foundation for general-purpose reasoning alignment without human-in-the-loop supervision.

Related papers

LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval [74.72139580745511]
LaSER is a novel self-distillation framework that internalizes explicit reasoning into the latent space of retrievers.<n>Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
arXiv Detail & Related papers (2026-03-02T04:11:18Z)
Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach.<n>REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful.<n>Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z)
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data [16.065264121785294]
We introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning.<n>NRT reframes the training problem by treating the reasoning process as a latent variable.<n>NRT achieves state-of-the-art performance among verifier-free methods.
arXiv Detail & Related papers (2026-02-12T04:15:46Z)
Reflective Confidence: Correcting Reasoning Flaws via Online Self-Correction [14.164508061248775]
Large language models (LLMs) have achieved strong performance on complex reasoning tasks using techniques such as chain-of-thought and self-consistency.<n>We propose reflective confidence, a novel reasoning framework that transforms low-confidence signals from termination indicators into reflection triggers.<n> Experiments on mathematical reasoning benchmarks, including AIME 2025, demonstrate significant accuracy improvements over advanced early-stopping baselines at comparable computational cost.
arXiv Detail & Related papers (2025-12-21T05:35:07Z)
Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning [47.867294403474176]
We propose Stepwise Think-Critique, a unified framework that interleaves reasoning and self-critique at each step within a single model.<n> STC is trained with a hybrid reinforcement learning objective combining reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation.
arXiv Detail & Related papers (2025-12-17T18:15:17Z)
Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization [22.301471821413816]
Chain-of-Thought (CoT) reasoning is a critical capability for large language models (LLMs)<n>We propose a novel approach for elic- iting CoT reasoning from base LLMs through hidden state manipulation grounded in conditional generation.
arXiv Detail & Related papers (2025-11-24T13:55:57Z)
Cognitive Inception: Agentic Reasoning against Visual Deceptions by Injecting Skepticism [81.39177645864757]
We propose textbfInception, a fully reasoning-based agentic reasoning framework to conduct authenticity verification by injecting skepticism.<n>To the best of our knowledge, this is the first fully reasoning-based framework against AIGC visual deceptions.
arXiv Detail & Related papers (2025-11-21T05:13:30Z)
In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback [38.915062716409686]
InTRO is a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning.<n>InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model.<n>Its chains of thought are notably more concise, exhibiting reduced verbosity.
arXiv Detail & Related papers (2025-11-13T01:47:06Z)
Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning [27.42733470720954]
We propose a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse.<n>Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance.
arXiv Detail & Related papers (2025-11-12T11:34:19Z)
From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization [62.07990937720985]
Dimension-level Reward Model (DRM) is a new supervision framework for Large Language Models.<n>DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions.<n> Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability.
arXiv Detail & Related papers (2025-10-13T14:29:15Z)
Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning.<n>Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood.<n>We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z)
PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty.<n>It learns to compress reasoning length in accordance with scene complexity and predictive confidence.<n> Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z)
Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA)<n>We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.