Learning Robust Reasoning through Guided Adversarial Self-Play
- URL: http://arxiv.org/abs/2602.00173v1
- Date: Fri, 30 Jan 2026 02:23:31 GMT
- Title: Learning Robust Reasoning through Guided Adversarial Self-Play
- Authors: Shuozhe Li, Vaishnav Tadiparthi, Kwonjoon Lee, Nakul Agarwal, Hossein Nourkhiz Mahjoub, Ehsan Moradi Pari, Lizhang Chen, Amy Zhang, Liu Leqi
- Abstract summary: We introduce GASP (Guided Adversarial Self-Play), a robustification method that explicitly trains detect-and-repair capabilities. Without human labels or external teachers, GASP forms an adversarial self-play game within a single model. In-distribution repair guidance, an imitation term on self-generated repairs, increases recovery probability while preserving previously acquired capabilities.
- Score: 32.87933476043378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning from verifiable rewards (RLVR) produces strong reasoning models, yet they can fail catastrophically when the conditioning context is fallible (e.g., corrupted chain-of-thought, misleading partial solutions, or mild input perturbations), since standard RLVR optimizes final-answer correctness only under clean conditioning. We introduce GASP (Guided Adversarial Self-Play), a robustification method that explicitly trains detect-and-repair capabilities using only outcome verification. Without human labels or external teachers, GASP forms an adversarial self-play game within a single model: a polluter learns to induce failure via locally coherent corruptions, while an agent learns to diagnose and recover under the same corrupted conditioning. To address the scarcity of successful recoveries early in training, we propose in-distribution repair guidance, an imitation term on self-generated repairs that increases recovery probability while preserving previously acquired capabilities. Across four open-weight models (1.5B--8B), GASP transforms strong-but-brittle reasoners into robust ones that withstand misleading and perturbed context while often improving clean accuracy. Further analysis shows that adversarial corruptions induce an effective curriculum, and in-distribution guidance enables rapid recovery learning with minimal representational drift.
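The self-play dynamics sketched in the abstract (a polluter that corrupts conditioning, an agent rewarded only on outcome verification, plus an imitation bonus on the agent's own successful repairs) can be caricatured numerically. The toy below is an illustrative assumption, not the paper's algorithm: `p_repair` and `p_corrupt` stand in for the agent's and polluter's learned policies, and all update rules and constants are invented for illustration.

```python
import random

def gasp_toy(steps=20000, seed=0):
    """Toy sketch of GASP-style self-play dynamics (illustrative only).

    p_repair  - agent's probability of recovering from a corruption
    p_corrupt - polluter's probability of landing a damaging corruption
    """
    rng = random.Random(seed)
    p_repair = 0.05   # recoveries are scarce early in training
    p_corrupt = 0.5
    lr = 0.01
    for _ in range(steps):
        if rng.random() < p_corrupt:          # polluter corrupts the context
            success = rng.random() < p_repair  # agent attempts a repair
            # Outcome-only reward: nudge repair skill toward the success signal.
            p_repair += lr * ((1.0 if success else 0.0) - p_repair)
            if success:
                # In-distribution repair guidance: an extra imitation bonus
                # on the agent's own successful repair.
                p_repair = min(1.0, p_repair + lr)
            # Adversarial polluter: corrupt more when the agent fails.
            p_corrupt += lr * ((0.0 if success else 1.0) - p_corrupt)
            p_corrupt = min(0.95, max(0.05, p_corrupt))
    return p_repair, p_corrupt
```

Without the imitation bonus, the outcome-only update has no net drift when successes are rare; the bonus gives every self-generated success an extra positive push, which is the scarce-recovery problem the guidance term is meant to address.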
Related papers
- QuAIL: Quality-Aware Inertial Learning for Robust Training under Data Corruption [7.630511612007769]
We present QuAIL, a quality-informed training mechanism that incorporates feature reliability priors directly into the learning process. We show that QuAIL consistently improves average performance over neural baselines under both random and value-dependent corruption.
arXiv Detail & Related papers (2026-02-03T16:06:30Z) - Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning [0.3823356975862005]
We propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems. Our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
arXiv Detail & Related papers (2026-01-28T18:29:21Z) - Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery [25.522943543082363]
We propose a meta-cognitive reinforcement learning framework that enables an agent to assess, regulate, and recover its learning behavior. The proposed method introduces a meta-trust variable driven by Value Prediction Error Stability (VPES), which modulates learning dynamics via fail-safe regulation and gradual trust recovery.
arXiv Detail & Related papers (2026-01-28T02:43:03Z) - CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal [84.71254539482369]
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. We present CARE, a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures.
arXiv Detail & Related papers (2025-12-22T16:34:21Z) - Large Reasoning Models Learn Better Alignment from Flawed Thinking [56.08883934423522]
Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer. We propose RECAP, a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories.
arXiv Detail & Related papers (2025-10-01T14:15:43Z) - Can Large Reasoning Models Self-Train? [51.0277533541394]
We use majority voting as a simple self-feedback mechanism to study whether self-training can be sustained within reinforcement learning. We find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better quality feedback for the next RL iteration. Yet our analysis also reveals a critical limitation of such a self-training paradigm: prolonged RL with self-reward leads to reward hacking, resulting in sudden and complete performance collapse.
arXiv Detail & Related papers (2025-05-27T17:16:00Z) - Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this issue.
arXiv Detail & Related papers (2025-05-19T17:59:31Z) - LaMOuR: Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning [16.093659272414527]
We introduce Language Models for Out-of-Distribution Recovery (LaMOuR), which enables recovery learning without relying on uncertainty estimation. LaMOuR generates dense reward codes that guide the agent back to a state where it can successfully perform its original task. Experimental results show that LaMOuR substantially enhances recovery efficiency across diverse locomotion tasks.
arXiv Detail & Related papers (2025-03-21T13:20:39Z) - CARIL: Confidence-Aware Regression in Imitation Learning for Autonomous Driving [0.0]
End-to-end vision-based imitation learning has demonstrated promising results in autonomous driving. Traditional approaches rely on either regression-based models, which provide precise control but lack confidence estimation, or classification-based models, which offer confidence scores but suffer from reduced precision due to discretization. We introduce a dual-head neural network architecture that integrates both regression and classification heads to improve decision reliability in imitation learning.
arXiv Detail & Related papers (2025-03-02T08:19:02Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between the predicted confidence and the actual performance.
We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjustment trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - Annealing Self-Distillation Rectification Improves Adversarial Training [0.10241134756773226]
We analyze the characteristics of robust models and identify that robust models tend to produce smoother and better-calibrated outputs.
We propose Annealing Self-Distillation Rectification (ADR), which generates soft labels as a better guidance mechanism.
We demonstrate the efficacy of ADR through extensive experiments and strong performance across datasets.
arXiv Detail & Related papers (2023-05-20T06:35:43Z) - Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z) - Corruption-robust exploration in episodic reinforcement learning [76.19192549843727]
We study multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system.
Our framework yields efficient algorithms which attain near-optimal regret in the absence of corruptions.
Notably, our work provides the first sublinear regret guarantee that accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.
arXiv Detail & Related papers (2019-11-20T03:49:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.