Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery
- URL: http://arxiv.org/abs/2601.20193v1
- Date: Wed, 28 Jan 2026 02:43:03 GMT
- Title: Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery
- Authors: Zhipeng Zhang, Wenting Ma, Kai Li, Meng Guo, Lei Yang, Wei Yu, Hongji Cui, Yichen Zhang, Mo Zhang, Jinzhe Lin, Zhenjie Yao
- Abstract summary: We propose a meta-cognitive reinforcement learning framework that enables an agent to assess, regulate, and recover its learning behavior. The proposed method introduces a meta-trust variable driven by Value Prediction Error Stability (VPES), which modulates learning dynamics via fail-safe regulation and gradual trust recovery.
- Score: 25.522943543082363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robust reinforcement learning methods typically focus on suppressing unreliable experiences or corrupted rewards, but they lack the ability to reason about the reliability of their own learning process. As a result, such methods often either overreact to noise by becoming overly conservative or fail catastrophically when uncertainty accumulates. In this work, we propose a meta-cognitive reinforcement learning framework that enables an agent to assess, regulate, and recover its learning behavior based on internally estimated reliability signals. The proposed method introduces a meta-trust variable driven by Value Prediction Error Stability (VPES), which modulates learning dynamics via fail-safe regulation and gradual trust recovery. Experiments on continuous-control benchmarks with reward corruption demonstrate that recovery-enabled meta-cognitive control achieves higher average returns and significantly reduces late-stage training failures compared to strong robustness baselines.
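As a rough illustration of the mechanism the abstract describes, the sketch below implements a meta-trust variable under stated assumptions: VPES is taken here to mean low variance over a sliding window of value prediction errors, fail-safe regulation is modeled as a multiplicative trust cut, and recovery as a slow additive climb back toward full trust. The class name, window size, and all thresholds are hypothetical; the paper's exact definitions may differ.

```python
from collections import deque


class MetaTrust:
    """Hypothetical sketch of a VPES-driven meta-trust variable.

    Assumes VPES corresponds to the variance of recent value
    prediction errors staying below a threshold; the paper's
    actual formulation may differ.
    """

    def __init__(self, window=50, fail_threshold=4.0,
                 drop=0.5, recovery_rate=0.01):
        self.errors = deque(maxlen=window)  # sliding window of VPEs
        self.trust = 1.0                    # meta-trust in [0, 1]
        self.fail_threshold = fail_threshold
        self.drop = drop
        self.recovery_rate = recovery_rate

    def update(self, value_prediction_error):
        """Record one value prediction error and update trust."""
        self.errors.append(value_prediction_error)
        if len(self.errors) < 2:
            return self.trust
        mean = sum(self.errors) / len(self.errors)
        var = sum((e - mean) ** 2 for e in self.errors) / len(self.errors)
        if var > self.fail_threshold:
            # fail-safe regulation: cut trust sharply when VPEs are unstable
            self.trust *= self.drop
        else:
            # gradual trust recovery: climb slowly back toward full trust
            self.trust = min(1.0, self.trust + self.recovery_rate)
        return self.trust

    def effective_lr(self, base_lr):
        # trust modulates learning dynamics, e.g. by scaling the step size
        return base_lr * self.trust
```

An agent would then scale its update step by `effective_lr(base_lr)`, so stretches of corrupted reward shrink the learning rate instead of destabilizing training, and the rate recovers once value prediction errors stabilize again.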
Related papers
- Learning Robust Reasoning through Guided Adversarial Self-Play [32.87933476043378]
We introduce GASP (Guided Adversarial Self-Play), a robustification method that explicitly trains detect-and-repair capabilities. Without human labels or external teachers, GASP forms an adversarial self-play game within a single model. In-distribution repair guidance, an imitation term on self-generated repairs, increases recovery probability while preserving previously acquired capabilities.
arXiv Detail & Related papers (2026-01-30T02:23:31Z) - Learning to Trust Experience: A Monitor-Trust-Regulator Framework for Learning under Unobservable Feedback Reliability [24.97566911521709]
We study Epistemic Identifiability under Unobservable Reliability (EIUR). Standard robust learning can converge stably yet form high-confidence, systematically wrong beliefs. We propose metacognitive regulation as a practical response: a second, introspective control loop that infers experience credibility from endogenous evidence in the learner's internal dynamics.
arXiv Detail & Related papers (2026-01-14T07:52:14Z) - Parent-Guided Adaptive Reliability (PGAR): A Behavioural Meta-Learning Framework for Stable and Trustworthy AI [0.0]
Parent-Guided Adaptive Reliability (PGAR) is a lightweight behavioural meta-learning framework. It adds a supervisory "parent" layer on top of a standard learner to improve stability, calibration, and recovery under disturbances. PGAR functions as a plug-in reliability layer for existing optimization and learning pipelines, supporting interpretable traces in safety-relevant settings.
arXiv Detail & Related papers (2026-01-07T06:02:34Z) - Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA is complemented by a set of metrics designed to go beyond point-in-time performance. The fragility in SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z) - CARIL: Confidence-Aware Regression in Imitation Learning for Autonomous Driving [0.0]
End-to-end vision-based imitation learning has demonstrated promising results in autonomous driving. Traditional approaches rely on either regression-based models, which provide precise control but lack confidence estimation, or classification-based models, which offer confidence scores but suffer from reduced precision due to discretization. We introduce a dual-head neural network architecture that integrates both regression and classification heads to improve decision reliability in imitation learning.
arXiv Detail & Related papers (2025-03-02T08:19:02Z) - Temporal-Difference Variational Continual Learning [77.92320830700797]
We propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations. Our approach effectively mitigates Catastrophic Forgetting, outperforming strong Variational CL methods.
arXiv Detail & Related papers (2024-10-10T10:58:41Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between a model's predicted confidence and its actual performance.
We introduce Dynamic Regularization (DReg) which aims to learn what should be learned during training thereby circumventing the confidence adjusting trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning [11.084321518414226]
We adapt existing importance-sampling ratio estimation techniques for off-policy evaluation to drastically improve the stability and efficiency of so-called hindsight policy methods.
Our hindsight distribution correction facilitates stable, efficient learning across a broad range of environments where credit assignment plagues baseline methods.
arXiv Detail & Related papers (2023-07-21T20:54:52Z) - Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Closing the Closed-Loop Distribution Shift in Safe Imitation Learning [80.05727171757454]
We treat safe optimization-based control strategies as experts in an imitation learning problem.
We train a learned policy that can be cheaply evaluated at run-time and that provably satisfies the same safety guarantees as the expert.
arXiv Detail & Related papers (2021-02-18T05:11:41Z) - An evaluation of word-level confidence estimation for end-to-end automatic speech recognition [70.61280174637913]
We investigate confidence estimation for end-to-end automatic speech recognition (ASR).
We provide an extensive benchmark of popular confidence methods on four well-known speech datasets.
Our results suggest a strong baseline can be obtained by scaling the logits by a learnt temperature.
arXiv Detail & Related papers (2021-01-14T09:51:59Z)
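The temperature-scaling baseline mentioned in the entry above can be sketched as follows. This is a minimal illustration only: the paper fits a learnt temperature on held-out data, whereas here a fixed value is chosen by hand to show the effect on confidence.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling of the logits.

    T > 1 softens the distribution (lower peak confidence);
    T < 1 sharpens it. Shifting by the max is for numerical stability.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# hypothetical per-word logits from an end-to-end ASR model
logits = [2.0, 1.0, 0.5]
conf_raw = max(softmax(logits))                    # uncalibrated confidence
conf_calibrated = max(softmax(logits, temperature=2.0))
# a higher temperature lowers the top-class confidence,
# which helps when the raw model is overconfident
```

In the benchmarked setting, the single temperature parameter would be fit by minimizing negative log-likelihood on a held-out set rather than chosen manually.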
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.