Related papers: Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models

URL: http://arxiv.org/abs/2510.21978v1
Date: Fri, 24 Oct 2025 19:08:48 GMT
Title: Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models
Authors: Hoang Phan, Xianjun Yang, Kevin Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei,
Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning. We propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge preservation. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning.
Score: 33.214586668992965
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, where models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are calculated on the current task, so they do not guarantee that broader knowledge is retained. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training focus each objective should receive. To address this, we propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts in an online manner using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks based on Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.
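
Illustrative sketch: the abstract describes reweighting objectives online from short-horizon signals of convergence and instability. The Python below is only one rough way such a scheme could look, not the paper's formula. The class name ObjectiveReweighter, the window and temperature parameters, the score (headroom plus standard deviation), and the assumption that rewards lie in [0, 1] are all hypothetical choices made for illustration.

```python
import math
from collections import deque


class ObjectiveReweighter:
    """Hypothetical online reweighter; NOT the RECAP formula from the paper."""

    def __init__(self, objectives, window=8, temperature=0.5):
        self.temperature = temperature
        # Short-horizon history: only the last `window` rewards per objective.
        self.history = {name: deque(maxlen=window) for name in objectives}

    def update(self, rewards):
        """Record the latest mean reward (assumed in [0, 1]) per objective."""
        for name, value in rewards.items():
            self.history[name].append(float(value))

    def weights(self):
        """Return weights that sum to 1 across objectives."""
        scores = {}
        for name, values in self.history.items():
            if not values:
                scores[name] = 0.0  # no signal yet: neutral score
                continue
            mean = sum(values) / len(values)
            std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            # Assumed score: headroom (1 - mean) marks underperforming
            # objectives, std marks volatile ones; saturated, stable
            # objectives score near zero.
            scores[name] = (1.0 - mean) + std
        # Numerically stable softmax over the scores.
        top = max(scores.values())
        exps = {n: math.exp((s - top) / self.temperature) for n, s in scores.items()}
        total = sum(exps.values())
        return {n: e / total for n, e in exps.items()}


reweighter = ObjectiveReweighter(["math", "perception"])
for step in range(4):
    # "math" is saturated and stable; "perception" is low and oscillating.
    reweighter.update({"math": 0.95, "perception": 0.2 if step % 2 == 0 else 0.6})
print(reweighter.weights())
```

In a training loop, weights like these would typically scale each objective's loss term before the terms are summed; how the paper combines them is not stated in the abstract.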

Related papers

Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z)
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data [16.065264121785294]
We introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning. NRT reframes the training problem by treating the reasoning process as a latent variable. NRT achieves state-of-the-art performance among verifier-free methods.
arXiv Detail & Related papers (2026-02-12T04:15:46Z)
Reinforcement Learning with Backtracking Feedback [12.680874918250069]
We introduce Reinforcement Learning with Backtracking Feedback (RLBF). This framework advances upon prior methods, such as BSAFE. We show that RLBF significantly reduces attack success rates across diverse benchmarks and model scales.
arXiv Detail & Related papers (2026-02-09T08:23:19Z)
Probing RLVR training instability through the lens of objective-level hacking [46.64585260377202]
We introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.
arXiv Detail & Related papers (2026-02-01T08:55:27Z)
Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration [8.839121572048018]
We propose RAPO, an algorithm to promote broader yet focused exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset. Results show that RAPO consistently improves problem-solving performance.
arXiv Detail & Related papers (2025-10-04T16:22:19Z)
ExGRPO: Learning to Reason from Experience [82.83309610498446]
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. Standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value.
arXiv Detail & Related papers (2025-10-02T17:31:30Z)
Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
arXiv Detail & Related papers (2025-10-02T09:55:26Z)
CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models [85.315711639214]
We introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. For the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses.
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward [57.56453588632619]
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance. This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. We argue that standard RLVR objectives lack a crucial mechanism for knowledge retention.
arXiv Detail & Related papers (2025-09-09T06:34:32Z)
RL as Regressor: A Reinforcement Learning Approach for Function Approximation [0.0]
We propose framing regression as a Reinforcement Learning (RL) problem. We demonstrate this by treating a model's prediction as an action and defining a custom reward signal based on the prediction error. We show that the RL framework not only successfully solves the regression problem but also offers enhanced flexibility in defining objectives and guiding the learning process.
arXiv Detail & Related papers (2025-07-31T21:39:24Z)
Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training [36.69514399442043]
This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks.
arXiv Detail & Related papers (2025-07-07T18:17:06Z)
Temporal-Difference Variational Continual Learning [77.92320830700797]
We propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations. Our approach effectively mitigates Catastrophic Forgetting, outperforming strong Variational CL methods.
arXiv Detail & Related papers (2024-10-10T10:58:41Z)
SAMBO-RL: Shifts-aware Model-based Offline Reinforcement Learning [9.88109749688605]
Model-based offline reinforcement learning trains policies using pre-collected datasets and learned environment models. This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. We introduce Shifts-aware Model-based Offline Reinforcement Learning (SAMBO-RL), a practical framework that efficiently trains classifiers to approximate SAR for policy optimization.
arXiv Detail & Related papers (2024-08-23T04:25:09Z)
Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization. A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR. For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.