Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
- URL: http://arxiv.org/abs/2510.18874v1
- Date: Tue, 21 Oct 2025 17:59:41 GMT
- Title: Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
- Authors: Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen,
- Abstract summary: Adapting language models to new tasks via post-training carries the risk of degrading existing capabilities.<n>We compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL)<n>RL leads to less forgetting than SFT while achieving comparable or higher target task performance.
- Score: 40.80967570661867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
Related papers
- On Discovering Algorithms for Adversarial Imitation Learning [28.812210809286086]
We present emphDiscovered Adversarial Imitation Learning (DAIL), the first meta-learnt AIL algorithm.<n>We show that DAIL generalises across unseen environments and policy optimization algorithms.<n>We also analyse why DAIL leads to more stable training, offering novel insights into the role of RA functions in the stability of AIL.
arXiv Detail & Related papers (2025-10-01T14:02:05Z) - One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient [16.05489579792086]
We introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method.<n>OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory.<n>Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs.
arXiv Detail & Related papers (2025-09-30T14:25:56Z) - Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends [64.71326476563213]
Off-policy reinforcement learning for large language models (LLMs) is attracting growing interest.<n>We present a first-principles derivation for grouprelative REINFORCE without assuming a specific training data distribution.<n>This perspective yields two general principles for adapting REINFORCE to off-policy settings.
arXiv Detail & Related papers (2025-09-29T02:34:54Z) - A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning [45.19254609437857]
Online reinforcement learning (RL) excels in complex, safety-critical domains but suffers from sample inefficiency, training instability, and limited interpretability.<n>Data attribution provides a principled way to trace model behavior back to training samples.<n>We propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates.
arXiv Detail & Related papers (2025-05-25T19:25:57Z) - Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z) - Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data [17.991833729722288]
We propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL)
Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function.
We provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.
arXiv Detail & Related papers (2024-03-18T14:51:19Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.