Mitigating Catastrophic Forgetting in Continual Learning through Model Growth
- URL: http://arxiv.org/abs/2509.01213v1
- Date: Mon, 01 Sep 2025 07:51:31 GMT
- Title: Mitigating Catastrophic Forgetting in Continual Learning through Model Growth
- Authors: Ege Süalp, Mina Rezaei
- Abstract summary: Catastrophic forgetting is a significant challenge in continual learning, in which a model loses prior knowledge when it is fine-tuned on new tasks. This problem is particularly critical for large language models undergoing continual learning, as retaining performance across diverse domains is important for their general utility. We evaluate whether growth-based models can retain previously learned capabilities more effectively across a sequence of fine-tuning tasks involving domain knowledge, reasoning, reading comprehension, and bias.
- Score: 3.027322177899458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Catastrophic forgetting is a significant challenge in continual learning, in which a model loses prior knowledge when it is fine-tuned on new tasks. This problem is particularly critical for large language models (LLMs) undergoing continual learning, as retaining performance across diverse domains is important for their general utility. In this paper, we explore model growth, a promising strategy that leverages smaller models to expedite and structure the training of larger ones, as a way to mitigate catastrophic forgetting. Although growth-based pretraining, particularly via transformer stacking, has shown promise in accelerating convergence, its impact on forgetting remains under-explored. Therefore, we evaluate whether growth-based models can retain previously learned capabilities more effectively across a sequence of fine-tuning tasks involving domain knowledge, reasoning, reading comprehension, and bias. Our findings show that both models -- one trained with growth (Stack LLM) and one without (LLM) -- exhibit improvements in domain knowledge. However, reasoning and reading comprehension degrade over time, indicating signs of catastrophic forgetting. Stack LLM consistently shows less degradation, especially in reading comprehension, suggesting enhanced retention capabilities. Interestingly, in bias evaluation, the baseline LLM becomes progressively more neutral with continued fine-tuning, while Stack LLM maintains a steady bias ratio around 60-61%. These results indicate that growth-based pretraining may deliver modest improvements in resisting catastrophic forgetting, though trade-offs remain in handling social biases.
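The growth strategy discussed in the abstract builds a deeper model by stacking the layers of an already-trained shallower one before continuing pretraining. Below is a minimal sketch of this depth-wise stacking idea in PyTorch; the layer counts, dimensions, and function names are illustrative assumptions, not the paper's exact Stack LLM configuration.

```python
import copy
import torch.nn as nn

def grow_by_stacking(small_layers: nn.ModuleList, growth_factor: int = 2) -> nn.ModuleList:
    """Initialize a deeper model by repeating the trained layers of a smaller one."""
    grown = []
    for _ in range(growth_factor):
        for layer in small_layers:
            grown.append(copy.deepcopy(layer))  # reuse trained weights as initialization
    return nn.ModuleList(grown)

# Illustrative usage: grow a 4-layer encoder into an 8-layer one, then resume pretraining.
small = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) for _ in range(4)
)
large = grow_by_stacking(small, growth_factor=2)
assert len(large) == 8
```

The point of the copy-based initialization is that the grown model starts from weights that already encode the smaller model's knowledge, which is what makes the comparison against training the larger model from scratch meaningful.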
Related papers
- Why Do Neural Networks Forget: A Study of Collapse in Continual Learning [1.9345014784026022]
Catastrophic forgetting is a major problem in continual learning, and many approaches have been proposed to reduce it. Recent research suggests that structural collapse leads to loss of plasticity, as evidenced by changes in effective rank (eRank). In this study, we investigate the correlation between forgetting and collapse by measuring both weight and activation eRank. (A minimal eRank computation is sketched after this list.)
arXiv Detail & Related papers (2026-03-04T20:19:00Z) - Evolutionary Strategies lead to Catastrophic Forgetting in LLMs [51.91763220981834]
Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms. ES can reach performance close to GRPO on math and reasoning tasks with a comparable compute budget. However, ES is accompanied by significant forgetting of prior abilities, limiting its applicability for training models online.
arXiv Detail & Related papers (2026-01-28T18:59:34Z) - How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z) - The Importance of Being Lazy: Scaling Limits of Continual Learning [60.97756735877614]
We show that increasing model width is only beneficial when it reduces the amount of feature learning, yielding more laziness. We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is only beneficial with highly similar tasks.
arXiv Detail & Related papers (2025-06-20T10:12:38Z) - Temporal-Difference Variational Continual Learning [89.32940051152782]
We propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations. Our approach effectively mitigates catastrophic forgetting, outperforming strong variational CL methods.
arXiv Detail & Related papers (2024-10-10T10:58:41Z) - Gradual Learning: Optimizing Fine-Tuning with Partially Mastered Knowledge in Large Language Models [51.20499954955646]
Large language models (LLMs) acquire vast amounts of knowledge from extensive text corpora during the pretraining phase.
In later stages such as fine-tuning and inference, the model may encounter knowledge not covered in the initial training.
We propose a two-stage fine-tuning strategy to improve the model's overall test accuracy and knowledge retention.
arXiv Detail & Related papers (2024-10-08T08:35:16Z) - Overcoming Growth-Induced Forgetting in Task-Agnostic Continual Learning [9.91929539637026]
In continual learning (CL), model growth enhances adaptability over new data, improving knowledge retention for more tasks.
However, improper model growth can lead to severe degradation of previously learned knowledge, especially in task-agnostic CL, where the entire grown model is used for inference.
This paper presents a novel SparseGrow approach that overcomes growth-induced forgetting (GIFt) while enhancing adaptability to new data.
arXiv Detail & Related papers (2024-08-20T06:05:52Z) - Investigating Continual Pretraining in Large Language Models: Insights and Implications [9.660013084324817]
Continual learning in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies. We introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. Our findings uncover several key insights: (i) continual pretraining consistently improves the 1.5B models studied in this work and is also superior to domain adaptation, (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus, and (iii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both learning and forgetting.
arXiv Detail & Related papers (2024-02-27T10:47:24Z) - An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information.<n>This study empirically evaluates the forgetting phenomenon in large language models during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play significant roles.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
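As referenced in the first related paper above, effective rank (eRank) is commonly defined as the exponential of the Shannon entropy of a matrix's normalized singular-value distribution. The sketch below computes this standard definition for a weight or activation matrix; the cited study's exact variant and tracking procedure may differ.

```python
import torch

def effective_rank(matrix: torch.Tensor, eps: float = 1e-12) -> float:
    """Effective rank: exp(Shannon entropy of the normalized singular values)."""
    s = torch.linalg.svdvals(matrix)            # singular values
    p = s / (s.sum() + eps)                     # normalize into a distribution
    entropy = -(p * torch.log(p + eps)).sum()   # Shannon entropy in nats
    return float(torch.exp(entropy))

# Illustrative usage: compare eRank of a layer's weights before and after fine-tuning;
# a large drop would indicate the structural collapse discussed above.
w_before = torch.randn(768, 768)
w_after = w_before @ torch.diag(torch.linspace(1.0, 0.01, 768))  # artificially skew the spectrum
print(effective_rank(w_before), effective_rank(w_after))
```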