Mechanistic Analysis of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
- URL: http://arxiv.org/abs/2601.18699v1
- Date: Mon, 26 Jan 2026 17:15:10 GMT
- Title: Mechanistic Analysis of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
- Authors: Olaf Yunus Laitinen Imanov
- Abstract summary: Large language models exhibit remarkable performance across diverse tasks through pre-training and fine-tuning paradigms. Continual fine-tuning on sequential tasks induces catastrophic forgetting, where newly acquired knowledge interferes with previously learned capabilities. We identify three primary mechanisms driving forgetting: gradient interference in attention weights, representational drift in intermediate layers, and loss landscape flattening.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models exhibit remarkable performance across diverse tasks through pre-training and fine-tuning paradigms. However, continual fine-tuning on sequential tasks induces catastrophic forgetting, where newly acquired knowledge interferes with previously learned capabilities. Despite widespread observations of this phenomenon, the mechanistic understanding remains limited. Here, we present a comprehensive mechanistic analysis of catastrophic forgetting in transformer-based LLMs during sequential fine-tuning. Through systematic experiments across multiple model scales (109B to 400B total parameters) and task sequences, we identify three primary mechanisms driving forgetting: gradient interference in attention weights, representational drift in intermediate layers, and loss landscape flattening. We demonstrate that forgetting severity correlates strongly with task similarity (Pearson r = 0.87) and gradient alignment metrics. Our analysis reveals that approximately 15 to 23 percent of attention heads undergo severe disruption during fine-tuning, with lower layers showing greater susceptibility. These findings establish mechanistic foundations for developing targeted mitigation strategies in continual learning systems.
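The abstract reports a strong correlation between forgetting severity and "gradient alignment metrics" but gives no code. Below is a minimal, hypothetical sketch of one common way to instantiate such a metric: the cosine similarity between per-task gradients of an attention projection. The toy model, synthetic tasks, and the task_gradient helper are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: one plausible "gradient alignment" metric between two tasks,
# measured on a single attention projection. Assumed instantiation, not the
# paper's published method.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer attention projection (e.g., W_Q).
attn_proj = nn.Linear(16, 16, bias=False)

def task_gradient(weight_module, inputs, targets):
    """Gradient of a squared-error proxy loss w.r.t. the projection weights."""
    weight_module.zero_grad()
    loss = ((weight_module(inputs) - targets) ** 2).mean()
    loss.backward()
    return weight_module.weight.grad.detach().flatten().clone()

# Two synthetic "tasks" with independent input/output statistics.
x_a, y_a = torch.randn(32, 16), torch.randn(32, 16)
x_b, y_b = torch.randn(32, 16), torch.randn(32, 16)

g_a = task_gradient(attn_proj, x_a, y_a)
g_b = task_gradient(attn_proj, x_b, y_b)

# Alignment near -1 indicates interfering gradient directions, the regime
# the abstract associates with stronger forgetting.
alignment = torch.nn.functional.cosine_similarity(g_a, g_b, dim=0)
print(f"gradient alignment (cosine): {alignment.item():+.3f}")
```

Repeating this per attention head and thresholding the resulting weight change would give one (assumed) way to estimate a disrupted-head fraction like the 15 to 23 percent figure quoted above.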
Related papers
- Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision. We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z)
- Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows [18.240532888032394]
We present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework designed to test neural surrogates on challenging, application-driven reactive flows. We benchmark over a dozen representative surrogate model families, including spectral operators, convolutional models, Transformers, pointwise operators, and graph/mesh networks. We identify three robust trends: (i) a scaling barrier governed jointly by dimensionality, stiffness, and mesh irregularity, leading to rapidly growing rollout errors; (ii) performance primarily controlled by architectural inductive biases rather than parameter count; and (iii) a persistent gap between nominal accuracy metrics and physically …
arXiv Detail & Related papers (2025-12-21T05:04:13Z)
- Explaining Grokking and Information Bottleneck through Neural Collapse Emergence [33.22494588674352]
We present a unified explanation of grokking and the information bottleneck principle through the lens of neural collapse. We show that the contraction of population within-class variance is a key factor underlying both grokking and the information bottleneck. By analyzing the dynamics of neural collapse, we show that the distinct time scales of fitting the training set and of the progression of neural collapse account for these late-phase phenomena.
arXiv Detail & Related papers (2025-09-25T07:17:41Z)
- Spectral Insights into Data-Oblivious Critical Layers in Large Language Models [7.486925126518052]
We introduce a data-oblivious approach to identify intrinsic critical layers in pre-fine-tuned language models. We show that layers with significant shifts in representation space are also those most affected during fine-tuning.
arXiv Detail & Related papers (2025-05-31T04:21:39Z)
- Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
- In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
- On Multi-Stage Loss Dynamics in Neural Networks: Mechanisms of Plateau and Descent Stages [1.5235340620594793]
We identify three distinct stages observed in the loss curve during training: the initial plateau stage, the initial descent stage, and the secondary plateau stage.
Through rigorous analysis, we reveal the underlying challenges contributing to slow training during the plateau stages.
arXiv Detail & Related papers (2024-10-26T08:16:00Z)
- Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Small-scale proxies for large-scale Transformer training instabilities [69.36381318171338]
We seek ways to reproduce and study training stability and instability at smaller scales.
By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates.
We study methods such as warm-up, weight decay, and the $\mu$Param to train small models that achieve similar losses across orders of magnitude of learning rate variation.
arXiv Detail & Related papers (2023-09-25T17:48:51Z) - Anatomy of Catastrophic Forgetting: Hidden Representations and Task
Semantics [24.57617154267565]
We investigate how forgetting affects representations in neural network models.
We find that deeper layers are disproportionately the source of forgetting.
We also introduce a novel CIFAR-100 based task approximating realistic input distribution shift.
arXiv Detail & Related papers (2020-07-14T23:31:14Z)
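Several entries above, like the spectral critical-layer paper and the forgetting-anatomy paper, quantify layer-wise representational change, the same "representational drift" mechanism named in the main abstract. A standard instrument for this is linear CKA; the sketch below is an assumed instantiation on synthetic activations, not code from any of the listed papers.

```python
# Hedged sketch: quantifying "representational drift" of one layer between a
# pre- and post-fine-tuning checkpoint via linear CKA. Assumed metric choice.
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (samples, features)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
acts_before = rng.standard_normal((256, 64))  # layer activations on a fixed probe set
acts_after = acts_before + 0.5 * rng.standard_normal((256, 64))  # same probes, post-fine-tuning

# Drift = 1 - similarity; larger values in intermediate layers would match
# the drift mechanism described in the abstract above.
print(f"representational drift: {1.0 - linear_cka(acts_before, acts_after):.3f}")
```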