Revisiting Catastrophic Forgetting in Large Language Model Tuning
- URL: http://arxiv.org/abs/2406.04836v1
- Date: Fri, 7 Jun 2024 11:09:13 GMT
- Title: Revisiting Catastrophic Forgetting in Large Language Model Tuning
- Authors: Hongyu Li, Liang Ding, Meng Fang, Dacheng Tao
- Abstract summary: Catastrophic Forgetting (CF) refers to models forgetting previously acquired knowledge when learning new data.
This paper takes the first step to reveal the direct link between the flatness of the model loss landscape and the extent of CF in the field of large language models.
Experiments on three widely-used fine-tuning datasets, spanning different model scales, demonstrate the effectiveness of our method in alleviating CF.
- Score: 79.70722658190097
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Catastrophic Forgetting (CF) refers to models forgetting previously acquired knowledge when learning new data. It compromises the effectiveness of large language models (LLMs) during fine-tuning, yet the underlying causes have not been thoroughly investigated. This paper takes the first step to reveal the direct link between the flatness of the model loss landscape and the extent of CF in the field of LLMs. Based on this, we introduce sharpness-aware minimization to mitigate CF by flattening the loss landscape. Experiments on three widely-used fine-tuning datasets, spanning different model scales, demonstrate the effectiveness of our method in alleviating CF. Analyses show that our method nicely complements existing anti-forgetting strategies, further enhancing the resistance of LLMs to CF.
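The method hinges on sharpness-aware minimization (SAM), which takes the update gradient not at the current weights but at an adversarially perturbed point within a small L2 ball around them, biasing optimization toward flat minima. Below is a minimal PyTorch sketch of one SAM step, assuming a standard model and loss; the `rho` radius and the helper's signature are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of one sharpness-aware minimization (SAM) step.
# Assumes a PyTorch model; `rho` is an illustrative perturbation radius.
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    inputs, labels = batch

    # First pass: gradients at the current weights w.
    loss = loss_fn(model(inputs), labels)
    loss.backward()

    # Climb to the approximate worst-case point within an L2 ball of radius rho.
    with torch.no_grad():
        grad_norm = torch.norm(torch.stack(
            [p.grad.norm(p=2) for p in model.parameters() if p.grad is not None]
        ), p=2)
        eps = []
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = p.grad * (rho / (grad_norm + 1e-12))
            p.add_(e)  # w <- w + eps
            eps.append(e)

    # Second pass: the gradient at the perturbed weights defines the update.
    model.zero_grad()
    loss_fn(model(inputs), labels).backward()

    # Restore the original weights, then apply the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)  # w <- w - eps
    base_optimizer.step()
    model.zero_grad()
    return loss.item()
```

In practice each step costs two forward-backward passes, the price paid for steering fine-tuning toward flatter regions of the loss landscape.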
Related papers
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
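For context, a minimal sketch of a LoRA-style low-rank adapter in PyTorch is shown below: the pre-trained weight is frozen and only a rank-r update B·A is trained; `rank` and `alpha` are illustrative defaults, not settings from the paper.

```python
# A minimal sketch of a low-rank (LoRA-style) adapter around a frozen linear layer.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)  # freeze the pre-trained weight
        # Trainable low-rank factors: B @ A has shape (out_features, in_features).
        self.A = nn.Parameter(torch.randn(rank, base_linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base_linear.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base projection plus the scaled rank-`rank` update.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```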
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
- LLMs for Generating and Evaluating Counterfactuals: A Comprehensive Study [2.7731115923558143]
Large Language Models (LLMs) have shown remarkable performance in NLP tasks, but their efficacy in generating high-quality Counterfactuals (CFs) remains uncertain.
We compare several common LLMs and evaluate their CFs, assessing both intrinsic metrics and the impact of these CFs on data augmentation.
Our results show that LLMs generate fluent CFs, but struggle to keep the induced changes minimal.
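One way to quantify how minimal an induced change is: a normalized token-level edit distance between the original text and its counterfactual. The sketch below is an illustrative metric, not necessarily the study's exact protocol.

```python
# An illustrative minimality metric for counterfactuals: normalized
# token-level Levenshtein distance (0 = identical, 1 = fully rewritten).
def token_edit_distance(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming Levenshtein distance over tokens.
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ta != tb))
    return dp[-1]

def minimality(original: str, counterfactual: str) -> float:
    a, b = original.split(), counterfactual.split()
    return token_edit_distance(a, b) / max(len(a), len(b), 1)
```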
arXiv Detail & Related papers (2024-04-26T11:57:21Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
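The dynamic batch loading idea can be sketched as re-weighting per-domain sampling probabilities by how far each domain's current loss sits above a reference loss; the exponential update rule and values below are illustrative assumptions, not the paper's exact formula.

```python
# An illustrative sketch of dynamic batch loading: domains whose loss lags
# their reference the most get sampled more in subsequent batches.
import math

def update_domain_weights(weights, current_loss, reference_loss, temperature=1.0):
    gaps = {d: max(current_loss[d] - reference_loss[d], 0.0) for d in weights}
    scores = {d: weights[d] * math.exp(gaps[d] / temperature) for d in weights}
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

# Hypothetical usage: resample the next batch's domain mix from the new weights.
w = update_domain_weights(
    {"web": 0.6, "code": 0.2, "books": 0.2},
    current_loss={"web": 2.1, "code": 1.4, "books": 2.6},
    reference_loss={"web": 2.0, "code": 1.5, "books": 2.2},
)
```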
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
- Faithful Explanations of Black-box NLP Models Using LLM-generated Counterfactuals [67.64770842323966]
Causal explanations of predictions of NLP systems are essential to ensure safety and establish trust.
Existing methods often fall short of explaining model predictions effectively or efficiently.
We propose two approaches for counterfactual (CF) approximation.
arXiv Detail & Related papers (2023-10-01T07:31:04Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- DualCF: Efficient Model Extraction Attack from Counterfactual Explanations [57.46134660974256]
Cloud service providers have launched Machine-Learning-as-a-Service platforms to allow users to access large-scale cloud-based models via APIs.
Beyond predictions, these APIs can also return extra information such as counterfactual explanations, which inevitably makes the cloud models more vulnerable to extraction attacks.
We propose a simple yet effective querying strategy that greatly improves the efficiency of stealing a classification model.
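To see why such explanations help an attacker: each query can return a second labeled point (the counterfactual) lying close to the decision boundary, which is exactly the kind of training data a substitute model needs. The sketch below is a generic illustration; `cloud_api` and its methods are hypothetical, and the paper's specific DualCF pairing strategy is not reproduced here.

```python
# A hedged, generic illustration of counterfactual-aided model extraction.
# `cloud_api.predict` and `cloud_api.counterfactual` are hypothetical stand-ins
# for a real MLaaS API; this is not the paper's exact querying strategy.
def collect_extraction_data(cloud_api, seeds):
    dataset = []
    for x in seeds:
        y = cloud_api.predict(x)            # ordinary prediction query
        x_cf = cloud_api.counterfactual(x)  # CF returned by the explanation API
        y_cf = cloud_api.predict(x_cf)      # the CF lies just across the boundary
        dataset.extend([(x, y), (x_cf, y_cf)])
    return dataset  # train a substitute classifier on these labeled pairs
```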
arXiv Detail & Related papers (2022-05-13T08:24:43Z)
- Impact of Channel Variation on One-Class Learning for Spoof Detection [5.549602650463701]
Spoofing detection increases the reliability of automatic speaker verification (ASV) systems, but its performance degrades significantly under channel variation.
Which data-feeding strategy is optimal for multi-condition training (MCT) is not known in the case of spoof detection.
This study highlights the often-overlooked process of data-feeding and mini-batching and raises awareness of the need to refine it for better performance.
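As a concrete illustration of two contrasting data-feeding strategies for MCT, the sketch below builds condition-homogeneous mini-batches versus condition-mixed ones; the strategy names and structure are illustrative assumptions, not the study's exact taxonomy.

```python
# Two illustrative data-feeding strategies for multi-condition training (MCT).
import random

def homogeneous_batches(data_by_condition, batch_size):
    # Each mini-batch contains utterances from a single channel condition.
    for cond, items in data_by_condition.items():
        random.shuffle(items)
        for i in range(0, len(items), batch_size):
            yield items[i:i + batch_size]

def mixed_batches(data_by_condition, batch_size):
    # Each mini-batch draws from a pool mixing all channel conditions.
    pool = [x for items in data_by_condition.values() for x in items]
    random.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]
```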
arXiv Detail & Related papers (2021-09-30T07:56:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.