Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient
Tuning
- URL: http://arxiv.org/abs/2402.18865v1
- Date: Thu, 29 Feb 2024 05:27:45 GMT
- Title: Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient
Tuning
- Authors: Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, Wei Qin
- Abstract summary: Large language models (LLMs) exhibit remarkable performance in language understanding and generation.
When LLMs are continuously fine-tuned on complex and diverse domain-specific downstream tasks, performance on historical tasks degrades (catastrophic forgetting).
A trade-off must be struck between learning plasticity and memory stability.
- Score: 9.38259062204602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing research has shown that large language models (LLMs) exhibit
remarkable performance in language understanding and generation. However, when
LLMs are continuously fine-tuned on complex and diverse domain-specific
downstream tasks, the inference performance on historical tasks decreases
dramatically, which is known as the catastrophic forgetting problem. A trade-off
must be struck between learning plasticity and memory stability. Plenty of
existing works have explored strategies like memory replay, regularization and
parameter isolation, but little is known about the geometric connection of
various adjacent minima in continual LLM fine-tuning scenarios. In this
work, we investigate the geometric connections of different minima through the
lens of mode connectivity, which means different minima can be connected by a
low-loss valley. Through extensive experiments, we uncover the mode
connectivity phenomenon in the LLM continual learning scenario and find that
it can strike a balance between plasticity and stability. Building upon these
findings, we propose a simple yet effective method called Interpolation-based
LoRA (I-LoRA), which constructs a dual-memory experience replay framework based
on LoRA parameter interpolations. Extensive experiments and analysis on eight
domain-specific CL benchmarks demonstrate that I-LoRA consistently shows
significant improvements over previous state-of-the-art approaches, with up
to $11\%$ performance gains, providing a strong baseline and insights for
future research on the large language model continual learning problem. Our
code is available at \url{https://github.com/which47/LLMCL}.
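To make the method concrete, here is a minimal sketch, assuming PyTorch and LoRA factors stored as plain state dicts; it is not the authors' released implementation (see the linked repository for that). It illustrates the two ingredients described above: a linear-interpolation probe of mode connectivity between two fine-tuned LoRA minima, and an I-LoRA-style dual memory in which a slow copy of the LoRA parameters tracks the fast working copy by interpolation. The names interpolate_lora, mode_connectivity_probe, DualMemoryLoRA, load_lora_fn, and ema_lambda are illustrative assumptions, not identifiers from the paper.

import copy
import torch


def interpolate_lora(state_a, state_b, alpha):
    """Linearly interpolate two LoRA state dicts: (1 - alpha) * A + alpha * B."""
    return {k: (1.0 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}


@torch.no_grad()
def mode_connectivity_probe(model, load_lora_fn, loss_fn, batch,
                            state_a, state_b, steps=11):
    """Evaluate the loss at evenly spaced points on the segment joining two
    minima; a low, flat curve is the 'low-loss valley' of mode connectivity."""
    losses = []
    for i in range(steps):
        alpha = i / (steps - 1)
        # load_lora_fn is assumed to write the interpolated LoRA factors into the model
        load_lora_fn(model, interpolate_lora(state_a, state_b, alpha))
        losses.append(loss_fn(model, batch).item())
    return losses


class DualMemoryLoRA:
    """Fast/slow dual memory: the fast LoRA weights are trained on the current
    task; the slow weights track them by interpolation and serve inference."""

    def __init__(self, lora_state, ema_lambda=0.99):
        self.fast = lora_state                 # plastic, task-specific memory
        self.slow = copy.deepcopy(lora_state)  # stable, long-term memory
        self.ema_lambda = ema_lambda

    @torch.no_grad()
    def update_slow(self):
        # slow <- lambda * slow + (1 - lambda) * fast
        for k in self.slow:
            self.slow[k].mul_(self.ema_lambda).add_(
                self.fast[k], alpha=1.0 - self.ema_lambda)

In a continual-learning loop one would train the fast weights on each new task, call update_slow() after each step or task, and evaluate with the slow weights; the experience-replay buffer that I-LoRA couples with this interpolation is omitted here for brevity.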
Related papers
- Dual Low-Rank Adaptation for Continual Learning with Pre-Trained Models [38.97142043836567]
Continual learning (CL) aims to enable vision transformers (ViTs) to learn new tasks over time.
However, catastrophic forgetting remains a persistent challenge.
We propose a novel PEFT-CL method called Dual Low-Rank Adaptation (DualLoRA).
arXiv Detail & Related papers (2024-11-01T14:28:39Z)
- Is Parameter Collision Hindering Continual Learning in LLMs? [50.57658782050275]
Large Language Models (LLMs) often suffer from catastrophic forgetting when learning multiple tasks sequentially.
We show that building non-collision parameters is a more critical factor in addressing CL challenges.
We propose Non-collision Low-Rank Adaptation (N-LoRA), a simple yet effective approach leveraging low collision rates to enhance CL in LLMs.
arXiv Detail & Related papers (2024-10-14T05:54:11Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Surgical Feature-Space Decomposition of LLMs: Why, When and How? [8.826164604720738]
We empirically study the efficacy of weight and feature space decomposition in transformer-based language models.
We show that surgical decomposition provides critical insights into the trade-off between compression and language modelling performance.
We extend our investigation to the implications of low-rank approximations on model bias.
arXiv Detail & Related papers (2024-05-17T07:34:03Z)
- LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
- CRaSh: Clustering, Removing, and Sharing Enhance Fine-tuning without Full Large Language Model [22.870512676002463]
This paper focuses on Offsite-Tuning (OFT), a representative technique that transfers transformer blocks between centralized LLMs and downstream emulators.
Inspired by these observations, we propose CRaSh, involving Clustering, Removing, and Sharing, a training-free strategy to derive improved emulators from LLMs.
Our findings demonstrate a linear connectivity among these optima falling over the same basin, thereby highlighting the effectiveness of CRaSh and OFT.
arXiv Detail & Related papers (2023-10-24T03:08:58Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.