Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient
Tuning
- URL: http://arxiv.org/abs/2402.18865v1
- Date: Thu, 29 Feb 2024 05:27:45 GMT
- Title: Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient
Tuning
- Authors: Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, Wei Qin
- Abstract summary: Large language models (LLMs) exhibit remarkable performance in language understanding and generation.
LLMs are continuously fine-tuned on complex and diverse domain-specific downstream tasks.
- A trade-off must be maintained between learning plasticity and memory stability.
- Score: 9.38259062204602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing research has shown that large language models (LLMs) exhibit
remarkable performance in language understanding and generation. However, when
LLMs are continuously fine-tuned on complex and diverse domain-specific
downstream tasks, the inference performance on historical tasks decreases
dramatically, a phenomenon known as catastrophic forgetting. A trade-off
must be struck between learning plasticity and memory stability. Many
existing works have explored strategies such as memory replay, regularization,
and parameter isolation, but little is known about the geometric connection of
various adjacent minima in the continual LLM fine-tuning scenario. In this
work, we investigate the geometric connections of different minima through the
lens of mode connectivity, which means different minima can be connected by a
low-loss valley. Through extensive experiments, we uncover the mode
connectivity phenomenon in the LLM continual learning scenario and find that
it can strike a balance between plasticity and stability. Building upon these
findings, we propose a simple yet effective method called Interpolation-based
LoRA (I-LoRA), which constructs a dual-memory experience replay framework based
on LoRA parameter interpolations. Extensive experiments and analysis on eight
domain-specific continual learning (CL) benchmarks demonstrate that I-LoRA
consistently shows significant improvement over previous state-of-the-art approaches, with up
to $11\%$ performance gains, providing a strong baseline and insights for
future research on the large language model continual learning problem. Our
code is available at \url{https://github.com/which47/LLMCL}.
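The core of I-LoRA's dual-memory design, interpolating between a plasticity-oriented "fast" LoRA adapter and a stability-oriented "slow" one, can be sketched in a few lines. The function name, the toy rank-2 LoRA factors, and the mixing coefficient `lam` below are illustrative assumptions, not the authors' implementation (see their repository for that):

```python
import numpy as np

def interpolate_lora(slow: dict, fast: dict, lam: float = 0.1) -> dict:
    """EMA-style interpolation of LoRA parameters (hypothetical sketch).

    `slow` holds the stability-oriented memory weights; `fast` holds the
    plasticity-oriented weights trained on the current task.
    """
    return {name: (1.0 - lam) * slow[name] + lam * fast[name]
            for name in slow}

# Toy rank-2 LoRA factors for a 4x4 weight matrix.
rng = np.random.default_rng(0)
slow = {"lora_A": np.zeros((2, 4)), "lora_B": np.zeros((4, 2))}
fast = {"lora_A": rng.normal(size=(2, 4)), "lora_B": rng.normal(size=(4, 2))}

# After training `fast` on the current task, pull the slow memory toward it.
slow = interpolate_lora(slow, fast, lam=0.5)
```

In this reading, repeating the interpolation after each task lets the slow adapter act as the long-term memory while the fast adapter retains learning plasticity.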
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding [27.004817441034795]
Collaborative decoding between large language models (LLMs) and small language models (SLMs) offers a novel approach to balancing generation quality and efficiency.
Inspired by dual-process cognitive theory, we integrate these methods into a unified framework termed Fast and Slow Generating (FS-GEN).
This paper explores several techniques within the FS-GEN framework, including speculative decoding, contrastive decoding, and emulator or proxy fine-tuning.
arXiv Detail & Related papers (2024-06-18T05:59:28Z)
- Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
- Surgical Feature-Space Decomposition of LLMs: Why, When and How? [8.826164604720738]
We empirically study the efficacy of weight and feature space decomposition in transformer-based language models.
We show that surgical decomposition provides critical insights into the trade-off between compression and language modelling performance.
We extend our investigation to the implications of low-rank approximations on model bias.
arXiv Detail & Related papers (2024-05-17T07:34:03Z)
- Efficient Learnable Collaborative Attention for Single Image Super-Resolution [18.955369476815136]
Non-Local Attention (NLA) is a powerful technique for capturing long-range feature correlations in deep single image super-resolution (SR).
We propose a novel Learnable Collaborative Attention (LCoA) that introduces inductive bias into non-local modeling.
Our LCoA can reduce the non-local modeling time by about 83% in the inference stage.
arXiv Detail & Related papers (2024-04-07T11:25:04Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
- CRaSh: Clustering, Removing, and Sharing Enhance Fine-tuning without Full Large Language Model [22.870512676002463]
This paper focuses on Offsite-Tuning (OFT), a representative technique that transfers transformer blocks between centralized LLMs and downstream emulators.
Inspired by these observations, we propose CRaSh, involving Clustering, Removing, and Sharing, a training-free strategy to derive improved emulators from LLMs.
Our findings demonstrate a linear connectivity among these optima falling over the same basin, thereby highlighting the effectiveness of CRaSh and OFT.
arXiv Detail & Related papers (2023-10-24T03:08:58Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We also examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Exploring Mode Connectivity for Pre-trained Language Models [91.33378704580295]
We study how to effectively adapt pre-trained language models (PLMs) to high-performance minima.
In this paper, we investigate the geometric connections of different minima through the lens of mode connectivity.
arXiv Detail & Related papers (2022-10-25T15:40:11Z)
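Mode connectivity, as used in the abstract, means two minima can be joined by a path along which the loss stays low. A simple probe evaluates the loss along the straight line between two parameter vectors; the function `path_losses` and the toy quadratic loss below are hypothetical stand-ins for a real model and its loss:

```python
import numpy as np

def path_losses(theta_a, theta_b, loss_fn, n_points=11):
    """Evaluate loss along the straight line between two minima.

    A roughly flat, low profile suggests the minima lie in a shared
    low-loss valley (linear mode connectivity).
    """
    alphas = np.linspace(0.0, 1.0, n_points)
    return [loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas]

# Toy quadratic loss with a flat valley along the first coordinate.
loss = lambda w: w[1] ** 2
theta_a = np.array([0.0, 0.0])
theta_b = np.array([5.0, 0.0])
print(max(path_losses(theta_a, theta_b, loss)))  # prints 0.0
```

For real networks the same probe is run on the flattened model (or adapter) parameters, with `loss_fn` computing the validation loss of the interpolated weights.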
This list is automatically generated from the titles and abstracts of the papers in this site.