SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models
- URL: http://arxiv.org/abs/2401.08295v3
- Date: Thu, 6 Jun 2024 12:02:51 GMT
- Title: SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models
- Authors: Weixiang Zhao, Shilong Wang, Yulin Hu, Yanyan Zhao, Bing Qin, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che
- Abstract summary: Continual learning (CL) ability is vital for deploying large language models (LLMs) in the dynamic world.
Existing methods devise a learning module to acquire task-specific knowledge with parameter-efficient tuning (PET) blocks and a selection module to pick out the corresponding block for the test input.
We propose a novel Shared Attention Framework (SAPT) that aligns PET learning and selection via the Shared Attentive Learning & Selection module.
- Score: 71.78800549517298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The continual learning (CL) ability is vital for deploying large language models (LLMs) in the dynamic world. Existing methods devise a learning module to acquire task-specific knowledge with a parameter-efficient tuning (PET) block and a selection module to pick out the corresponding one for the test input, aiming to handle the challenges of catastrophic forgetting and knowledge transfer in CL. However, these methods tend to address only one of the challenges, ignoring the potential of aligning the two modules to effectively address catastrophic forgetting and knowledge transfer simultaneously. To this end, we propose a novel Shared Attention Framework (SAPT) to align PET learning and selection via the Shared Attentive Learning & Selection module. Extensive experiments on two CL benchmarks demonstrate the superiority of SAPT. Moreover, SAPT consistently demonstrates its superiority when we scale it to different model sizes (from 770M to 13B), different model architectures (T5 and LLaMA-2), and unseen tasks.
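The abstract only names the Shared Attentive Learning & Selection idea, so the following is a minimal sketch of what "one shared attention drives both PET learning and selection" could look like, assuming the PET blocks are LoRA-style adapters and the selection signal is a softmax over learned per-task keys. All class names, shapes, and initializations below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: attention-weighted mixing of task-specific PET blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRABlock(nn.Module):
    """One task-specific PET block: a low-rank residual update."""
    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op update

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(h))

class SharedAttentiveSelection(nn.Module):
    """Attend over per-task PET blocks with one shared query projection,
    so the same attention weights serve learning and selection."""
    def __init__(self, d_model: int, n_tasks: int, rank: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList([LoRABlock(d_model, rank) for _ in range(n_tasks)])
        self.task_keys = nn.Parameter(torch.randn(n_tasks, d_model) * 0.02)
        self.query_proj = nn.Linear(d_model, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, d_model) pooled representation of the input instance
        q = self.query_proj(h)                               # (batch, d_model)
        attn = F.softmax(q @ self.task_keys.T, dim=-1)       # (batch, n_tasks)
        updates = torch.stack([blk(h) for blk in self.blocks], dim=1)  # (batch, n_tasks, d_model)
        return h + (attn.unsqueeze(-1) * updates).sum(dim=1)

# Usage: mix task-specific updates for a batch of pooled hidden states.
module = SharedAttentiveSelection(d_model=768, n_tasks=5)
out = module(torch.randn(4, 768))  # (4, 768)
```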
Related papers
- Learning to Route for Dynamic Adapter Composition in Continual Learning with Language Models [56.93608812478369]
We present L2R, a method that isolates the training of new PEFT modules to ensure their task specialization.
L2R then learns to compose the learned modules by training a network of routers that leverages a small memory containing examples of previously seen tasks.
Our results demonstrate that L2R provides an effective composition of PEFT modules, leading to improved generalization and performance compared to other methods.
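A routing setup of this kind can be sketched as a small network that scores a pool of frozen, task-specific PEFT modules per input and is trained on a handful of stored examples from earlier tasks. The router architecture, memory format, and soft composition below are assumptions for illustration, not the L2R implementation.

```python
# Minimal sketch: a router over frozen PEFT modules, fit on a small example memory.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEFTRouter(nn.Module):
    """Produce composition weights over a pool of frozen, task-specific modules."""
    def __init__(self, d_model: int, n_modules: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_model // 2), nn.ReLU(),
            nn.Linear(d_model // 2, n_modules),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.net(h), dim=-1)  # (batch, n_modules) composition weights

def train_router(router, memory, epochs: int = 50, lr: float = 1e-3):
    """Fit the router on a small memory of (embedding, task_id) pairs kept from
    previously seen tasks; the PEFT modules themselves stay frozen."""
    opt = torch.optim.Adam(router.parameters(), lr=lr)
    embeds = torch.stack([e for e, _ in memory])
    labels = torch.tensor([t for _, t in memory])
    for _ in range(epochs):
        opt.zero_grad()
        logits = router.net(embeds)             # route scores before softmax
        loss = F.cross_entropy(logits, labels)  # supervise with the stored task id
        loss.backward()
        opt.step()
    return router

# Usage with toy data: 3 previously learned modules, 12 stored examples.
memory = [(torch.randn(64), torch.randint(0, 3, ()).item()) for _ in range(12)]
router = train_router(PEFTRouter(d_model=64, n_modules=3), memory)
weights = router(torch.randn(2, 64))  # (2, 3) composition weights at test time
```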
arXiv Detail & Related papers (2024-08-16T23:57:29Z) - TaSL: Task Skill Localization and Consolidation for Language Model Continual Learning [41.28933724210434]
Language model continual learning (CL) has recently attracted significant interest for its ability to adapt large language models (LLMs) to dynamic real-world scenarios without retraining.
Existing approaches commonly utilize multiple parameter-efficient fine-tuning (PEFT) blocks to acquire task-specific knowledge, yet these methods are inefficient and fail to leverage potential knowledge transfer across tasks.
We introduce a novel CL framework for language models, named Task Skill Localization and Consolidation (TaSL), which boosts knowledge transfer without depending on memory replay.
arXiv Detail & Related papers (2024-08-09T17:44:45Z) - Scalable Language Model with Generalized Continual Learning [58.700439919096155]
Joint Adaptive Re-Parameterization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks.
Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
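The summary does not spell out how the task-related knowledge is retrieved, so the sketch below assumes a simple prototype store: one unit-norm mean embedding per task, matched to the query by cosine similarity. This is an illustrative stand-in, not the DTKR procedure from the paper.

```python
# Minimal sketch: retrieve task-specific knowledge by prototype matching.
import torch
import torch.nn.functional as F

class TaskKnowledgeStore:
    def __init__(self):
        self.prototypes = {}   # task_id -> unit-norm mean embedding
        self.knowledge = {}    # task_id -> task-specific parameters (opaque here)

    def add_task(self, task_id: str, embeddings: torch.Tensor, params: dict):
        self.prototypes[task_id] = F.normalize(embeddings.mean(dim=0), dim=-1)
        self.knowledge[task_id] = params

    def retrieve(self, query: torch.Tensor, top_k: int = 1):
        """Return the knowledge of the task(s) whose prototype best matches the query."""
        q = F.normalize(query, dim=-1)
        scored = sorted(
            ((float(q @ p), tid) for tid, p in self.prototypes.items()),
            reverse=True,
        )
        return [(tid, self.knowledge[tid]) for _, tid in scored[:top_k]]

# Usage: register two tasks, then retrieve knowledge for a new input embedding.
store = TaskKnowledgeStore()
store.add_task("sentiment", torch.randn(32, 128), {"adapter": "sentiment-lora"})
store.add_task("qa", torch.randn(32, 128), {"adapter": "qa-lora"})
print(store.retrieve(torch.randn(128), top_k=1))
```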
arXiv Detail & Related papers (2024-04-11T04:22:15Z) - Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation [59.37775534633868]
We present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs.
We also propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity.
arXiv Detail & Related papers (2024-03-27T17:50:00Z) - Interactive Continual Learning: Fast and Slow Thinking [19.253164551254734]
This paper presents a novel Interactive Continual Learning framework, enabled by collaborative interactions among models of various sizes.
To improve memory retrieval in System1, we introduce the CL-vMF mechanism, based on the von Mises-Fisher (vMF) distribution.
Comprehensive evaluation of our proposed ICL demonstrates significant resistance to forgetting and superior performance relative to existing methods.
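The vMF idea lends itself to a short sketch: stored class means live on the unit hypersphere, and because the vMF density is proportional to exp(kappa * mu^T x), retrieval with a shared concentration kappa reduces to a temperature-scaled softmax over cosine similarities. The shared kappa and softmax readout are illustrative assumptions, not the paper's CL-vMF mechanism.

```python
# Minimal sketch: vMF-style memory retrieval as kappa-scaled cosine similarity.
import torch
import torch.nn.functional as F

def vmf_retrieval_scores(query: torch.Tensor, class_means: torch.Tensor,
                         kappa: float = 16.0) -> torch.Tensor:
    """With a shared concentration kappa, the vMF normalizers cancel and the
    class posteriors become a softmax over kappa * cosine similarity."""
    q = F.normalize(query, dim=-1)          # (batch, d) on the unit sphere
    mu = F.normalize(class_means, dim=-1)   # (n_classes, d) on the unit sphere
    return F.softmax(kappa * q @ mu.T, dim=-1)

# Usage: 4 queries against 10 stored class means in a 256-d embedding space.
probs = vmf_retrieval_scores(torch.randn(4, 256), torch.randn(10, 256))
print(probs.shape, probs.sum(dim=-1))  # (4, 10), rows sum to 1
```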
arXiv Detail & Related papers (2024-03-05T03:37:28Z) - ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection [10.012716326383567]
Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos.
We present ZEETAD, featuring two modules: dual-localization and zero-shot proposal classification.
We enhance discriminative capability on unseen classes by minimally updating the frozen CLIP encoder with lightweight adapters.
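"Minimally updating the frozen encoder with lightweight adapters" can be illustrated with a bottleneck adapter and a residual connection on top of frozen features, where only the adapter receives gradients. Where the adapter sits inside the real CLIP encoder, and the stand-in backbone used below, are assumptions for illustration.

```python
# Minimal sketch: a trainable bottleneck adapter over a frozen backbone.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual keeps frozen features intact

class AdaptedEncoder(nn.Module):
    """Frozen backbone + trainable adapter; only the adapter has requires_grad=True."""
    def __init__(self, backbone: nn.Module, d_model: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.adapter = BottleneckAdapter(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(x)
        return self.adapter(feats)

# Usage with a stand-in backbone (a real setup would wrap the frozen CLIP encoder).
model = AdaptedEncoder(nn.Linear(512, 512), d_model=512)
out = model(torch.randn(2, 512))
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(out.shape, trainable)  # only adapter.* parameters are trainable
```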
arXiv Detail & Related papers (2023-11-01T00:17:37Z) - FedYolo: Augmenting Federated Learning with Pretrained Transformers [61.56476056444933]
In this work, we investigate pretrained transformers (PTF) to achieve on-device learning goals.
We show that larger scale shrinks the accuracy gaps between alternative approaches and improves robustness.
Finally, it enables clients to solve multiple unrelated tasks simultaneously using a single PTF.
arXiv Detail & Related papers (2023-07-10T21:08:52Z) - Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z) - Class-Specific Channel Attention for Few-Shot Learning [16.019616787091202]
Few-shot learning has attracted growing attention in computer vision because it enables model training without the need for large amounts of data.
Conventional transfer-based solutions that aim to transfer knowledge learned from large labeled training sets to target testing sets are limited.
We propose a Class-Specific Channel Attention (CSCA) module, which learns to highlight the discriminative channels of each class by assigning each class its own CSCA weight vector.
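The one-weight-vector-per-class idea can be sketched as a per-class channel gate applied before similarity-based classification. The sigmoid gating and prototype comparison below are illustrative assumptions, not the CSCA module as published.

```python
# Minimal sketch: class-specific channel attention over pooled features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSCAHead(nn.Module):
    def __init__(self, n_classes: int, n_channels: int):
        super().__init__()
        # One learnable channel-attention vector per class.
        self.class_weights = nn.Parameter(torch.zeros(n_classes, n_channels))

    def forward(self, feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
        """feats: (batch, C) pooled features; prototypes: (n_classes, C) class means.
        Score class c by comparing its gated features to its gated prototype."""
        gate = torch.sigmoid(self.class_weights)                 # (n_classes, C)
        w = F.normalize(feats.unsqueeze(1) * gate.unsqueeze(0), dim=-1)  # (batch, n_classes, C)
        p = F.normalize(prototypes * gate, dim=-1)               # (n_classes, C)
        return (w * p.unsqueeze(0)).sum(dim=-1)                  # (batch, n_classes) cosine scores

# Usage: 5-way few-shot episode with 640-channel pooled features.
head = CSCAHead(n_classes=5, n_channels=640)
logits = head(torch.randn(8, 640), torch.randn(5, 640))  # (8, 5)
```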
arXiv Detail & Related papers (2022-09-03T05:54:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.