Does Continual Learning Equally Forget All Parameters?
- URL: http://arxiv.org/abs/2304.04158v1
- Date: Sun, 9 Apr 2023 04:36:24 GMT
- Title: Does Continual Learning Equally Forget All Parameters?
- Authors: Haiyan Zhao, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang
- Abstract summary: Distribution shift (e.g., task or domain shift) in continual learning (CL) usually results in catastrophic forgetting of neural networks.
We study which modules in neural networks are more prone to forgetting by investigating their training dynamics during CL.
- We propose a simpler and more efficient method that entirely removes the every-step replay and replaces it with FPF triggered only $k$ times periodically during CL.
- Score: 55.431048995662714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distribution shift (e.g., task or domain shift) in continual learning (CL)
usually results in catastrophic forgetting of neural networks. Although it can
be alleviated by repeatedly replaying buffered data, the every-step replay is
time-consuming. In this paper, we study which modules in neural networks are
more prone to forgetting by investigating their training dynamics during CL.
Our proposed metrics show that only a few modules are more task-specific and
change sensitively across tasks, while others can be shared across tasks as
common knowledge. Hence, we attribute forgetting mainly to the former and find
that finetuning them only on a small buffer at the end of any CL method can
bring non-trivial improvement. Due to the small number of finetuned parameters,
such ``Forgetting Prioritized Finetuning (FPF)'' is efficient in computation.
We further propose a simpler and more efficient method that entirely removes
the every-step replay and replaces it with FPF triggered only $k$ times
periodically during CL. Surprisingly, this ``$k$-FPF'' performs comparably to FPF
and outperforms the SOTA CL methods but significantly reduces their
computational overhead and cost. In experiments on several benchmarks of class-
and domain-incremental CL, FPF consistently improves existing CL methods by a
large margin, and $k$-FPF further excels in efficiency without degrading the
accuracy. We also empirically study the impact of buffer size, epochs per
task, and the choice of finetuned modules on the cost and accuracy of our methods.
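Both methods amount to a small change in the training loop. Below is a minimal, hypothetical sketch in PyTorch (the module-sensitivity scores, the helper `plain_sgd_step`, and the SGD/cross-entropy choices are placeholders of ours, not the paper's released code): FPF finetunes only the most forgetting-prone modules on the small replay buffer, and $k$-FPF removes every-step replay and instead triggers that finetuning $k$ times, evenly spaced over the CL run.

```python
import torch

def forgetting_prone_modules(scores, top_k=2):
    """Pick the top_k modules that change most across tasks.

    `scores` maps module name -> a sensitivity value (e.g. the norm of the
    parameter change between consecutive tasks); the paper defines its own
    training-dynamics metrics, this is only a placeholder.
    """
    return set(sorted(scores, key=scores.get, reverse=True)[:top_k])

def fpf(model, buffer_loader, module_names, steps=100, lr=1e-3):
    """Forgetting Prioritized Finetuning: update only the selected modules
    on the small replay buffer; all other parameters stay frozen."""
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(m) for m in module_names)
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    it = iter(buffer_loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(buffer_loader)
            x, y = next(it)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    for p in model.parameters():          # restore for subsequent CL training
        p.requires_grad = True

def k_fpf_training(model, tasks, buffer_loader, scores, k=5):
    """k-FPF: no per-step replay; FPF is triggered k times over the whole run."""
    total_steps = sum(len(t) for t in tasks)
    period = max(total_steps // k, 1)
    step = 0
    for task_loader in tasks:
        for x, y in task_loader:
            plain_sgd_step(model, x, y)   # hypothetical helper: plain SGD on new-task data only
            step += 1
            if step % period == 0:        # periodic trigger replaces every-step replay
                fpf(model, buffer_loader, forgetting_prone_modules(scores))
```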
Related papers
- CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information [33.01180010689081]
We introduce an efficient structured pruning framework named CFSP.
We first allocate the sparsity budget across blocks based on their importance and then retain important weights within each block.
Results demonstrate that CFSP outperforms existing methods on diverse models across various sparsity budgets.
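The coarse-to-fine idea can be sketched in a few lines (an illustration under our own assumptions: the proportional allocation rule and magnitude-based weight scoring below are stand-ins for CFSP's activation-based importance): the coarse step splits a global sparsity budget across blocks in proportion to their importance, and the fine step keeps the highest-scoring weights inside each block.

```python
import numpy as np

def allocate_block_sparsity(block_importance, global_sparsity):
    """Coarse step: more important blocks get lower sparsity (keep more weights).

    Assumed rule: a block's keep-ratio is proportional to its importance,
    scaled so the average keep-ratio matches the global budget.
    """
    imp = np.asarray(block_importance, dtype=float)
    keep = (1.0 - global_sparsity) * len(imp) * imp / imp.sum()
    return np.clip(1.0 - keep, 0.0, 1.0)        # per-block sparsity levels

def prune_block(weights, sparsity):
    """Fine step: zero out the lowest-scoring weights inside one block
    (plain magnitude is used here as a stand-in importance score)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

# Example: three blocks, the first considered twice as important as the others.
per_block = allocate_block_sparsity([2.0, 1.0, 1.0], global_sparsity=0.5)
# -> roughly [0.25, 0.625, 0.625]: the important block is pruned least.
```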
arXiv Detail & Related papers (2024-09-20T04:03:27Z)
- FeDeRA: Efficient Fine-tuning of Language Models in Federated Learning Leveraging Weight Decomposition [7.229494183462913]
Despite exceptional performance after fine-tuning, pre-trained language models (PLMs) face significant challenges due to privacy concerns.
We consider federated learning (FL) to fine-tune PLMs in this paper.
One promising solution is to incorporate parameter-efficient fine-tuning (PEFT) into FL, which trains a much smaller set of parameters than full-parameter fine-tuning (FFT).
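As a concrete illustration of PEFT inside FL (a generic sketch, not FeDeRA itself: the adapter below uses a standard LoRA-style zero initialization, whereas the paper's title points to a weight-decomposition-based variant, and `local_update` is a hypothetical client trainer), each client updates only small low-rank matrices and the server averages those, so per-round communication scales with the adapter size rather than the full model.

```python
import torch

class LowRankAdapter(torch.nn.Module):
    """Generic LoRA-style adapter: y = W x + (B A) x with a small rank r.

    Only A and B are trainable; the pretrained weight W stays frozen.
    FeDeRA's title suggests initializing the adapter from a decomposition
    (e.g. SVD) of W; here we keep the generic zero-init to stay neutral.
    """
    def __init__(self, base_linear, r=8):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad = False
        out_f, in_f = base_linear.weight.shape
        self.A = torch.nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(out_f, r))   # delta starts at zero

    def forward(self, x):
        return self.base(x) + x @ (self.B @ self.A).T

def federated_round(server_adapters, client_datasets, local_update):
    """One FL round: each client fine-tunes only the adapter tensors on its
    private data; the server averages them (FedAvg over the small PEFT
    parameters only)."""
    client_states = [local_update({k: v.clone() for k, v in server_adapters.items()}, data)
                     for data in client_datasets]
    return {k: torch.stack([s[k] for s in client_states]).mean(dim=0)
            for k in server_adapters}
```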
arXiv Detail & Related papers (2024-04-29T16:42:26Z)
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
Using locality-sensitive hashing (LSH), we are able to drastically compress latent feature maps without sacrificing much accuracy.
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
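The underlying LSH trick is easy to illustrate (a toy sketch using our own random-hyperplane hashing and channel averaging, not the actual HASTE module): channels of a latent feature map are hashed by the signs of a few random projections, and channels that collide in a bucket are merged, shrinking the map with no training and no data.

```python
import torch

def lsh_merge_channels(x, num_bits=8, seed=0):
    """Toy LSH compression of one latent feature map x with shape (C, H, W).

    Each channel is flattened, hashed with random sign projections
    (random-hyperplane LSH), and channels that collide in the same bucket
    are replaced by their mean, so near-duplicate channels are merged.
    """
    C, H, W = x.shape
    gen = torch.Generator().manual_seed(seed)
    planes = torch.randn(num_bits, H * W, generator=gen)

    vectors = x.reshape(C, -1)                                  # (C, H*W)
    bits = (vectors @ planes.T > 0).long()                      # (C, num_bits)
    codes = (bits * (2 ** torch.arange(num_bits))).sum(dim=1)   # bucket id per channel

    merged = [vectors[codes == c].mean(dim=0) for c in codes.unique()]
    return torch.stack(merged).reshape(-1, H, W)                # (C' <= C, H, W)
```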
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- Strong Baselines for Parameter Efficient Few-Shot Fine-tuning [50.83426196335385]
Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase.
Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC.
Fine-tuning ViTs, however, is expensive in time, compute and storage.
This has motivated the design of parameter efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters.
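A typical PEFT baseline of this kind freezes the backbone and updates only a tiny, named subset of parameters; the concrete subset below (normalization parameters, biases, and the classification head) is one common choice and an assumption on our part, not necessarily the selection studied in that paper.

```python
import torch

def make_peft(model, trainable_keywords=("norm", "bias", "head")):
    """Freeze the pre-trained backbone and unfreeze only parameters whose
    names contain one of the given keywords (e.g. LayerNorm scales/offsets
    and the classifier head)."""
    total, trainable = 0, 0
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name.lower() for k in trainable_keywords)
        total += p.numel()
        if p.requires_grad:
            trainable += p.numel()
    print(f"fine-tuning {trainable}/{total} parameters ({100 * trainable / total:.2f}%)")
    return [p for p in model.parameters() if p.requires_grad]

# Usage for few-shot fine-tuning: only the small subset goes to the optimizer.
# params = make_peft(vit)                        # `vit` is any pre-trained ViT
# optimizer = torch.optim.AdamW(params, lr=1e-3)
```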
arXiv Detail & Related papers (2023-04-04T16:14:39Z)
- Computationally Budgeted Continual Learning: What Does Matter? [128.0827987414154]
Continual Learning (CL) aims to sequentially train models on streams of incoming data that vary in distribution by preserving previous knowledge while adapting to new data.
Current CL literature focuses on restricted access to previously seen data, while imposing no constraints on the computational budget for training.
We revisit this problem with a large-scale benchmark and analyze the performance of traditional CL approaches in a compute-constrained setting.
arXiv Detail & Related papers (2023-03-20T14:50:27Z)
- Improving Representational Continuity via Continued Pretraining [76.29171039601948]
A method from the transfer learning community, linear probing followed by fine-tuning (LP-FT), outperforms naive training and other continual learning methods.
LP-FT also reduces forgetting on a real-world satellite remote sensing dataset (FMoW).
A variant of LP-FT achieves state-of-the-art accuracy on an NLP continual learning benchmark.
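LP-FT itself is a two-stage recipe that can be sketched generically (assuming the classifier lives at `model.head`; the optimizer and hyperparameters are placeholders): first train only a linear probe on frozen pretrained features, then unfreeze everything and fine-tune end to end starting from that probe.

```python
import torch

def lp_ft(model, train_loader, lp_epochs=5, ft_epochs=5, lp_lr=1e-2, ft_lr=1e-4):
    """Linear Probing then Fine-Tuning (LP-FT), sketched generically.

    Stage 1 (LP): freeze the backbone, train only `model.head`.
    Stage 2 (FT): unfreeze all parameters and fine-tune end to end,
    initialized from the probe learned in stage 1.
    """
    loss_fn = torch.nn.CrossEntropyLoss()

    def run(params, lr, epochs):
        opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
        for _ in range(epochs):
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()

    # Stage 1: linear probe on frozen features.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.head.parameters():       # assumes the classifier is `model.head`
        p.requires_grad = True
    run(model.head.parameters(), lp_lr, lp_epochs)

    # Stage 2: full fine-tuning from the probe initialization.
    for p in model.parameters():
        p.requires_grad = True
    run(model.parameters(), ft_lr, ft_epochs)
```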
arXiv Detail & Related papers (2023-02-26T10:39:38Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- [Reproducibility Report] Rigging the Lottery: Making All Tickets Winners [1.6884611234933766]
$\textit{RigL}$, a sparse training algorithm, claims to directly train sparse networks that match or exceed the performance of existing dense-to-sparse training techniques.
We implement $\textit{RigL}$ from scratch in PyTorch and reproduce its performance on CIFAR-10 within 0.1% of the reported value.
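The heart of RigL is a periodic drop-and-grow update on a fixed parameter budget; the sketch below is our own simplification for a single weight tensor with a binary mask (the fixed `drop_fraction` stands in for RigL's decaying schedule): it drops the smallest-magnitude active weights and regrows the same number of connections where the dense gradient is largest, initializing them to zero.

```python
import torch

def rigl_update(weight, mask, grad, drop_fraction=0.3):
    """One simplified drop-and-grow step in the spirit of RigL.

    weight: dense parameter tensor; mask: binary tensor of the same shape
    marking active connections; grad: dense gradient of the loss w.r.t.
    weight. The number of active connections stays constant.
    """
    n_active = int(mask.sum().item())
    n_inactive = mask.numel() - n_active
    n_change = min(int(drop_fraction * n_active), n_inactive)
    if n_change == 0:
        return weight, mask

    # Drop: deactivate the active weights with the smallest magnitude.
    active_mag = torch.where(mask.bool(), weight.abs(),
                             torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(active_mag.flatten(), n_change, largest=False).indices

    # Grow: activate the inactive positions with the largest gradient magnitude.
    inactive_grad = torch.where(mask.bool(),
                                torch.full_like(grad, -float("inf")), grad.abs())
    grow_idx = torch.topk(inactive_grad.flatten(), n_change, largest=True).indices

    new_mask = mask.flatten().clone()
    new_mask[drop_idx] = 0.0
    new_mask[grow_idx] = 1.0
    new_weight = weight.flatten().clone()
    new_weight[grow_idx] = 0.0            # newly grown connections start at zero
    return new_weight.reshape(weight.shape), new_mask.reshape(mask.shape)
```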
arXiv Detail & Related papers (2021-03-29T17:01:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.