Challenging Common Assumptions about Catastrophic Forgetting
- URL: http://arxiv.org/abs/2207.04543v2
- Date: Mon, 15 May 2023 22:27:14 GMT
- Title: Challenging Common Assumptions about Catastrophic Forgetting
- Authors: Timothée Lesort, Oleksiy Ostapenko, Diganta Misra, Md Rifat Arefin,
Pau Rodríguez, Laurent Charlin, Irina Rish
- Abstract summary: We study the progressive knowledge accumulation (KA) in DNNs trained with gradient-based algorithms in long sequences of tasks with data re-occurrence.
We propose a new framework, SCoLe, to investigate KA and discover that catastrophic forgetting has a limited effect on DNNs trained with SGD.
- Score: 13.1202659074346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building learning agents that can progressively learn and accumulate
knowledge is the core goal of the continual learning (CL) research field.
Unfortunately, training a model on new data usually compromises the performance
on past data. In the CL literature, this effect is referred to as catastrophic
forgetting (CF). CF has been largely studied, and a plethora of methods have
been proposed to address it on short sequences of non-overlapping tasks. In
such setups, CF always leads to a quick and significant drop in performance on
past tasks. Nevertheless, despite CF, recent work has shown that SGD training on
linear models accumulates knowledge in a CL regression setup. This phenomenon
becomes especially visible when tasks reoccur. We might then wonder if DNNs
trained with SGD or any standard gradient-based optimization accumulate
knowledge in such a way. Such phenomena would have interesting consequences for
applying DNNs to real continual scenarios. Indeed, standard gradient-based
optimization methods are significantly less computationally expensive than
existing CL algorithms. In this paper, we study the progressive knowledge
accumulation (KA) in DNNs trained with gradient-based algorithms in long
sequences of tasks with data re-occurrence. We propose a new framework, SCoLe
(Scaling Continual Learning), to investigate KA and discover that catastrophic
forgetting has a limited effect on DNNs trained with SGD. When DNNs are trained
on long sequences with sparsely re-occurring data, their overall accuracy
improves, which might seem counter-intuitive given the CF phenomenon. We empirically investigate
KA in DNNs under various data occurrence frequencies and propose simple and
scalable strategies to increase knowledge accumulation in DNNs.
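To make the setup concrete, here is a minimal sketch of the kind of experiment the abstract describes: a long sequence of tasks, each exposing only a small, re-occurring subset of classes, trained with plain SGD and no dedicated CL mechanism while overall accuracy is tracked. This is a hypothetical reconstruction (MNIST, the small MLP, and the task-sampling scheme below are my assumptions), not the authors' SCoLe code.

```python
# Hypothetical SCoLe-style experiment (illustrative reconstruction, not the
# authors' code): a long sequence of tasks, each a small random subset of
# classes, trained with plain SGD while tracking accuracy over all classes.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

torch.manual_seed(0)
random.seed(0)

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
test_set = datasets.MNIST("data", train=False, download=True,
                          transform=transforms.ToTensor())
test_loader = DataLoader(test_set, batch_size=512)

# Single shared head over all 10 classes; no replay, no regularization.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

num_tasks, classes_per_task, steps_per_task = 200, 2, 50

for task in range(num_tasks):
    # Each task exposes a sparse subset of classes; classes re-occur over time.
    classes = random.sample(range(10), classes_per_task)
    idx = [i for i, y in enumerate(train_set.targets.tolist()) if y in classes]
    loader = DataLoader(Subset(train_set, idx), batch_size=64, shuffle=True)

    model.train()
    for step, (x, y) in enumerate(loader):
        if step >= steps_per_task:        # short plain-SGD budget per task
            break
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()

    if (task + 1) % 20 == 0:              # does accuracy accumulate over tasks?
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in test_loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        print(f"task {task + 1:3d}: overall test accuracy = {correct / total:.3f}")
```

The paper's actual datasets, architectures, and task-sampling distributions differ, but this is the overall shape of the protocol: no explicit CL method, only gradient-based training on a long task stream with data re-occurrence.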
Related papers
- TS-ACL: A Time Series Analytic Continual Learning Framework for Privacy-Preserving and Class-Incremental Pattern Recognition [14.6394894445113]
We propose a Time Series Analytic Continual Learning framework, called TS-ACL.
Inspired by analytical learning, TS-ACL transforms neural network updates into gradient-free linear regression problems (see the illustrative sketch at the end of this list).
Our framework is highly suitable for real-time applications and large-scale data processing.
arXiv Detail & Related papers (2024-10-21T12:34:02Z) - Temporal-Difference Variational Continual Learning [89.32940051152782]
A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks.
In Continual Learning settings, models often struggle to balance learning new tasks with retaining previous knowledge.
We propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations.
arXiv Detail & Related papers (2024-10-10T10:58:41Z) - An Effective Dynamic Gradient Calibration Method for Continual Learning [11.555822066922508]
Continual learning (CL) is a fundamental topic in machine learning, where the goal is to train a model with continuously incoming data and tasks.
Due to the memory limit, we cannot store all the historical data, and therefore confront the "catastrophic forgetting" problem.
We develop an effective algorithm to calibrate the gradient in each updating step of the model.
arXiv Detail & Related papers (2024-07-30T16:30:09Z) - Adaptive Retention & Correction for Continual Learning [114.5656325514408]
A common problem in continual learning is the classification layer's bias towards the most recent task.
We name our approach Adaptive Retention & Correction (ARC).
ARC achieves average performance increases of 2.7% and 2.6% on the CIFAR-100 and ImageNet-R datasets, respectively.
arXiv Detail & Related papers (2024-05-23T08:43:09Z) - Overcoming the Stability Gap in Continual Learning [15.8696301825572]
Pre-trained deep neural networks (DNNs) are being widely deployed by industry for making business decisions and to serve users.
A major problem is model decay, where the DNN's predictions become more erroneous over time, resulting in revenue loss or unhappy users.
Here, we study how continual learning (CL) could potentially overcome model decay in large pre-trained DNNs.
arXiv Detail & Related papers (2023-06-02T20:24:55Z) - Learning Bayesian Sparse Networks with Full Experience Replay for Continual Learning [54.7584721943286]
Continual Learning (CL) methods aim to enable machine learning models to learn new tasks without catastrophic forgetting of those that have been previously mastered.
Existing CL approaches often keep a buffer of previously-seen samples, perform knowledge distillation, or use regularization techniques towards this goal.
We propose to only activate and select sparse neurons for learning current and past tasks at any stage.
arXiv Detail & Related papers (2022-02-21T13:25:03Z) - AirLoop: Lifelong Loop Closure Detection [5.3759730885842725]
AirLoop is a method that leverages techniques from lifelong learning to minimize forgetting when training loop closure detection models incrementally.
We experimentally demonstrate the effectiveness of AirLoop on TartanAir, Nordland, and RobotCar datasets.
arXiv Detail & Related papers (2021-09-18T17:28:47Z) - Continual Learning in Recurrent Neural Networks [67.05499844830231]
We evaluate the effectiveness of continual learning methods for processing sequential data with recurrent neural networks (RNNs)
We shed light on the particularities that arise when applying weight-importance methods, such as elastic weight consolidation, to RNNs.
We show that the performance of weight-importance methods is not directly affected by the length of the processed sequences, but rather by high working memory requirements.
arXiv Detail & Related papers (2020-06-22T10:05:12Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
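As referenced in the TS-ACL entry above, "gradient-free linear regression" can replace gradient-based classifier training in analytic continual learning. The sketch below is purely illustrative (the random-projection "encoder", the dimensions, and the ridge parameter are my assumptions, not TS-ACL's actual design): a frozen feature map feeds a ridge-regression classifier whose sufficient statistics are accumulated task by task, so new tasks are absorbed without gradients or stored past data.

```python
# Illustrative sketch of analytic (gradient-free) continual learning in the
# spirit of TS-ACL: accumulate ridge-regression statistics per task and solve
# the classifier in closed form. Not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat, n_classes, lam = 784, 128, 10, 1e-3

proj = rng.normal(size=(d_in, d_feat)) / np.sqrt(d_in)  # stand-in frozen encoder
A = lam * np.eye(d_feat)                                # running H^T H + lam*I
B = np.zeros((d_feat, n_classes))                       # running H^T Y

def features(x):
    return np.maximum(x @ proj, 0.0)                    # frozen ReLU features

def update(x_task, y_task):
    """Absorb one task: no gradients, no replay buffer, no stored raw data."""
    global A, B
    H = features(x_task)
    Y = np.eye(n_classes)[y_task]                       # one-hot targets
    A += H.T @ H
    B += H.T @ Y

def predict(x):
    W = np.linalg.solve(A, B)                           # closed-form ridge weights
    return np.argmax(features(x) @ W, axis=1)

# Usage: stream two disjoint "tasks", then classify samples from both.
x1, y1 = rng.normal(size=(256, d_in)), rng.integers(0, 5, 256)
x2, y2 = rng.normal(size=(256, d_in)), rng.integers(5, 10, 256)
update(x1, y1)
update(x2, y2)
print(predict(x1[:5]), predict(x2[:5]))
```

Because the classifier is the closed-form solution over all accumulated statistics, later updates cannot overwrite earlier ones, which is the sense in which such analytic updates sidestep gradient-based forgetting.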