How catastrophic can catastrophic forgetting be in linear regression?
- URL: http://arxiv.org/abs/2205.09588v1
- Date: Thu, 19 May 2022 14:28:40 GMT
- Title: How catastrophic can catastrophic forgetting be in linear regression?
- Authors: Itay Evron, Edward Moroshko, Rachel Ward, Nati Srebro, Daniel Soudry
- Abstract summary: We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks.
We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method.
- Score: 30.702863017223457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To better understand catastrophic forgetting, we study fitting an
overparameterized linear model to a sequence of tasks with different input
distributions. We analyze how much the model forgets the true labels of earlier
tasks after training on subsequent tasks, obtaining exact expressions and
bounds. We establish connections between continual learning in the linear
setting and two other research areas: alternating projections and the Kaczmarz
method. In specific settings, we highlight differences between forgetting and
convergence to the offline solution as studied in those areas. In particular,
when T tasks in d dimensions are presented cyclically for k iterations, we
prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This
stands in contrast to the convergence to the offline solution, which can be
arbitrarily slow according to existing alternating projection results. We
further show that the T^2 factor can be lifted when tasks are presented in a
random ordering.
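To make the setting concrete, here is a minimal NumPy simulation sketch (not the authors' code; the sizes, Gaussian task distributions, and helper names such as `project_onto_task` are illustrative assumptions). Fitting each task exactly with the minimum-norm update from the current weights is an orthogonal projection onto that task's solution set, i.e., a block-Kaczmarz / alternating-projections step, and one natural notion of forgetting is the average error on the true labels of all tasks:

```python
# Minimal sketch of continual learning in overparameterized linear regression.
# Each task m has data (X_m, y_m) with y_m = X_m @ w_star (a shared realizable
# solution, matching the abstract's setting). Tasks are presented cyclically.
import numpy as np

rng = np.random.default_rng(0)
d, T, n_per_task, k = 50, 5, 10, 200     # dimension, tasks, samples per task, cycles
w_star = rng.normal(size=d)              # shared offline solution
tasks = [rng.normal(size=(n_per_task, d)) for _ in range(T)]
labels = [X @ w_star for X in tasks]     # realizable true labels for every task

def project_onto_task(w, X, y):
    """Minimum-norm update that fits (X, y) exactly: w <- w + X^+ (y - X w)."""
    return w + np.linalg.pinv(X) @ (y - X @ w)

def forgetting(w):
    """Average squared error on the true labels of all tasks (one notion of forgetting)."""
    return np.mean([np.mean((X @ w - y) ** 2) for X, y in zip(tasks, labels)])

w = np.zeros(d)
for it in range(k):                      # cyclic ordering: task 0, 1, ..., T-1, repeat
    for m in range(T):
        w = project_onto_task(w, tasks[m], labels[m])
    if (it + 1) % 50 == 0:
        print(f"cycle {it + 1:4d}  forgetting ~ {forgetting(w):.3e}")
```

In such a run the measured forgetting should shrink as the number of cycles k grows, in line with the T^2 * min{1/sqrt(k), d/k} bound above, even though convergence of w to the offline solution can be much slower.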
Related papers
- Soup to go: mitigating forgetting during continual learning with model averaging [24.3125190049867]
In continual learning, fine-tuning on later tasks will often lead to performance degradation on earlier tasks.
Inspired by other merging methods and L2-regression, we propose Sequential Fine-tuning with Averaging (SFA).
Our method achieves comparable results without the need to store past data.
In turn, our method offers insight into the benefits of merging partially-trained models during training across both image and language domains (a minimal sketch of this averaging idea appears after this list).
arXiv Detail & Related papers (2025-01-09T20:11:08Z) - Symmetry Discovery for Different Data Types [52.2614860099811]
Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance.
We propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks.
We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging.
arXiv Detail & Related papers (2024-10-13T13:39:39Z) - Task-recency bias strikes back: Adapting covariances in Exemplar-Free Class Incremental Learning [0.3281128493853064]
We tackle the problem of training a model on a sequence of tasks without access to past data.
Existing methods represent classes as Gaussian distributions in the feature extractor's latent space.
We propose AdaGauss -- a novel method that adapts covariance matrices from task to task.
arXiv Detail & Related papers (2024-09-26T20:18:14Z) - Amortizing intractable inference in diffusion models for vision, language, and control [89.65631572949702]
This paper studies amortized sampling of the posterior over data, $\mathbf{x}\sim p^{\mathrm{post}}(\mathbf{x})\propto p(\mathbf{x})\,r(\mathbf{x})$, in a model that consists of a diffusion generative model prior $p(\mathbf{x})$ and a black-box constraint or function $r(\mathbf{x})$.
We prove the correctness of a data-free learning objective, relative trajectory balance, for training a diffusion model that samples from this posterior.
arXiv Detail & Related papers (2024-05-31T16:18:46Z) - The Joint Effect of Task Similarity and Overparameterization on Catastrophic Forgetting -- An Analytical Model [36.766748277141744]
In continual learning, catastrophic forgetting is affected by multiple aspects of the tasks.
Previous works have analyzed separately how forgetting is affected by either task similarity or overparameterization.
This paper examines how task similarity and overparameterization jointly affect forgetting in an analyzable model.
arXiv Detail & Related papers (2024-01-23T10:16:44Z) - Continual learning for surface defect segmentation by subnetwork creation and selection [55.2480439325792]
We introduce a new continual (or lifelong) learning algorithm that performs segmentation tasks without undergoing catastrophic forgetting.
The method is applied to two different surface defect segmentation problems that are learned incrementally.
Our approach shows comparable results with joint training when all the training data (all defects) are seen simultaneously.
arXiv Detail & Related papers (2023-12-08T15:28:50Z) - Intersection of Parallels as an Early Stopping Criterion [64.8387564654474]
We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
arXiv Detail & Related papers (2022-08-19T19:42:41Z) - Statistical Inference of Constrained Stochastic Optimization via Sketched Sequential Quadratic Programming [53.63469275932989]
We consider online statistical inference of constrained nonlinear optimization problems.
We apply a Stochastic Sequential Quadratic Programming (StoSQP) method to solve these problems.
arXiv Detail & Related papers (2022-05-27T00:34:03Z) - Contrastive learning of strong-mixing continuous-time stochastic processes [53.82893653745542]
Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data.
We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case.
arXiv Detail & Related papers (2021-03-03T23:06:47Z) - Consistent Online Gaussian Process Regression Without the Sample Complexity Bottleneck [14.309243378538012]
We propose an online compression scheme that fixes an error neighborhood with respect to the Hellinger metric centered at the current posterior.
For constant error radius, POG converges to a neighborhood of the population posterior (Theorem 1(ii)), but with finite memory at worst determined by the metric entropy of the feature space.
arXiv Detail & Related papers (2020-04-23T11:52:06Z)
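As referenced in the "Soup to go" entry above, here is a minimal, hypothetical sketch of the general sequential-fine-tune-then-average idea behind SFA-style methods (toy linear model; names such as `finetune` and all sizes are illustrative assumptions, not the paper's implementation):

```python
# Hypothetical sketch of sequential fine-tuning with parameter averaging:
# after fine-tuning on each new task, blend the resulting weights with the
# running average of earlier task models so later tasks overwrite earlier
# ones less aggressively.
import numpy as np

rng = np.random.default_rng(1)
d, T, n = 20, 4, 30                       # parameter dimension, tasks, samples per task

def finetune(w, X, y, lr=0.01, steps=500):
    """Toy least-squares fine-tuning from the current weights by gradient descent."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

w_avg = np.zeros(d)                                       # running average of task models
for t in range(T):
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)    # stand-in data for task t
    w_task = finetune(w_avg.copy(), X, y)                 # fine-tune on the new task
    w_avg = (t * w_avg + w_task) / (t + 1)                # average with past models
```

The averaging step is what limits how much later tasks overwrite earlier ones; per the summary above, the actual method merges partially-trained neural-network models during training rather than this toy linear setup.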