How catastrophic can catastrophic forgetting be in linear regression?
- URL: http://arxiv.org/abs/2205.09588v1
- Date: Thu, 19 May 2022 14:28:40 GMT
- Title: How catastrophic can catastrophic forgetting be in linear regression?
- Authors: Itay Evron, Edward Moroshko, Rachel Ward, Nati Srebro, Daniel Soudry
- Abstract summary: We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks.
We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method.
- Score: 30.702863017223457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To better understand catastrophic forgetting, we study fitting an
overparameterized linear model to a sequence of tasks with different input
distributions. We analyze how much the model forgets the true labels of earlier
tasks after training on subsequent tasks, obtaining exact expressions and
bounds. We establish connections between continual learning in the linear
setting and two other research areas: alternating projections and the Kaczmarz
method. In specific settings, we highlight differences between forgetting and
convergence to the offline solution as studied in those areas. In particular,
when T tasks in d dimensions are presented cyclically for k iterations, we
prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This
stands in contrast to the convergence to the offline solution, which can be
arbitrarily slow according to existing alternating projection results. We
further show that the T^2 factor can be lifted when tasks are presented in a
random ordering.
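Below is a minimal sketch of the setting the abstract describes, written to make the alternating-projections / Kaczmarz connection concrete. It is not the authors' code: the task sizes, the Gaussian input distributions, and the forgetting metric reported at the end are illustrative assumptions; the continual-learning update used here is simply the minimum-norm correction that fits the current task exactly, i.e. an orthogonal projection of the weights onto that task's solution set.

```python
# Sketch (not the authors' code): continual linear regression with T realizable
# tasks in d dimensions. Each training step fits one task to zero error with the
# smallest possible parameter change -- an orthogonal projection onto that task's
# affine solution set, i.e. a block-Kaczmarz / alternating-projections step.
import numpy as np

rng = np.random.default_rng(0)

d, T, n_t, k = 50, 5, 10, 200          # dimension, tasks, samples per task, cycles (illustrative)
w_star = rng.normal(size=d)            # shared "true" labeling vector (realizable setting)
tasks = []
for _ in range(T):
    X = rng.normal(size=(n_t, d))      # task-specific inputs (here simply iid Gaussian)
    tasks.append((X, X @ w_star))      # true labels y = X w*

def project_onto_task(w, X, y):
    """Minimum-norm update that makes w fit (X, y) exactly."""
    return w + np.linalg.pinv(X) @ (y - X @ w)

def forgetting(w):
    """Illustrative proxy for forgetting: average squared error on the true labels of all tasks."""
    return np.mean([np.mean((X @ w - y) ** 2) for X, y in tasks])

w_cyclic = np.zeros(d)
for _ in range(k):                     # cyclic ordering: 1, 2, ..., T, repeated for k cycles
    for X, y in tasks:
        w_cyclic = project_onto_task(w_cyclic, X, y)

w_random = np.zeros(d)
for _ in range(k * T):                 # random ordering: tasks sampled uniformly at random
    X, y = tasks[rng.integers(T)]
    w_random = project_onto_task(w_random, X, y)

print(f"forgetting (cyclic, k={k}): {forgetting(w_cyclic):.3e}")
print(f"forgetting (random):        {forgetting(w_random):.3e}")
```

Each cyclic pass over the tasks is one sweep of alternating projections (equivalently, a block-Kaczmarz iteration), which is the connection the paper exploits to bound forgetting; the second loop uses the random ordering under which, per the abstract, the T^2 factor can be lifted.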
Related papers
- Symmetry Discovery for Different Data Types [52.2614860099811]
Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance.
We propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks.
We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging.
arXiv Detail & Related papers (2024-10-13T13:39:39Z) - Task-recency bias strikes back: Adapting covariances in Exemplar-Free Class Incremental Learning [0.3281128493853064]
We tackle the problem of training a model on a sequence of tasks without access to past data.
Existing methods represent classes as Gaussian distributions in the feature extractor's latent space.
We propose AdaGauss -- a novel method that adapts covariance matrices from task to task.
arXiv Detail & Related papers (2024-09-26T20:18:14Z) - Amortizing intractable inference in diffusion models for vision, language, and control [89.65631572949702]
This paper studies amortized sampling of the posterior over data, $\mathbf{x} \sim p^{\rm post}(\mathbf{x}) \propto p(\mathbf{x})\,r(\mathbf{x})$, in a model that consists of a diffusion generative model prior $p(\mathbf{x})$ and a black-box constraint or function $r(\mathbf{x})$.
We prove the correctness of a data-free learning objective, relative trajectory balance, for training a diffusion model that samples from this posterior.
arXiv Detail & Related papers (2024-05-31T16:18:46Z) - The Joint Effect of Task Similarity and Overparameterization on Catastrophic Forgetting -- An Analytical Model [36.766748277141744]
In continual learning, catastrophic forgetting is affected by multiple aspects of the tasks.
Previous works have analyzed separately how forgetting is affected by either task similarity or overparameterization.
This paper examines how task similarity and overparameterization jointly affect forgetting in an analyzable model.
arXiv Detail & Related papers (2024-01-23T10:16:44Z) - Continual learning for surface defect segmentation by subnetwork creation and selection [55.2480439325792]
We introduce a new continual (or lifelong) learning algorithm that performs segmentation tasks without undergoing catastrophic forgetting.
The method is applied to two different surface defect segmentation problems that are learned incrementally.
Our approach achieves results comparable to joint training, in which all the training data (all defects) are seen simultaneously.
arXiv Detail & Related papers (2023-12-08T15:28:50Z) - Intersection of Parallels as an Early Stopping Criterion [64.8387564654474]
We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
arXiv Detail & Related papers (2022-08-19T19:42:41Z) - Statistical Inference of Constrained Stochastic Optimization via Sketched Sequential Quadratic Programming [53.63469275932989]
We consider online statistical inference of constrained nonlinear optimization problems.
We apply a Stochastic Sequential Quadratic Programming (StoSQP) method to solve these problems.
arXiv Detail & Related papers (2022-05-27T00:34:03Z) - Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression [14.493176427999028]
We study the benign overfitting theory in the prediction of the conditional average treatment effect (CATE) with linear regression models.
We show that the T-learner fails to achieve consistency except under random assignment, while the IPW-learner converges its risk to zero if the propensity score is known.
arXiv Detail & Related papers (2022-02-10T18:51:52Z) - Contrastive learning of strong-mixing continuous-time stochastic processes [53.82893653745542]
Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data.
We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case.
arXiv Detail & Related papers (2021-03-03T23:06:47Z) - Consistent Online Gaussian Process Regression Without the Sample Complexity Bottleneck [14.309243378538012]
We propose an online compression scheme that fixes an error neighborhood with respect to the Hellinger metric centered at the current posterior.
For a constant error radius, POG converges to a neighborhood of the population posterior (Theorem 1(ii)), but with finite memory whose worst case is determined by the metric entropy of the feature space.
arXiv Detail & Related papers (2020-04-23T11:52:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information it provides and is not responsible for any consequences of its use.