How catastrophic can catastrophic forgetting be in linear regression?
- URL: http://arxiv.org/abs/2205.09588v1
- Date: Thu, 19 May 2022 14:28:40 GMT
- Title: How catastrophic can catastrophic forgetting be in linear regression?
- Authors: Itay Evron, Edward Moroshko, Rachel Ward, Nati Srebro, Daniel Soudry
- Abstract summary: We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks.
We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method.
- Score: 30.702863017223457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To better understand catastrophic forgetting, we study fitting an
overparameterized linear model to a sequence of tasks with different input
distributions. We analyze how much the model forgets the true labels of earlier
tasks after training on subsequent tasks, obtaining exact expressions and
bounds. We establish connections between continual learning in the linear
setting and two other research areas: alternating projections and the Kaczmarz
method. In specific settings, we highlight differences between forgetting and
convergence to the offline solution as studied in those areas. In particular,
when T tasks in d dimensions are presented cyclically for k iterations, we
prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This
stands in contrast to the convergence to the offline solution, which can be
arbitrarily slow according to existing alternating projection results. We
further show that the T^2 factor can be lifted when tasks are presented in a
random ordering.
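To make the setting concrete, here is a minimal NumPy simulation sketch (not the authors' code; the sizes, Gaussian task distributions, and helper names such as `project_onto_task` are illustrative assumptions). Fitting each task exactly with the minimum-norm update from the current weights is an orthogonal projection onto that task's solution set, i.e., a block-Kaczmarz / alternating-projections step, and one natural notion of forgetting is the average error on the true labels of all tasks:

```python
# Minimal sketch of continual learning in overparameterized linear regression.
# Each task m has data (X_m, y_m) with y_m = X_m @ w_star (a shared realizable
# solution, matching the abstract's setting). Tasks are presented cyclically.
import numpy as np

rng = np.random.default_rng(0)
d, T, n_per_task, k = 50, 5, 10, 200     # dimension, tasks, samples per task, cycles
w_star = rng.normal(size=d)              # shared offline solution
tasks = [rng.normal(size=(n_per_task, d)) for _ in range(T)]
labels = [X @ w_star for X in tasks]     # realizable true labels for every task

def project_onto_task(w, X, y):
    """Minimum-norm update that fits (X, y) exactly: w <- w + X^+ (y - X w)."""
    return w + np.linalg.pinv(X) @ (y - X @ w)

def forgetting(w):
    """Average squared error on the true labels of all tasks (one notion of forgetting)."""
    return np.mean([np.mean((X @ w - y) ** 2) for X, y in zip(tasks, labels)])

w = np.zeros(d)
for it in range(k):                      # cyclic ordering: task 0, 1, ..., T-1, repeat
    for m in range(T):
        w = project_onto_task(w, tasks[m], labels[m])
    if (it + 1) % 50 == 0:
        print(f"cycle {it + 1:4d}  forgetting ~ {forgetting(w):.3e}")
```

In such a run the measured forgetting should shrink as the number of cycles k grows, in line with the T^2 * min{1/sqrt(k), d/k} bound above, even though convergence of w to the offline solution can be much slower.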
Related papers
- Soup to go: mitigating forgetting during continual learning with model averaging [24.3125190049867]
In continual learning, fine-tuning on later tasks will often lead to performance degradation on earlier tasks.
Inspired by other merging methods and L2-regression, we propose Sequential Fine-tuning with Averaging (SFA).
Our method achieves comparable results without the need to store past data.
In turn, our method offers insight into the benefits of merging partially-trained models during training across both image and language domains (a minimal sketch of this averaging idea appears after this list).
arXiv Detail & Related papers (2025-01-09T20:11:08Z) - Symmetry Discovery for Different Data Types [52.2614860099811]
Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance.
We propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks.
We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging.
arXiv Detail & Related papers (2024-10-13T13:39:39Z) - Task-recency bias strikes back: Adapting covariances in Exemplar-Free Class Incremental Learning [0.3281128493853064]
We tackle the problem of training a model on a sequence of tasks without access to past data.
Existing methods represent classes as Gaussian distributions in the feature extractor's latent space.
We propose AdaGauss -- a novel method that adapts covariance matrices from task to task.
arXiv Detail & Related papers (2024-09-26T20:18:14Z) - Amortizing intractable inference in diffusion models for vision, language, and control [89.65631572949702]
This paper studies amortized sampling of the posterior over data, $\mathbf{x}\sim p^{\mathrm{post}}(\mathbf{x})\propto p(\mathbf{x})\,r(\mathbf{x})$, in a model that consists of a diffusion generative model prior $p(\mathbf{x})$ and a black-box constraint or function $r(\mathbf{x})$.
We prove the correctness of a data-free learning objective, relative trajectory balance, for training a diffusion model that samples from this posterior.
arXiv Detail & Related papers (2024-05-31T16:18:46Z) - The Joint Effect of Task Similarity and Overparameterization on Catastrophic Forgetting -- An Analytical Model [36.766748277141744]
In continual learning, catastrophic forgetting is affected by multiple aspects of the tasks.
Previous works have analyzed separately how forgetting is affected by either task similarity or overparameterization.
This paper examines how task similarity and overparameterization jointly affect forgetting in an analyzable model.
arXiv Detail & Related papers (2024-01-23T10:16:44Z) - Continual learning for surface defect segmentation by subnetwork creation and selection [55.2480439325792]
We introduce a new continual (or lifelong) learning algorithm that performs segmentation tasks without undergoing catastrophic forgetting.
The method is applied to two different surface defect segmentation problems that are learned incrementally.
Our approach shows comparable results with joint training when all the training data (all defects) are seen simultaneously.
arXiv Detail & Related papers (2023-12-08T15:28:50Z) - Intersection of Parallels as an Early Stopping Criterion [64.8387564654474]
We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
arXiv Detail & Related papers (2022-08-19T19:42:41Z) - Statistical Inference of Constrained Stochastic Optimization via Sketched Sequential Quadratic Programming [53.63469275932989]
We consider online statistical inference of constrained nonlinear optimization problems.
We apply a Stochastic Sequential Quadratic Programming (StoSQP) method to solve these problems.
arXiv Detail & Related papers (2022-05-27T00:34:03Z) - Contrastive learning of strong-mixing continuous-time stochastic processes [53.82893653745542]
Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data.
We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case.
arXiv Detail & Related papers (2021-03-03T23:06:47Z) - Consistent Online Gaussian Process Regression Without the Sample Complexity Bottleneck [14.309243378538012]
We propose an online compression scheme that fixes an error neighborhood with respect to the Hellinger metric centered at the current posterior.
For constant error radius, POG converges to a neighborhood of the population posterior (Theorem 1(ii)), but with finite memory at worst determined by the metric entropy of the feature space.
arXiv Detail & Related papers (2020-04-23T11:52:06Z)
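As referenced in the "Soup to go" entry above, here is a minimal, hypothetical sketch of the general sequential-fine-tune-then-average idea behind SFA-style methods (toy linear model; names such as `finetune` and all sizes are illustrative assumptions, not the paper's implementation):

```python
# Hypothetical sketch of sequential fine-tuning with parameter averaging:
# after fine-tuning on each new task, blend the resulting weights with the
# running average of earlier task models so later tasks overwrite earlier
# ones less aggressively.
import numpy as np

rng = np.random.default_rng(1)
d, T, n = 20, 4, 30                       # parameter dimension, tasks, samples per task

def finetune(w, X, y, lr=0.01, steps=500):
    """Toy least-squares fine-tuning from the current weights by gradient descent."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

w_avg = np.zeros(d)                                       # running average of task models
for t in range(T):
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)    # stand-in data for task t
    w_task = finetune(w_avg.copy(), X, y)                 # fine-tune on the new task
    w_avg = (t * w_avg + w_task) / (t + 1)                # average with past models
```

The averaging step is what limits how much later tasks overwrite earlier ones; per the summary above, the actual method merges partially-trained neural-network models during training rather than this toy linear setup.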