Related papers: Analysis of Overparameterization in Continual Learning under a Linear Model

Analysis of Overparameterization in Continual Learning under a Linear Model

URL: http://arxiv.org/abs/2502.10442v1
Date: Tue, 11 Feb 2025 00:15:38 GMT
Title: Analysis of Overparameterization in Continual Learning under a Linear Model
Authors: Daniel Goldfarb, Paul Hand,
Abstract summary: We study continual learning and catastrophic forgetting from a theoretical perspective in the simple setting of gradient descent.<n>We analytically demonstrate that over parameterization alone can mitigate forgetting in the context of a linear regression model.<n>As part of this work, we establish a non-asymptotic bound of the risk of a single linear regression task, which may be of independent interest to the field of double descent theory.
Score: 5.5165579223151795
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous machine learning systems that learn many tasks in sequence are prone to the catastrophic forgetting problem. Mathematical theory is needed in order to understand the extent of forgetting during continual learning. As a foundational step towards this goal, we study continual learning and catastrophic forgetting from a theoretical perspective in the simple setting of gradient descent with no explicit algorithmic mechanism to prevent forgetting. In this setting, we analytically demonstrate that overparameterization alone can mitigate forgetting in the context of a linear regression model. We consider a two-task setting motivated by permutation tasks, and show that as the overparameterization ratio becomes sufficiently high, a model trained on both tasks in sequence results in a low-risk estimator for the first task. As part of this work, we establish a non-asymptotic bound of the risk of a single linear regression task, which may be of independent interest to the field of double descent theory.

Related papers

Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning [59.6658995479243]
We propose texttext-Perturb-and-Merge (P&M), a novel continual learning framework that integrates model merging into the CL paradigm to avoid forgetting.<n>Through theoretical analysis, we minimize the total loss increase across all tasks and derive an analytical solution for the optimal merging coefficient.<n>Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets.
arXiv Detail & Related papers (2025-05-28T14:14:19Z)
In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
Attention layers provably solve single-location regression [12.355792442566681]
Attention-based models, such as Transformer, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding tokenwise sparsity and internal linear structures.<n>We introduce the single-location regression task, where only one token in a sequence determines the output, and position is a latent, retrievable via a linear projection.
arXiv Detail & Related papers (2024-10-02T13:28:02Z)
Embedding generalization within the learning dynamics: An approach based-on sample path large deviation theory [0.0]
We consider an empirical risk perturbation based learning problem that exploits methods from continuous-time perspective. We provide an estimate in the small noise limit based on the Freidlin-Wentzell theory of large deviations. We also present a computational algorithm that solves the corresponding variational problem leading to an optimal point estimates.
arXiv Detail & Related papers (2024-08-04T23:31:35Z)
Understanding Forgetting in Continual Learning with Linear Regression [21.8755265936716]
Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently. We provide a general theoretical analysis of forgetting in the linear regression model via Gradient Descent. We demonstrate that, given a sufficiently large data size, the arrangement of tasks in a sequence, where tasks with larger eigenvalues in their population data covariance matrices are trained later, tends to result in increased forgetting.
arXiv Detail & Related papers (2024-05-27T18:33:37Z)
How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities. We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features. We find new and interesting properties that do not exist in single-task linear regression. Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing [74.2952487120137]
It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem.
arXiv Detail & Related papers (2023-01-27T02:30:51Z)
Analysis of Catastrophic Forgetting for Random Orthogonal Transformation Tasks in the Overparameterized Regime [9.184987303791292]
We show that in permuted MNIST image classification tasks, the performance of multilayer perceptrons trained by vanilla gradient descent can be improved. We provide a theoretical explanation of this effect by studying a qualitatively similar two-task linear regression problem. We show that when a model is trained on the two tasks in sequence without any additional regularization, the risk gain on the first task is small.
arXiv Detail & Related papers (2022-06-01T18:04:33Z)
Mitigating multiple descents: A model-agnostic framework for risk monotonization [84.6382406922369]
We develop a general framework for risk monotonization based on cross-validation. We propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting.
arXiv Detail & Related papers (2022-05-25T17:41:40Z)
Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.