Analysis of Catastrophic Forgetting for Random Orthogonal Transformation
Tasks in the Overparameterized Regime
- URL: http://arxiv.org/abs/2207.06475v1
- Date: Wed, 1 Jun 2022 18:04:33 GMT
- Title: Analysis of Catastrophic Forgetting for Random Orthogonal Transformation
Tasks in the Overparameterized Regime
- Authors: Daniel Goldfarb, Paul Hand
- Abstract summary: We show that in permuted MNIST image classification tasks, the generalization performance of multilayer perceptrons trained by vanilla stochastic gradient descent can be improved by overparameterization.
We provide a theoretical explanation of this effect by studying a qualitatively similar two-task linear regression problem.
We show that when a model is trained on the two tasks in sequence without any additional regularization, the risk gain on the first task is small if the model is sufficiently overparameterized.
- Score: 9.184987303791292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Overparameterization is known to permit strong generalization performance in
neural networks. In this work, we provide an initial theoretical analysis of
its effect on catastrophic forgetting in a continual learning setup. We show
experimentally that in permuted MNIST image classification tasks, the
generalization performance of multilayer perceptrons trained by vanilla
stochastic gradient descent can be improved by overparameterization, and the
extent of the performance increase achieved by overparameterization is
comparable to that of state-of-the-art continual learning algorithms. We
provide a theoretical explanation of this effect by studying a qualitatively
similar two-task linear regression problem, where each task is related by a
random orthogonal transformation. We show that when a model is trained on the
two tasks in sequence without any additional regularization, the risk gain on
the first task is small if the model is sufficiently overparameterized.
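The two-task setup described in the abstract is straightforward to simulate. Below is a minimal sketch (not the authors' code; the dimensions, the rotation construction, and the min-norm training step are assumptions of mine) of sequential training on a task and its randomly rotated copy, measuring the risk gain on the first task:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2000            # n samples per task, d parameters (d >> n)

theta_star = rng.standard_normal(d) / np.sqrt(d)   # task-1 ground truth
X1 = rng.standard_normal((n, d))
y1 = X1 @ theta_star

# Task 2: the same inputs with coordinates rotated by a random orthogonal Q
# (an analogue of pixel-permuted MNIST); its ground truth is Q @ theta_star.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
X2 = X1 @ Q.T
y2 = y1

def train(w0, X, y):
    # Gradient descent on the squared loss started from w0 converges to the
    # interpolant of (X, y) that is closest to w0 in Euclidean norm.
    return w0 + np.linalg.pinv(X) @ (y - X @ w0)

w1 = train(np.zeros(d), X1, y1)    # train on task 1
w2 = train(w1, X2, y2)             # then on task 2, no regularization

# Population risk on task 1 (isotropic Gaussian inputs): ||w - theta_star||^2.
risk = lambda w: np.sum((w - theta_star) ** 2)
print(f"task-1 risk after task 1: {risk(w1):.4f}; after task 2: {risk(w2):.4f}")
```

Re-running with larger d at fixed n should shrink the gap between the two reported risks, which is the qualitative effect the paper quantifies.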
Related papers
- A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities [30.737171081270322]
We study how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step.
This provides a sharp description of the impact of feature learning in the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
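As a rough numerical illustration of the single-step setting (a toy of my own; the architecture, scaling, and step size are assumptions rather than the paper's), one can take a large gradient step on the first layer and inspect the singular-value spectrum of the weights:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 400, 300, 300
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.tanh(X @ rng.standard_normal(d))      # assumed single-index target

W = rng.standard_normal((p, d))              # first layer, random init
a = rng.standard_normal(p) / np.sqrt(p)      # second layer, held fixed

# Gradient (up to a constant) of the squared loss w.r.t. W, f(x) = a . tanh(W x).
Z = np.tanh(X @ W.T)                         # (n, p) hidden activations
resid = Z @ a - y
G = (resid[:, None] * (1.0 - Z**2) * a).T @ X / n

eta = 1000.0                                 # a single, aggressive step
W1 = W - eta * G

s_before = np.linalg.svd(W, compute_uv=False)
s_after = np.linalg.svd(W1, compute_uv=False)
print("top singular values before:", s_before[:3].round(1))
print("top singular values after :", s_after[:3].round(1))
```

With a sufficiently aggressive step, outliers detach from the bulk of the spectrum; such spikes are the objects the paper's random matrix analysis characterizes.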
arXiv Detail & Related papers (2024-10-24T17:24:34Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
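A quick empirical companion (my construction, not taken from the paper): sweep the ridge penalty at a fixed overparameterization ratio d/n and compare train and test risk:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 200, 400, 0.5                  # overparameterized: d/n = 2
beta = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ beta + sigma * rng.standard_normal(n)

Xtest = rng.standard_normal((4000, d))
ytest = Xtest @ beta + sigma * rng.standard_normal(4000)

for lam in [1e-4, 1e-2, 1.0, 10.0, 100.0]:
    # Ridge estimator: (X^T X + lam I)^{-1} X^T y
    bhat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    train = np.mean((X @ bhat - y) ** 2)
    test = np.mean((Xtest @ bhat - ytest) ** 2)
    print(f"lam={lam:8.4f}  train={train:.3f}  test={test:.3f}")
```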
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - The Joint Effect of Task Similarity and Overparameterization on
Catastrophic Forgetting -- An Analytical Model [36.766748277141744]
In continual learning, catastrophic forgetting is affected by multiple aspects of the tasks.
Previous works have analyzed separately how forgetting is affected by either task similarity or overparameterization.
This paper examines how task similarity and overparameterization jointly affect forgetting in an analyzable model.
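One way to probe the joint effect numerically (a toy of my own design, not the paper's analytical model): interpolate task 2's ground truth between task 1's and an orthogonal direction, and sweep the parameter dimension:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40                                        # samples per task (assumed)

def forgetting(d, mix):
    theta1 = rng.standard_normal(d); theta1 /= np.linalg.norm(theta1)
    u = rng.standard_normal(d)
    u -= (u @ theta1) * theta1; u /= np.linalg.norm(u)
    theta2 = (1.0 - mix) * theta1 + mix * u   # mix=0: same task, mix=1: orthogonal
    theta2 /= np.linalg.norm(theta2)
    X1 = rng.standard_normal((n, d))
    X2 = rng.standard_normal((n, d))
    w1 = np.linalg.pinv(X1) @ (X1 @ theta1)                   # min-norm fit, task 1
    w2 = w1 + np.linalg.pinv(X2) @ (X2 @ theta2 - X2 @ w1)    # then task 2
    # Task-1 population risk gain from training on task 2.
    return np.sum((w2 - theta1) ** 2) - np.sum((w1 - theta1) ** 2)

for d in [80, 320, 1280]:
    vals = [forgetting(d, m) for m in (0.0, 0.5, 1.0)]
    print(f"d={d:4d}  forgetting at mix 0 / 0.5 / 1:",
          " ".join(f"{v:+.3f}" for v in vals))
```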
arXiv Detail & Related papers (2024-01-23T10:16:44Z) - Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
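The core mechanism is easy to verify numerically. The sketch below (a standard construction in this literature, not code from the paper) checks that a linear self-attention readout over in-context (x_i, y_i) pairs coincides with one gradient-descent step on the in-context regression loss:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eta = 32, 8, 0.1
X = rng.standard_normal((n, d))            # in-context inputs
y = X @ rng.standard_normal(d)             # in-context targets
xq = rng.standard_normal(d)                # query token

# One GD step on L(w) = 0.5 * ||X w - y||^2 starting from w = 0:
w1 = eta * X.T @ y
pred_gd = xq @ w1

# Linear attention (no softmax) with values y_i, keys x_i, query xq:
pred_attn = eta * np.sum(y * (X @ xq))

print(np.isclose(pred_gd, pred_attn))      # True: the two coincide exactly
```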
arXiv Detail & Related papers (2023-09-11T22:42:50Z) - Regularization, early-stopping and dreaming: a Hopfield-like setup to
address generalization and overfitting [0.0]
We look for optimal network parameters by applying gradient descent to a regularized loss function.
Within this framework, the optimal neuron-interaction matrices correspond to Hebbian kernels revised by a reiterated unlearning protocol.
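A classical form of such an unlearning protocol can be sketched as follows (assumptions mine; the paper analyzes a related "dreaming" kernel analytically rather than this exact procedure):

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 200, 20
xi = rng.choice([-1, 1], size=(P, N)).astype(float)   # stored patterns
J = xi.T @ xi / N                                     # Hebbian kernel
np.fill_diagonal(J, 0.0)

def relax(s, J, steps=50):
    # Synchronous zero-temperature dynamics toward a fixed point.
    for _ in range(steps):
        s = np.sign(J @ s + 1e-12)
    return s

eps = 0.01
for _ in range(100):                                  # reiterated unlearning
    s = relax(rng.choice([-1, 1], size=N).astype(float), J)
    J -= (eps / N) * np.outer(s, s)                   # unlearn reached attractor
    np.fill_diagonal(J, 0.0)

# Stored patterns should remain (near-)fixed points after unlearning.
overlaps = [np.mean(relax(p, J) == p) for p in xi]
print("mean overlap with stored patterns:", round(float(np.mean(overlaps)), 3))
```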
arXiv Detail & Related papers (2023-08-01T15:04:30Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Learning Stochastic Graph Neural Networks with Constrained Variance [18.32587282139282]
Stochastic graph neural networks (SGNNs) are information processing architectures that learn representations from data over random graphs.
We propose a variance-constrained optimization problem for SGNNs, balancing the expected performance and the deviation.
An alternating primal-dual learning procedure solves the problem by updating the SGNN parameters with gradient descent and the dual variable with gradient ascent.
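Such a variance-constrained problem can generically be handled with alternating primal-dual updates. Here is a hedged sketch on a toy stochastic loss (the objective, budget, and step sizes are stand-ins of mine, not the paper's SGNN formulation):

```python
import numpy as np

rng = np.random.default_rng(6)
theta, lam = np.zeros(5), 0.0
eps_var = 0.5                                  # variance budget (assumed)
lr_theta, lr_lam = 0.05, 0.05

def sample_loss_grad(theta, k=64):
    # Toy stochastic loss: quadratic with random per-sample targets.
    t = rng.standard_normal((k, theta.size))
    losses = np.sum((theta - t) ** 2, axis=1)
    grads = 2 * (theta - t)                    # per-sample gradients
    return losses, grads

for step in range(500):
    losses, grads = sample_loss_grad(theta)
    mean, var = losses.mean(), losses.var()
    # Lagrangian: L = E[loss] + lam * (Var[loss] - eps_var)
    g_mean = grads.mean(axis=0)
    g_var = 2 * ((losses - mean)[:, None] * (grads - g_mean)).mean(axis=0)
    theta -= lr_theta * (g_mean + lam * g_var)         # primal descent
    lam = max(0.0, lam + lr_lam * (var - eps_var))     # dual ascent, lam >= 0

print(f"E[loss]={mean:.3f}  Var[loss]={var:.3f}  lam={lam:.3f}")
```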
arXiv Detail & Related papers (2022-01-29T15:55:58Z) - The curse of overparametrization in adversarial training: Precise
analysis of robust generalization for random features regression [34.35440701530876]
We show that for adversarially trained random features models, high overparametrization can hurt robust generalization.
Our theory reveals this nontrivial effect of overparametrization on robustness.
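To make the adversarial objective concrete, here is a simplified sketch (my own: a plain linear model under an l2-bounded input perturbation, for which the worst case is closed-form, rather than the paper's random-features model):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, eps = 100, 300, 0.3
beta = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ beta + 0.2 * rng.standard_normal(n)

w = np.zeros(d)
for _ in range(2000):
    r = y - X @ w
    # Worst-case l2 perturbation of each x_i inflates |residual| by eps*||w||,
    # so the adversarial loss is mean (|r_i| + eps * ||w||)^2.
    nw = np.linalg.norm(w) + 1e-12
    adv = np.abs(r) + eps * nw
    grad = 2 * np.mean(adv[:, None] * (-np.sign(r)[:, None] * X + eps * w / nw),
                       axis=0)
    w -= 0.01 * grad

Xt = rng.standard_normal((2000, d))
rt = Xt @ beta + 0.2 * rng.standard_normal(2000) - Xt @ w
robust_test = np.mean((np.abs(rt) + eps * np.linalg.norm(w)) ** 2)
print(f"robust test risk: {robust_test:.3f}  (standard: {np.mean(rt**2):.3f})")
```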
arXiv Detail & Related papers (2022-01-13T18:57:30Z) - Fractal Structure and Generalization Properties of Stochastic
Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded in terms of the complexity of the fractal structure that underlies its generalization measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - Understanding Implicit Regularization in Over-Parameterized Single Index
Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of stochasticity in its success remains unclear.
We show that heavy tails commonly arise in the model parameters due to multiplicative noise.
A detailed analysis is conducted in which we describe how key factors, such as step size and data properties, influence this behavior, with similar results observed on state-of-the-art neural network models.
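The multiplicative mechanism already shows up in one-dimensional SGD on least squares. A small probe (my construction; the step size is chosen deliberately to put the iterates in a heavy-tailed regime):

```python
import numpy as np

rng = np.random.default_rng(8)
T, lr = 200_000, 0.5                      # constant step size, batch size 1
xs = rng.standard_normal(T)
ys = 0.5 * rng.standard_normal(T)         # true weight is 0, additive noise

w, traj = 0.0, np.empty(T)
for t in range(T):
    # SGD on 0.5*(w*x - y)^2 rewrites as a *multiplicative* recursion:
    # w <- (1 - lr * x^2) * w + lr * x * y
    w = (1.0 - lr * xs[t] ** 2) * w + lr * xs[t] * ys[t]
    traj[t] = w

tail = traj[1000:]                        # discard burn-in
# Excess kurtosis >> 0 signals heavier-than-Gaussian stationary fluctuations.
k = np.mean((tail - tail.mean()) ** 4) / tail.var() ** 2 - 3.0
print(f"std={tail.std():.3f}  excess kurtosis={k:.1f}  (Gaussian would be ~0)")
```

Each step multiplies the parameter by the random factor (1 - lr*x^2), a Kesten-type recursion known to produce power-law tails when the step size is large.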
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.