Optimization and Generalization of Regularization-Based Continual
Learning: a Loss Approximation Viewpoint
- URL: http://arxiv.org/abs/2006.10974v3
- Date: Mon, 8 Feb 2021 23:50:16 GMT
- Title: Optimization and Generalization of Regularization-Based Continual
Learning: a Loss Approximation Viewpoint
- Authors: Dong Yin, Mehrdad Farajtabar, Ang Li, Nir Levine, Alex Mott
- Abstract summary: We provide a novel viewpoint of regularization-based continual learning by formulating it as a second-order Taylor approximation of the loss function of each task.
Based on this viewpoint, we study the optimization aspects (i.e., convergence) as well as generalization properties (i.e., finite-sample guarantees) of regularization-based continual learning.
- Score: 35.5156045701898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks have achieved remarkable success in many cognitive tasks.
However, when they are trained sequentially on multiple tasks without access to
old data, their performance on early tasks tends to drop significantly. This
problem is often referred to as catastrophic forgetting, a key challenge in
continual learning of neural networks. The regularization-based approach is one
of the primary classes of methods to alleviate catastrophic forgetting. In this
paper, we provide a novel viewpoint of regularization-based continual learning
by formulating it as a second-order Taylor approximation of the loss function
of each task. This viewpoint leads to a unified framework that can be
instantiated to derive many existing algorithms such as Elastic Weight
Consolidation and Kronecker factored Laplace approximation. Based on this
viewpoint, we study the optimization aspects (i.e., convergence) as well as
generalization properties (i.e., finite-sample guarantees) of
regularization-based continual learning. Our theoretical results indicate the
importance of accurate approximation of the Hessian matrix. The experimental
results on several benchmarks provide empirical validation of our theoretical
findings.
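To make the quadratic-penalty viewpoint concrete: after training task k to parameters theta_k^*, the task's loss is replaced by its second-order Taylor surrogate L_k(theta) ~ L_k(theta_k^*) + (1/2)(theta - theta_k^*)^T H_k (theta - theta_k^*), and taking H_k to be the diagonal empirical Fisher recovers an EWC-style regularizer. The PyTorch sketch below illustrates this instantiation; the function names and scaling are illustrative assumptions, not the authors' code.

```python
import torch

def quadratic_penalty(params, anchor, hessian_diag):
    # Second-order Taylor surrogate of a previous task's loss around the
    # anchor theta_k^*. The first-order term is dropped because theta_k^*
    # is (near-)optimal for that task, leaving a pure quadratic penalty.
    return 0.5 * sum(((p - a) ** 2 * h).sum()
                     for p, a, h in zip(params, anchor, hessian_diag))

def fisher_diagonal(model, loader, loss_fn):
    # Diagonal empirical Fisher as a cheap stand-in for the Hessian H_k
    # (the EWC instantiation; Kronecker-factored Laplace approximations
    # fit the same template with a richer H_k).
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    n_batches = 0
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for f, p in zip(fisher, model.parameters()):
            f += p.grad.detach() ** 2
        n_batches += 1
    return [f / n_batches for f in fisher]
```

Training on the next task then minimizes its own loss plus a multiple of this penalty; the theoretical results above suggest that how faithfully `hessian_diag` approximates the true Hessian governs both convergence and the finite-sample guarantees.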
Related papers
- Embedding generalization within the learning dynamics: An approach based-on sample path large deviation theory [0.0]
We consider an empirical risk perturbation based learning problem that exploits methods from a continuous-time perspective.
We provide an estimate in the small noise limit based on the Freidlin-Wentzell theory of large deviations.
We also present a computational algorithm that solves the corresponding variational problem, leading to optimal point estimates.
arXiv Detail & Related papers (2024-08-04T23:31:35Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
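For orientation, the unhinged loss in its original binary-margin form (van Rooyen et al., 2015) is linear in the margin, which is what makes closed-form dynamics tractable; the multiclass variant analyzed in the paper above may differ in detail:

```latex
\[
  \ell_{\mathrm{unh}}\bigl(y, f(x)\bigr) = 1 - y\, f(x),
  \qquad y \in \{-1, +1\}.
\]
```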
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Distributed Continual Learning with CoCoA in High-dimensional Linear
Regression [0.0]
We consider estimation under scenarios where the signals of interest exhibit change of characteristics over time.
In particular, we consider the continual learning problem where different tasks, e.g., data with different distributions, arrive sequentially.
We consider the well-established distributed learning algorithm CoCoA, which distributes the model parameters and the corresponding features over the network.
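As a rough single-machine illustration of the feature-partitioned setup (not the full CoCoA algorithm, whose local solvers and communication pattern are richer), each simulated node below owns one block of coordinates of w together with the matching columns of X, and the parallel block updates are averaged conservatively:

```python
import numpy as np

def feature_partitioned_lstsq(X, y, n_nodes=4, n_rounds=100, lam=1e-2):
    # Emulate feature partitioning: node k owns a block of coordinates and
    # the corresponding columns of X, solves a local ridge subproblem
    # against the shared residual, and the block updates are damped by
    # 1/K, mirroring CoCoA's conservative aggregation.
    n, d = X.shape
    blocks = np.array_split(np.arange(d), n_nodes)
    w = np.zeros(d)
    for _ in range(n_rounds):
        r = y - X @ w                          # residual shared with all nodes
        updates = np.zeros(d)
        for idx in blocks:                     # one local subproblem per node
            Xk = X[:, idx]
            updates[idx] = np.linalg.solve(
                Xk.T @ Xk + lam * np.eye(len(idx)), Xk.T @ r)
        w += updates / n_nodes                 # safe averaging of the blocks
    return w
```

In the continual-learning setting of the paper, successive tasks would arrive as fresh (X, y) batches while w carries over between them.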
arXiv Detail & Related papers (2023-12-04T10:35:46Z) - Regularization, early-stopping and dreaming: a Hopfield-like setup to
address generalization and overfitting [0.0]
We look for optimal network parameters by applying gradient descent to a regularized loss function.
Within this framework, the optimal neuron-interaction matrices correspond to Hebbian kernels revised by a reiterated unlearning protocol.
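For background, the iterative unlearning ("dreaming") protocol that this line of work revisits can be sketched as follows; the paper itself derives a closed-form revised Hebbian kernel rather than running this loop, and the step size and counts here are arbitrary illustrative choices:

```python
import numpy as np

def hebbian_matrix(patterns):
    # Hebbian couplings J = (1/N) * sum_mu xi^mu (xi^mu)^T, zero diagonal;
    # `patterns` has shape (P, N) with entries in {-1, +1}.
    N = patterns.shape[1]
    J = patterns.T @ patterns / N
    np.fill_diagonal(J, 0.0)
    return J

def unlearning_round(J, eps=0.01, n_dreams=100, n_steps=50, seed=0):
    # Relax from random states to attractors and slightly weaken them,
    # so spurious mixtures are suppressed relative to stored patterns.
    rng = np.random.default_rng(seed)
    N = J.shape[0]
    for _ in range(n_dreams):
        s = rng.choice([-1.0, 1.0], size=N)
        for _ in range(n_steps):               # synchronous relaxation
            s = np.sign(J @ s + 1e-12)
        J -= eps * np.outer(s, s) / N          # unlearn the reached attractor
        np.fill_diagonal(J, 0.0)
    return J
```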
arXiv Detail & Related papers (2023-08-01T15:04:30Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - On the generalization of learning algorithms that do not converge [54.122745736433856]
Generalization analyses of deep learning typically assume that the training converges to a fixed point.
Recent results indicate that in practice, the weights of deep neural networks optimized with gradient descent often oscillate indefinitely.
arXiv Detail & Related papers (2022-08-16T21:22:34Z) - Learning Non-Vacuous Generalization Bounds from Optimization [8.294831479902658]
We present a simple yet non-vacuous generalization bound from the optimization perspective.
We achieve this goal by leveraging the fact that the hypothesis set accessed by gradient algorithms is essentially fractal-like.
Numerical studies demonstrate that our approach is able to yield plausible generalization guarantees for modern neural networks.
arXiv Detail & Related papers (2022-06-09T08:59:46Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex training problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z) - Density Fixing: Simple yet Effective Regularization Method based on the
Class Prior [2.3859169601259347]
We propose a framework of regularization methods, called density-fixing, that can be used for both supervised and semi-supervised learning.
Our proposed regularization method improves generalization performance by forcing the model to approximate the class prior distribution, i.e., the frequency of occurrence of each class.
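One plausible reading of the density-fixing penalty (an assumption; the exact form in the paper may differ) is a KL term pulling the batch-averaged predicted class distribution toward the known class prior, with `alpha` as an illustrative weight:

```python
import torch
import torch.nn.functional as F

def density_fixing_loss(logits, targets, class_prior, alpha=0.1):
    # Standard cross-entropy plus a penalty forcing the model's average
    # predicted class distribution toward the known prior frequencies.
    ce = F.cross_entropy(logits, targets)
    marginal = F.softmax(logits, dim=1).mean(dim=0)   # predicted class prior
    kl = torch.sum(class_prior * (class_prior / marginal).log())
    return ce + alpha * kl
```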
arXiv Detail & Related papers (2020-07-08T04:58:22Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
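Tracking the Hessian norm does not require materializing the Hessian; a generic power-iteration estimate built on Hessian-vector products (a standard technique, not necessarily the authors' exact estimator) looks like this:

```python
import torch

def hessian_spectral_norm(loss, params, n_iter=20):
    # Estimate ||H||_2 of the loss Hessian by power iteration on
    # Hessian-vector products; the Hessian itself is never formed.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat)
    v /= v.norm()
    sigma = 0.0
    for _ in range(n_iter):
        hv = torch.autograd.grad(flat @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv]).detach()
        sigma = hv.norm()       # with ||v|| = 1, ||Hv|| -> |lambda_max|
        v = hv / (sigma + 1e-12)
    return sigma
```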
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.