Memory-Statistics Tradeoff in Continual Learning with Structural Regularization
- URL: http://arxiv.org/abs/2504.04039v1
- Date: Sat, 05 Apr 2025 03:14:10 GMT
- Title: Memory-Statistics Tradeoff in Continual Learning with Structural Regularization
- Authors: Haoran Li, Jingfeng Wu, Vladimir Braverman
- Abstract summary: We study the statistical performance of a continual learning problem with two linear regression tasks in a well-specified random design setting. We consider a structural regularization algorithm that incorporates a generalized $\ell_2$-regularization tailored to the Hessian of the previous task for mitigating catastrophic forgetting.
- Score: 27.154013172469853
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We study the statistical performance of a continual learning problem with two linear regression tasks in a well-specified random design setting. We consider a structural regularization algorithm that incorporates a generalized $\ell_2$-regularization tailored to the Hessian of the previous task for mitigating catastrophic forgetting. We establish upper and lower bounds on the joint excess risk for this algorithm. Our analysis reveals a fundamental trade-off between memory complexity and statistical efficiency, where memory complexity is measured by the number of vectors needed to define the structural regularization. Specifically, increasing the number of vectors in structural regularization leads to a worse memory complexity but an improved excess risk, and vice versa. Furthermore, our theory suggests that naive continual learning without regularization suffers from catastrophic forgetting, while structural regularization mitigates this issue. Notably, structural regularization achieves comparable performance to joint training with access to both tasks simultaneously. These results highlight the critical role of curvature-aware regularization for continual learning.
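As a rough illustration of the setup described in the abstract, the sketch below fits a second linear regression task with a generalized $\ell_2$ penalty built from a rank-$k$ approximation of the first task's Hessian, so that the memory footprint grows with the number $k$ of stored vectors. This is a minimal sketch under assumptions introduced here, not the paper's exact algorithm: the function name, the penalty strength `lam`, and the top-$k$ eigendecomposition are all illustrative choices.

```python
import numpy as np

def fit_task2_with_structural_reg(X1, y1, X2, y2, k, lam=1.0):
    """Hypothetical sketch: continual learning on two linear regression tasks.

    Task 1 is solved by least squares.  For task 2 we add a generalized l2
    penalty lam * (w - w1)^T H1_k (w - w1), where H1_k is a rank-k
    approximation of task 1's Hessian X1^T X1 / n1.  The k stored
    eigenvectors (and eigenvalues) are the "memory" of the method.
    """
    n1, d = X1.shape
    H1 = X1.T @ X1 / n1                          # Hessian of task 1's squared loss
    w1 = np.linalg.lstsq(X1, y1, rcond=None)[0]  # task-1 estimate

    # Keep only the top-k eigen-directions of H1: k vectors of memory.
    evals, evecs = np.linalg.eigh(H1)
    top = np.argsort(evals)[::-1][:k]
    H1_k = (evecs[:, top] * evals[top]) @ evecs[:, top].T

    # Regularized task-2 objective:
    #   min_w ||X2 w - y2||^2 / n2 + lam * (w - w1)^T H1_k (w - w1)
    # Closed-form solution of the normal equations (assumes A is invertible,
    # e.g. X2 has full column rank).
    n2 = X2.shape[0]
    A = X2.T @ X2 / n2 + lam * H1_k
    b = X2.T @ y2 / n2 + lam * H1_k @ w1
    return np.linalg.solve(A, b)
```

In this toy version, $k = d$ corresponds to regularizing with the full previous-task Hessian, while $k = 0$ removes the penalty and recovers naive sequential training; the value of `lam` is an arbitrary illustrative choice rather than anything prescribed by the paper.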
Related papers
- On the Geometry of Regularization in Adversarial Training: High-Dimensional Asymptotics and Generalization Bounds [11.30047438005394]
This work investigates the question of how to choose the regularization norm $\lVert \cdot \rVert$ in the context of high-dimensional adversarial training for binary classification.
We quantitatively characterize the relationship between perturbation size and the optimal choice of $\lVert \cdot \rVert$, confirming the intuition that, in the data-scarce regime, the type of regularization becomes increasingly important for adversarial training as perturbations grow in size.
arXiv Detail & Related papers (2024-10-21T14:53:12Z) - Regularization for Adversarial Robust Learning [18.46110328123008]
We develop a novel approach to adversarial training that integrates $\phi$-divergence regularization into the distributionally robust risk function.
This regularization brings a notable improvement in computation compared with the original formulation.
We validate our proposed method in supervised learning, reinforcement learning, and contextual learning and showcase its state-of-the-art performance against various adversarial attacks.
arXiv Detail & Related papers (2024-08-19T03:15:41Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability of VLMs in terms of zero-shot generalization; the overall method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in few-shot image classification scenarios.
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - Structured Prediction in Online Learning [66.36004256710824]
We study a theoretical and algorithmic framework for structured prediction in the online learning setting.
We show that our algorithm is a generalisation of optimal algorithms from the supervised learning setting.
We consider a second algorithm designed especially for non-stationary data distributions, including adversarial data.
arXiv Detail & Related papers (2024-06-18T07:45:02Z) - A Statistical Theory of Regularization-Based Continual Learning [10.899175512941053]
We provide a statistical analysis of regularization-based continual learning on a sequence of linear regression tasks.
We first derive the convergence rate for the oracle estimator obtained as if all data were available simultaneously.
A byproduct of our theoretical analysis is the equivalence between early stopping and generalized $\ell_2$-regularization (a schematic form of this penalty is written out after the list below).
arXiv Detail & Related papers (2024-06-10T12:25:13Z) - On the generalization of learning algorithms that do not converge [54.122745736433856]
Generalization analyses of deep learning typically assume that the training converges to a fixed point.
Recent results indicate that in practice, the weights of deep neural networks optimized with gradient descent often oscillate indefinitely.
arXiv Detail & Related papers (2022-08-16T21:22:34Z) - Optimistic Rates: A Unifying Theory for Interpolation Learning and Regularization in Linear Regression [35.78863301525758]
We study a localized notion of uniform convergence known as an "optimistic rate".
Our refined analysis avoids the hidden constant and logarithmic factor in existing results.
arXiv Detail & Related papers (2021-12-08T18:55:00Z) - Multi-task Learning of Order-Consistent Causal Graphs [59.9575145128345]
We consider the problem of discovering $K$ related Gaussian directed acyclic graphs (DAGs).
Under a multi-task learning setting, we propose an $l_1/l_2$-regularized maximum likelihood estimator (MLE) for learning $K$ linear structural equation models.
We theoretically show that the joint estimator, by leveraging data across related tasks, can achieve a better sample complexity for recovering the causal order.
arXiv Detail & Related papers (2021-11-03T22:10:18Z) - Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded by the "complexity" of the fractal structure that underlies its invariant measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - Fine-grained Generalization Analysis of Vector-valued Learning [28.722350261462463]
We start the generalization analysis of regularized vector-valued learning algorithms by presenting bounds with a mild dependency on the output dimension and a fast rate on the sample size.
To understand the interaction between optimization and learning, we further use our results to derive the first bounds for stochastic gradient descent with vector-valued functions.
As a byproduct, we derive a Rademacher complexity bound for loss function classes defined in terms of a general strongly convex function.
arXiv Detail & Related papers (2021-04-29T07:57:34Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - CASTLE: Regularization via Auxiliary Causal Graph Discovery [89.74800176981842]
We introduce Causal Structure Learning (CASTLE) regularization and propose to regularize a neural network by jointly learning the causal relationships between variables.
CASTLE efficiently reconstructs only the features in the causal DAG that have a causal neighbor, whereas reconstruction-based regularizers suboptimally reconstruct all input features.
arXiv Detail & Related papers (2020-09-28T09:49:38Z)
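As a worked-out form of the curvature-aware penalty that both the main abstract and the "A Statistical Theory of Regularization-Based Continual Learning" entry above refer to, the display below writes a generalized $\ell_2$-regularized objective for the current task. The notation ($\widehat{L}_t$, $w_{t-1}$, $M$, $\lambda$) is generic and introduced here for illustration, not taken verbatim from either paper.

```latex
% Generalized l2 (matrix-weighted) regularization toward the previous task's
% solution w_{t-1}; M is a positive semi-definite weighting matrix, e.g. an
% approximation of the previous task's Hessian built from a few stored vectors.
\min_{w \in \mathbb{R}^d} \;
  \widehat{L}_t(w)
  \;+\; \lambda \, (w - w_{t-1})^{\top} M \, (w - w_{t-1}),
\qquad M \succeq 0 .
```

Taking $M = I$ reduces this to plain $\ell_2$-regularization toward $w_{t-1}$, whereas Hessian-based choices of $M$ penalize movement most strongly along directions that mattered for the previous task, which is the curvature-aware behavior highlighted in the main abstract.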