Optimal Rates in Continual Linear Regression via Increasing Regularization
- URL: http://arxiv.org/abs/2506.06501v1
- Date: Fri, 06 Jun 2025 19:51:14 GMT
- Title: Optimal Rates in Continual Linear Regression via Increasing Regularization
- Authors: Ran Levinstein, Amit Attia, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, Itay Evron
- Abstract summary: We study realizable continual linear regression under random task orderings. In this setup, the worst-case expected loss after $k$ learning iterations admits a lower bound of $\Omega(1/k)$. We use two frequently used regularization schemes: explicit isotropic $\ell_2$ regularization, and implicit regularization via finite step budgets.
- Score: 36.31850682655034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study realizable continual linear regression under random task orderings, a common setting for developing continual learning theory. In this setup, the worst-case expected loss after $k$ learning iterations admits a lower bound of $\Omega(1/k)$. However, prior work using an unregularized scheme has only established an upper bound of $O(1/k^{1/4})$, leaving a significant gap. Our paper proves that this gap can be narrowed, or even closed, using two frequently used regularization schemes: (1) explicit isotropic $\ell_2$ regularization, and (2) implicit regularization via finite step budgets. We show that these approaches, which are used in practice to mitigate forgetting, reduce to stochastic gradient descent (SGD) on carefully defined surrogate losses. Through this lens, we identify a fixed regularization strength that yields a near-optimal rate of $O(\log k / k)$. Moreover, formalizing and analyzing a generalized variant of SGD for time-varying functions, we derive an increasing regularization strength schedule that provably achieves an optimal rate of $O(1/k)$. This suggests that schedules that increase the regularization coefficient or decrease the number of steps per task are beneficial, at least in the worst case.
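To make the regularization schemes concrete, below is a minimal illustrative sketch (not code released with the paper). It simulates realizable continual linear regression under a random task ordering and, for each task, solves a ridge problem centered at the previous iterate, comparing a fixed regularization strength with an increasing schedule. The dimensions, the synthetic data, the specific schedule `lambda t: 0.5 * t`, and the choice to regularize toward the previous solution are assumptions made for illustration; only the general setup (random orderings, explicit isotropic $\ell_2$ regularization) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def continual_ridge(tasks, d, lam_schedule):
    """For each task, start from the previous weights and solve
        min_w ||X_t w - y_t||^2 + lam_t ||w - w_prev||^2,
    an explicit isotropic l2-regularized step biased toward the previous
    iterate. Returns the average loss over all tasks seen so far."""
    w = np.zeros(d)
    avg_losses = []
    for t, (X, y) in enumerate(tasks, start=1):
        lam = lam_schedule(t)
        # Closed-form solution of the regularized least-squares step.
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w)
        # Track forgetting: mean squared loss over the tasks seen so far.
        avg_losses.append(np.mean([np.mean((Xs @ w - ys) ** 2) for Xs, ys in tasks[:t]]))
    return avg_losses

# Synthetic realizable setup: all tasks share a ground-truth w_star, data is
# noiseless, and the k iterations visit tasks in a uniformly random order.
d, n_per_task, n_tasks, k = 20, 5, 10, 200
w_star = rng.standard_normal(d)
pool = []
for _ in range(n_tasks):
    X = rng.standard_normal((n_per_task, d))
    pool.append((X, X @ w_star))
tasks = [pool[rng.integers(n_tasks)] for _ in range(k)]

fixed = continual_ridge(tasks, d, lam_schedule=lambda t: 5.0)           # fixed strength
increasing = continual_ridge(tasks, d, lam_schedule=lambda t: 0.5 * t)  # increasing strength

print(f"average loss after {k} iterations, fixed lambda:      {fixed[-1]:.3e}")
print(f"average loss after {k} iterations, increasing lambda: {increasing[-1]:.3e}")
```

The hypothetical `lam_schedule` argument is only a convenient way to compare a constant coefficient against one that grows with the iteration count, mirroring the paper's suggestion that increasing the regularization coefficient (or shrinking the per-task step budget) helps in the worst case.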
Related papers
- Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization [34.451177321785146]
We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model.<n>We show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting.
arXiv Detail & Related papers (2025-05-13T07:32:48Z) - From Continual Learning to SGD and Back: Better Rates for Continual Linear Models [50.11453013647086]
We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations.<n>We develop novel last-iterate upper bounds in the realizable least squares setup.<n>We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
arXiv Detail & Related papers (2025-04-06T18:39:45Z) - Convergence Rate Analysis of LION [54.28350823319057]
LION converges at a rate of $\mathcal{O}(\sqrt{d}K^{-1/4})$ measured by the gradient $\ell_1$ norm, to a Karush-Kuhn-Tucker (KKT) point.
We show that LION can achieve lower loss and higher performance compared to standard SGD.
arXiv Detail & Related papers (2024-11-12T11:30:53Z) - Distributionally Robust Optimization with Bias and Variance Reduction [9.341215359733601]
We show that Prospect, a gradient-based algorithm, enjoys linear convergence for smooth regularized losses.
We also show that Prospect can converge 2-3$\times$ faster than baselines such as gradient-based methods.
arXiv Detail & Related papers (2023-10-21T00:03:54Z) - Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex Optimization [9.019243171993553]
We study how the number of training steps $T$ and the step-size $\eta$ affect generalization in smooth stochastic convex optimization (SCO) problems.
We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD).
Recent works show better rates can be attained but the improvement is reduced when training time is long.
arXiv Detail & Related papers (2023-03-19T20:24:33Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD).
We show that with the same number of steps, GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ under broad step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Momentum Improves Normalized SGD [51.27183254738711]
We show that adding momentum provably removes the need for large batch sizes on non-convex objectives.
We show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining.
arXiv Detail & Related papers (2020-02-09T07:00:54Z)