Generalization and Optimization of SGD with Lookahead
- URL: http://arxiv.org/abs/2509.15776v1
- Date: Fri, 19 Sep 2025 09:02:09 GMT
- Title: Generalization and Optimization of SGD with Lookahead
- Authors: Kangcheng Li, Yunwen Lei,
- Abstract summary: Lookahead enhances deep learning models by employing a dual-weight update mechanism.<n>Most theoretical studies focus on its convergence on training data, leaving its generalization capabilities less understood.
- Score: 20.363815126393884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Lookahead optimizer enhances deep learning models by employing a dual-weight update mechanism, which has been shown to improve the performance of underlying optimizers such as SGD. However, most theoretical studies focus on its convergence on training data, leaving its generalization capabilities less understood. Existing generalization analyses are often limited by restrictive assumptions, such as requiring the loss function to be globally Lipschitz continuous, and their bounds do not fully capture the relationship between optimization and generalization. In this paper, we address these issues by conducting a rigorous stability and generalization analysis of the Lookahead optimizer with minibatch SGD. We leverage on-average model stability to derive generalization bounds for both convex and strongly convex problems without the restrictive Lipschitzness assumption. Our analysis demonstrates a linear speedup with respect to the batch size in the convex setting.
Related papers
- A Simplified Analysis of SGD for Linear Regression with Weight Averaging [64.2393952273612]
Recent work bycitetzou 2021benign provides sharp rates for SGD optimization in linear regression using constant learning rate.<n>We provide a simplified analysis recovering the same bias and variance bounds provided incitepzou 2021benign based on simple linear algebra tools.<n>We believe our work makes the analysis of gradient descent on linear regression very accessible and will be helpful in further analyzing mini-batching and learning rate scheduling.
arXiv Detail & Related papers (2025-06-18T15:10:38Z) - Understanding Generalization of Federated Learning: the Trade-off between Model Stability and Optimization [34.520966684699665]
Federated Learning (FL) is a distributed learning approach that trains machine learning models across multiple devices.<n>This paper introduces an innovative dynamics analysis framework, namely textitLibra, for algorithm generalization performance.<n>We show that larger local steps or momentum accelerate convergence of gradient norms, while worsening model stability.
arXiv Detail & Related papers (2024-11-25T11:43:22Z) - Improved Stability and Generalization Guarantees of the Decentralized SGD Algorithm [33.64407835198723]
We present a new generalization analysis for Decentralized Gradient Descent (D-SGD) based on algorithmic stability.
This new reveals that the choice of graph can in fact improve the worst-case convex and non-connected functions.
arXiv Detail & Related papers (2023-06-05T15:03:01Z) - On Stability and Generalization of Bilevel Optimization Problem [39.662459636180174]
(Stochastic) bilevel optimization is a frequently encountered problem in machine learning with a wide range of applications.
We first provide a connection between stability and error in different forms and give a high probability which improves the previous best one.
We then provide first stability for the where both inner outer level parameters are continuous, while existing allows only the outer level parameter to be updated.
arXiv Detail & Related papers (2022-10-03T16:22:57Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and gradient descent (SGD) to train SNNs, for both of which we develop consistent excess bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent may gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular reinforcement learning (RL) problems.
"vanilla" theoretical understanding of PG trajectory is one of the most popular methods for solving RL problems.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - Learning Prediction Intervals for Regression: Generalization and
Calibration [12.576284277353606]
We study the generation of prediction intervals in regression for uncertainty quantification.
We use a general learning theory to characterize the optimality-feasibility tradeoff that encompasses Lipschitz continuity and VC-subgraph classes.
We empirically demonstrate the strengths of our interval generation and calibration algorithms in terms of testing performances compared to existing benchmarks.
arXiv Detail & Related papers (2021-02-26T17:55:30Z) - SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and
Interpolation [17.199023009789308]
The Expected assumption of SGD (SGD) is being used routinely for non-artisan functions.
In this paper, we show a paradigms for convergence to a smooth non-linear setting.
We also provide theoretical guarantees for different step-size conditions.
arXiv Detail & Related papers (2020-06-18T07:05:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.