Local SGD Accelerates Convergence by Exploiting Second Order Information
of the Loss Function
- URL: http://arxiv.org/abs/2305.15013v2
- Date: Fri, 26 May 2023 05:18:28 GMT
- Title: Local SGD Accelerates Convergence by Exploiting Second Order Information
of the Loss Function
- Authors: Linxuan Pan, Shenghui Song
- Abstract summary: Local stochastic gradient descent (L-SGD) has been proven to be very effective in distributed machine learning schemes.
In this paper, we offer a new perspective to understand the strength of L-SGD.
- Score: 1.7767466724342065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With multiple iterations of updates, local stochastic gradient descent
(L-SGD) has been proven to be very effective in distributed machine learning
schemes such as federated learning. In fact, many innovative works have shown
that L-SGD with independent and identically distributed (IID) data can even
outperform SGD. As a result, extensive efforts have been made to unveil the
power of L-SGD. However, existing analyses fail to explain why the multiple
local updates with small mini-batches of data (L-SGD) cannot be replaced by a
single update with one big batch of data and a larger learning rate (SGD). In this
paper, we offer a new perspective to understand the strength of L-SGD. We
theoretically prove that, with IID data, L-SGD can effectively explore the
second order information of the loss function. In particular, compared with
SGD, the updates of L-SGD have much larger projection on the eigenvectors of
the Hessian matrix with small eigenvalues, which leads to faster convergence.
Under certain conditions, L-SGD can even approach the Newton method. Experimental
results on two popular datasets validate the theoretical analysis.
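To make the projection argument above concrete, here is a minimal sketch in Python/NumPy (not the authors' code): on a toy quadratic loss with a fixed Hessian, it compares one big-batch SGD step using a K-times larger learning rate against K small local steps, and prints each update's normalized projection onto the Hessian eigenvectors next to the Newton direction. The Hessian spectrum, step size eta, step count K, and noise level are all illustrative assumptions.

```python
# Minimal illustrative sketch (assumed setup, not the paper's code):
# toy quadratic loss f(w) = 0.5 * (w - w_star)^T H (w - w_star).
import numpy as np

rng = np.random.default_rng(0)
d = 10
eigvals = np.logspace(-2, 1, d)                      # assumed Hessian spectrum
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]     # random orthonormal eigenvectors
H = Q @ np.diag(eigvals) @ Q.T                       # symmetric PSD Hessian
w_star = rng.standard_normal(d)
w0 = np.zeros(d)
eta, K, noise_std = 0.05, 20, 0.01                   # illustrative hyperparameters

def noisy_grad(w):
    """Gradient of the quadratic plus small IID noise as a mini-batch proxy."""
    return H @ (w - w_star) + noise_std * rng.standard_normal(d)

# One big-batch SGD step with a K-times larger learning rate.
sgd_update = -(K * eta) * noisy_grad(w0)

# K local steps with the small learning rate (single worker, IID data).
w = w0.copy()
for _ in range(K):
    w = w - eta * noisy_grad(w)
lsgd_update = w - w0

def normalized_proj(u):
    """Magnitude of the projection onto each Hessian eigenvector, direction only."""
    return np.abs(Q.T @ u) / np.linalg.norm(u)

newton = w_star - w0  # for a quadratic, the Newton step jumps straight to w_star
print("eigenvalue | SGD proj | L-SGD proj | Newton proj")
for lam, a, b, c in zip(eigvals, normalized_proj(sgd_update),
                        normalized_proj(lsgd_update), normalized_proj(newton)):
    print(f"{lam:10.3f} | {a:8.3f} | {b:10.3f} | {c:11.3f}")
```

On this toy problem, the single large step over-weights the large-eigenvalue directions, while the accumulated local update keeps relatively more mass on the small-eigenvalue eigenvectors and tracks the Newton direction more closely, mirroring the abstract's claim.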
Related papers
- The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication [37.210933391984014]
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice.
We provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions.
We also show the min-max optimality of accelerated mini-batch SGD for several problem classes.
arXiv Detail & Related papers (2024-05-19T20:20:03Z)
- Why (and When) does Local SGD Generalize Better than SGD? [46.993699881100454]
Local SGD is a communication-efficient variant of SGD for large-scale training.
This paper aims to understand why (and when) Local SGD generalizes better based on a Stochastic Differential Equation (SDE) approximation.
arXiv Detail & Related papers (2023-03-02T12:56:52Z)
- DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem.
We show that our proposed algorithm can improve the worst distribution test accuracy by up to $10\%$.
arXiv Detail & Related papers (2022-08-29T18:01:42Z)
- Heavy-Tail Phenomenon in Decentralized SGD [33.63000461985398]
We study the emergence of heavy tails in decentralized stochastic gradient descent (DE-SGD).
We also investigate the effect of decentralization on the tail behavior.
Our theory uncovers an interesting interplay between the tails and the network structure.
arXiv Detail & Related papers (2022-05-13T14:47:04Z)
- Learning Mixtures of Linear Dynamical Systems [94.49754087817931]
We develop a two-stage meta-algorithm to efficiently recover each ground-truth LDS model up to error $\tilde{O}(\sqrt{d/T})$.
We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.
arXiv Detail & Related papers (2022-01-26T22:26:01Z)
- SGD with a Constant Large Learning Rate Can Converge to Local Maxima [4.014524824655106]
We construct worst-case optimization problems illustrating that gradient descent can exhibit strange and potentially undesirable behaviors.
Specifically, we construct landscapes and data distributions such that SGD converges to local maxima.
Our results highlight the importance of simultaneously analyzing the minibatch sampling, discrete-time update rules, and realistic landscapes.
arXiv Detail & Related papers (2021-07-25T10:12:18Z)
- SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs [30.41773138781369]
We present a multi-epoch variant of Stochastic Gradient Descent (SGD) commonly used in practice.
We prove that this is at least as good as single pass SGD in the worst case.
For certain SCO problems, taking multiple passes over the dataset can significantly outperform single pass SGD.
arXiv Detail & Related papers (2021-07-11T15:50:01Z)
- Why Does Multi-Epoch Training Help? [62.946840431501855]
Empirically, it has been observed that taking more than one pass over the training data (multi-pass SGD) yields a much better excess risk bound than SGD taking only one pass over the training data (one-pass SGD).
In this paper, we provide some theoretical evidence explaining why multiple passes over the training data can help improve performance under certain circumstances.
arXiv Detail & Related papers (2021-05-13T00:52:25Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Variance Reduced Local SGD with Lower Communication Complexity [52.44473777232414]
We propose Variance Reduced Local SGD (VRL-SGD) to further reduce the communication complexity.
VRL-SGD achieves a linear iteration speedup with a lower communication complexity $O(T^{\frac{1}{2}} N^{\frac{3}{2}})$ even if workers access non-identical datasets.
arXiv Detail & Related papers (2019-12-30T08:15:21Z)