Escaping Saddle Points with Compressed SGD
- URL: http://arxiv.org/abs/2105.10090v1
- Date: Fri, 21 May 2021 01:56:43 GMT
- Title: Escaping Saddle Points with Compressed SGD
- Authors: Dmitrii Avdiukhin, Grigory Yaroslavtsev
- Abstract summary: Stochastic gradient descent (SGD) is a prevalent optimization technique for large-scale distributed machine learning.
Prior work shows that SGD augmented with gradient compression converges to an $\varepsilon$-first-order stationary point; this paper extends the result to an $\varepsilon$-second-order stationary point ($\varepsilon$-SOSP).
When the stochastic gradient is not Lipschitz, SGD with the RandomK compressor converges to an $\varepsilon$-SOSP with the same number of iterations as uncompressed SGD.
- Score: 8.014396597444296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic gradient descent (SGD) is a prevalent optimization technique for
large-scale distributed machine learning. While SGD computation can be
efficiently divided between multiple machines, communication typically becomes
a bottleneck in the distributed setting. Gradient compression methods can be
used to alleviate this problem, and a recent line of work shows that SGD
augmented with gradient compression converges to an $\varepsilon$-first-order
stationary point. In this paper we extend these results to convergence to an
$\varepsilon$-second-order stationary point ($\varepsilon$-SOSP), which is to
the best of our knowledge the first result of this type. In addition, we show
that, when the stochastic gradient is not Lipschitz, compressed SGD with
RandomK compressor converges to an $\varepsilon$-SOSP with the same number of
iterations as uncompressed SGD [Jin et al., 2021] (JACM), while improving the
total communication by a factor of $\tilde \Theta(\sqrt{d}
\varepsilon^{-3/4})$, where $d$ is the dimension of the optimization problem.
We present additional results for the cases when the compressor is arbitrary
and when the stochastic gradient is Lipschitz.
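For intuition, here is a minimal sketch of a RandomK compressor and a single compressed-SGD step. It is only an illustration of the general technique described in the abstract, not the exact algorithm analyzed in the paper, and all function names and constants below are assumptions made for the example.

```python
import numpy as np

def random_k(grad: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """RandomK compressor: keep k uniformly random coordinates, zero the rest.

    The kept coordinates are rescaled by d/k so the compressor is unbiased
    (E[random_k(g)] = g); only k (index, value) pairs need to be communicated.
    """
    d = grad.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(grad)
    out[idx] = grad[idx] * (d / k)
    return out

def compressed_sgd_step(x, stochastic_grad, lr, k, rng):
    """One illustrative compressed-SGD step: compress the stochastic gradient,
    then apply a plain gradient update with the compressed vector."""
    g = stochastic_grad(x)       # worker computes a stochastic gradient
    g_hat = random_k(g, k, rng)  # only k coordinates are transmitted
    return x - lr * g_hat

# Toy usage on f(x) = 0.5 * ||x||^2, whose stochastic gradient is x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
for _ in range(2000):
    x = compressed_sgd_step(x, lambda z: z + 0.01 * rng.normal(size=z.size),
                            lr=0.01, k=50, rng=rng)
print(np.linalg.norm(x))  # the norm shrinks even though only 5% of coordinates are sent per step
```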
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- On Convergence of Incremental Gradient for Non-Convex Smooth Functions [63.51187646914962]
In machine learning and network optimization, algorithms like shuffle SGD are popular due to their good cache behavior and small number of cache misses.
This paper delves into the convergence properties of SGD algorithms with arbitrary data ordering.
arXiv Detail & Related papers (2023-05-30T17:47:27Z)
- CEDAS: A Compressed Decentralized Stochastic Gradient Method with Improved Convergence [9.11726703830074]
In this paper, we consider solving the distributed optimization problem under the communication restricted setting.
We present a compressed decentralized stochastic gradient method based on exact diffusion, termed "CEDAS".
arXiv Detail & Related papers (2023-01-14T09:49:15Z)
- ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z)
- Stochastic Gradient Methods with Compressed Communication for Decentralized Saddle Point Problems [0.0]
We develop two compression-based stochastic gradient algorithms to solve a class of non-smooth strongly convex-strongly concave saddle-point problems.
Our first algorithm is a Restart-based Decentralized Proximal Stochastic Gradient method with Compression (C-RDPSG) for general settings.
Next, we present a Decentralized Proximal Stochastic Variance Reduced Gradient algorithm with Compression (C-DPSVRG) for the finite-sum setting.
arXiv Detail & Related papers (2022-05-28T15:17:19Z)
- Communication-Efficient Federated Learning via Quantized Compressed Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves almost identical performance to the case in which no compression is applied.
arXiv Detail & Related papers (2021-11-30T02:13:54Z)
- On the Benefits of Multiple Gossip Steps in Communication-Constrained Decentralized Optimization [29.42301299741866]
We show that $O(\log\frac{1}{\epsilon})$ gradient iterations with constant step size, with $O(\log\frac{1}{\epsilon})$ gossip steps between consecutive iterations, enable convergence to within $\epsilon$ of the optimal value for smooth non-convex objectives under compressed communication.
To our knowledge, this is the first work that derives convergence results for non-convex optimization under compressed communication.
arXiv Detail & Related papers (2020-11-20T21:17:32Z)
- On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD) in non-convex problems.
We show that SGD avoids strict saddle points/manifolds with probability $1$ for a broad class of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
- On Biased Compression for Distributed Learning [55.89300593805943]
We show for the first time that biased compressors can lead to linear convergence rates in both the single-node and distributed settings.
We propose several new biased compressors with promising theoretical guarantees and practical performance.
arXiv Detail & Related papers (2020-02-27T19:52:24Z)
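As a concrete illustration of the biased compressors discussed in the last entry, the sketch below implements the standard TopK operator (keep the k largest-magnitude coordinates). It is a generic textbook example, not one of the new compressors proposed in that paper; unlike the unbiased RandomK sketch above, TopK is biased but satisfies a contraction property commonly assumed in analyses of biased compression.

```python
import numpy as np

def top_k(grad: np.ndarray, k: int) -> np.ndarray:
    """TopK compressor: keep the k largest-magnitude coordinates, zero the rest.

    TopK is *biased* (E[top_k(g)] != g in general), but it satisfies the
    contraction property ||top_k(g) - g||^2 <= (1 - k/d) * ||g||^2.
    """
    out = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -k)[-k:]  # indices of the k largest entries
    out[idx] = grad[idx]
    return out

# Toy check of the contraction property on a random vector.
rng = np.random.default_rng(1)
g = rng.normal(size=1000)
k = 100
err = np.linalg.norm(top_k(g, k) - g) ** 2
print(err <= (1 - k / g.size) * np.linalg.norm(g) ** 2)  # True
```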