IntSGD: Floatless Compression of Stochastic Gradients
- URL: http://arxiv.org/abs/2102.08374v1
- Date: Tue, 16 Feb 2021 18:58:57 GMT
- Title: IntSGD: Floatless Compression of Stochastic Gradients
- Authors: Konstantin Mishchenko and Bokun Wang and Dmitry Kovalev and Peter Richtárik
- Abstract summary: We propose a family of lossy integer compressions for Stochastic Gradient Descent (SGD) that do not communicate a single float; the vectors are multiplied by a number known to every device and then rounded to integers.
We show that when the data is significantly heterogeneous, it may become increasingly hard to keep the integers bounded, and we propose an alternative algorithm, IntDIANA, to solve this type of problem.
- Score: 19.99615698375829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a family of lossy integer compressions for Stochastic Gradient
Descent (SGD) that do not communicate a single float. This is achieved by
multiplying floating-point vectors with a number known to every device and then
rounding to an integer number. Our theory shows that the iteration complexity
of SGD does not change up to constant factors when the vectors are scaled
properly. Moreover, this holds for both convex and non-convex functions, with
and without overparameterization. In contrast to other compression-based
algorithms, ours preserves the convergence rate of SGD even on non-smooth
problems. Finally, we show that when the data is significantly heterogeneous,
it may become increasingly hard to keep the integers bounded and propose an
alternative algorithm, IntDIANA, to solve this type of problem.
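The scale-then-round mechanism described in the abstract can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the function names, the fixed scaling factor `alpha`, and the use of randomized rounding are choices made for the example only.

```python
import numpy as np

def int_compress(v: np.ndarray, alpha: float, rng: np.random.Generator) -> np.ndarray:
    """Scale a float vector by a factor known to every device, then round
    randomly to integers so that only integers travel over the network."""
    scaled = alpha * v
    low = np.floor(scaled)
    # Randomized rounding keeps the compression unbiased in expectation.
    return (low + (rng.random(v.shape) < (scaled - low))).astype(np.int64)

def int_decompress(q: np.ndarray, alpha: float) -> np.ndarray:
    """Recover an approximate float vector on the receiving side."""
    return q.astype(np.float64) / alpha

# Toy round trip: each worker compresses its stochastic gradient to integers,
# the server averages the decompressed vectors.
rng = np.random.default_rng(0)
alpha = 1e4                                              # shared scaling factor
grads = [rng.normal(size=5) for _ in range(4)]           # per-worker gradients
ints = [int_compress(g, alpha, rng) for g in grads]      # what is communicated
avg = np.mean([int_decompress(q, alpha) for q in ints], axis=0)
print(np.max(np.abs(avg - np.mean(grads, axis=0))))      # small rounding error
```

Because `alpha` is known to every device, no floats have to be communicated; a larger `alpha` trades more bits per integer for less rounding error.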
Related papers
- Demystifying the Myths and Legends of Nonconvex Convergence of SGD [17.445810977264067]
Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems.
Through our analyses, we address certain myths and legends related to the non-convergence of the gradient.
arXiv Detail & Related papers (2023-10-19T17:58:59Z) - On Convergence of Incremental Gradient for Non-Convex Smooth Functions [63.51187646914962]
In machine learning and network optimization, algorithms like shuffled SGD are popular because they minimize the number of cache misses and make good use of the cache.
This paper delves into the convergence properties of SGD algorithms with arbitrary data ordering.
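As a rough illustration of the data-ordering distinction mentioned in this entry (and in the random-shuffling entry further below), here is a minimal sketch contrasting without-replacement (shuffled) and with-replacement sampling; the function names and the toy least-squares objective are assumptions made for the example, not taken from the paper.

```python
import numpy as np

def sgd_epoch_shuffled(w, X, y, lr, grad_fn, rng):
    """One epoch of shuffled (without-replacement) SGD: every example is
    visited exactly once per epoch, in a random order."""
    for i in rng.permutation(len(X)):
        w = w - lr * grad_fn(w, X[i], y[i])
    return w

def sgd_epoch_with_replacement(w, X, y, lr, grad_fn, rng):
    """Same number of steps, but indices are drawn i.i.d. with replacement."""
    for i in rng.integers(0, len(X), size=len(X)):
        w = w - lr * grad_fn(w, X[i], y[i])
    return w

# Toy least-squares gradient for a single example (x, t).
grad_fn = lambda w, x, t: (x @ w - t) * x
rng = np.random.default_rng(1)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = np.zeros(3)
for _ in range(20):
    w = sgd_epoch_shuffled(w, X, y, lr=0.01, grad_fn=grad_fn, rng=rng)
```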
arXiv Detail & Related papers (2023-05-30T17:47:27Z) - Multi-block-Single-probe Variance Reduced Estimator for Coupled
Compositional Optimization [49.58290066287418]
We propose a novel method named Multi-block-Single-probe Variance Reduced (MSVR) estimator to alleviate the complexity of coupled compositional problems.
Our results improve upon prior ones in several aspects, including the order of sample complexities and the dependence on the strong convexity parameter.
arXiv Detail & Related papers (2022-07-18T12:03:26Z) - On the Convergence to a Global Solution of Shuffling-Type Gradient
Algorithms [18.663264755108703]
The stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks.
In this paper, we show that shuffling-type SGD achieves the desired computational complexity, comparable to that of the convex setting.
arXiv Detail & Related papers (2022-06-13T01:25:59Z) - On the Convergence of mSGD and AdaGrad for Stochastic Optimization [0.696125353550498]
Stochastic gradient descent (SGD) has been intensively developed and extensively applied in machine learning in the past decade.
Some modified SGD-type algorithms, such as momentum-based SGD (mSGD) and adaptive gradient optimization (AdaGrad), outperform SGD in many competitions and applications in terms of convergence rate and accuracy.
We focus on convergence analysis of mSGD and AdaGrad for any smooth (possibly non-convex) loss functions in machine learning.
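For reference, the two update rules named in this entry can be written down in a few lines. These are the textbook forms of momentum SGD and AdaGrad with illustrative hyperparameters, not the specific variants or constants analyzed in the paper.

```python
import numpy as np

def msgd_step(w, g, m, lr=0.01, beta=0.9):
    """Momentum SGD: accumulate a decaying sum of past gradients and step along it."""
    m = beta * m + g
    return w - lr * m, m

def adagrad_step(w, g, s, lr=0.1, eps=1e-8):
    """AdaGrad: accumulate squared gradients coordinate-wise and scale the
    step by their inverse square root."""
    s = s + g * g
    return w - lr * g / (np.sqrt(s) + eps), s

# Toy run on f(w) = 0.5 * ||w||^2, whose gradient at w is simply w.
w_m, m = np.ones(3), np.zeros(3)
w_a, s = np.ones(3), np.zeros(3)
for _ in range(100):
    w_m, m = msgd_step(w_m, w_m, m)      # gradient of f at w_m is w_m itself
    w_a, s = adagrad_step(w_a, w_a, s)
print(np.linalg.norm(w_m), np.linalg.norm(w_a))  # both norms decrease from 1
```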
arXiv Detail & Related papers (2022-01-26T22:02:21Z) - Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned
Problems [55.40911408462676]
We show that without-replacement SGD does not significantly improve on with-replacement SGD in terms of worst-case bounds.
Since many problems in machine learning and other areas are both ill-conditioned and involve large datasets, this indicates that without-replacement sampling does not necessarily improve over with-replacement sampling for realistic computation budgets.
arXiv Detail & Related papers (2021-06-12T23:07:27Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $\mathcal{O}\big(1/\sqrt{tn}\big)$ bound, which is on the order of the intrinsic excess error of a learning model, and we prove that the same order of excess error is attained for the considered classes of loss functions.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Provably Convergent Working Set Algorithm for Non-Convex Regularized
Regression [0.0]
This paper proposes a working set algorithm for non-convex regularizers with convergence guarantees.
Our results demonstrate a high computational gain compared to the full-problem solver, for both block-coordinate and gradient solvers.
arXiv Detail & Related papers (2020-06-24T07:40:31Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for the step-size policies considered.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z)