Rethinking gradient sparsification as total error minimization
- URL: http://arxiv.org/abs/2108.00951v1
- Date: Mon, 2 Aug 2021 14:52:42 GMT
- Title: Rethinking gradient sparsification as total error minimization
- Authors: Atal Narayan Sahu (1), Aritra Dutta (1), Ahmed M. Abdelmoniem (1),
Trambak Banerjee (2), Marco Canini (1), Panos Kalnis (1) ((1) KAUST, (2)
University of Kansas)
- Abstract summary: Gradient compression is a widely-established remedy to tackle the communication bottleneck in distributed training of deep neural networks (DNNs).
We argue that to further the benefits of gradient sparsification, especially for DNNs, a different perspective is necessary -- one that moves from per-iteration optimality to consider optimality for the entire training.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient compression is a widely-established remedy to tackle the
communication bottleneck in distributed training of large deep neural networks
(DNNs). Under the error-feedback framework, Top-$k$ sparsification, sometimes
with $k$ as little as $0.1\%$ of the gradient size, enables training to the
same model quality as the uncompressed case for a similar iteration count. From
the optimization perspective, we find that Top-$k$ is the communication-optimal
sparsifier given a per-iteration $k$ element budget. We argue that to further
the benefits of gradient sparsification, especially for DNNs, a different
perspective is necessary -- one that moves from per-iteration optimality to
consider optimality for the entire training.
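To make the per-iteration setup concrete, below is a minimal NumPy sketch of Top-$k$ sparsification inside an error-feedback loop. The function names (top_k_sparsify, error_feedback_step), the flattened single-worker view, and the plain-NumPy style are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def top_k_sparsify(grad, k):
    # Keep only the k largest-magnitude coordinates of the (flattened) gradient.
    sparse = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    sparse[idx] = grad[idx]
    return sparse

def error_feedback_step(grad, memory, k):
    # Error feedback: add the locally accumulated compression error to the
    # fresh gradient, compress the corrected gradient, and carry whatever
    # was not transmitted as the error memory for the next iteration.
    corrected = grad + memory
    transmitted = top_k_sparsify(corrected, k)
    new_memory = corrected - transmitted
    return transmitted, new_memory
```

Under this scheme each worker transmits exactly $k$ elements per iteration, which is the fixed per-iteration budget for which the abstract identifies Top-$k$ as communication-optimal.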
We identify that the total error -- the sum of the compression errors for all
iterations -- encapsulates sparsification throughout training. Then, we propose
a communication complexity model that minimizes the total error under a
communication budget for the entire training. We find that the hard-threshold
sparsifier, a variant of the Top-$k$ sparsifier with $k$ determined by a
constant hard-threshold, is the optimal sparsifier for this model. Motivated by
this, we provide convex and non-convex convergence analyses for the
hard-threshold sparsifier with error-feedback. Unlike the Top-$k$ sparsifier,
we show that hard-threshold has the same asymptotic convergence and linear
speedup property as SGD in the convex case and has no impact on the
data-heterogeneity in the non-convex case. Our diverse experiments on various
DNNs and a logistic regression model demonstrate that the hard-threshold
sparsifier is more communication-efficient than Top-$k$.
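For contrast, here is a hedged sketch of the hard-threshold sparsifier described above: instead of a fixed $k$, every coordinate whose magnitude exceeds a constant threshold $\lambda$ is transmitted, so the number of sent elements adapts over training. The function names and the per-step error bookkeeping are illustrative assumptions under the same simplified single-worker view as before.

```python
import numpy as np

def hard_threshold_sparsify(grad, lam):
    # Transmit every coordinate whose magnitude is at least the constant
    # threshold lam; the number of transmitted elements varies per iteration.
    mask = np.abs(grad) >= lam
    return np.where(mask, grad, 0.0), int(mask.sum())

def hard_threshold_ef_step(grad, memory, lam):
    # Same error-feedback loop as before; the per-iteration compression error
    # is what accumulates into the total error over training.
    corrected = grad + memory
    transmitted, sent = hard_threshold_sparsify(corrected, lam)
    new_memory = corrected - transmitted
    error_sq = float(np.sum(new_memory ** 2))  # this iteration's compression error
    return transmitted, new_memory, sent, error_sq
```

Summing error_sq over all iterations gives the total error that the communication complexity model minimizes, while summing sent gives the communication spent over the entire run rather than per iteration.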
Related papers
- Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback [75.29048190099523]
Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions.
In this paper, we design a fully adaptive OGD algorithm, \textsf{AdaOGD}, that does not require a priori knowledge of these parameters.
arXiv Detail & Related papers (2023-10-21T18:38:13Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach [57.92727189589498]
We propose an online convex optimization approach with two different levels of adaptivity.
We obtain $\mathcal{O}(\log V_T)$, $\mathcal{O}(d \log V_T)$ and $\hat{\mathcal{O}}(\sqrt{V_T})$ regret bounds for strongly convex, exp-concave and convex loss functions.
arXiv Detail & Related papers (2023-07-17T09:55:35Z) - Variance-reduced Clipping for Non-convex Optimization [24.765794811146144]
Gradient clipping is a technique used in deep learning applications such as large-scale language modeling.
Recent experiments suggest that clipped training exhibits a fairly special behavior in that it mitigates the order of complexity.
arXiv Detail & Related papers (2023-03-02T00:57:38Z) - Adaptive Top-K in SGD for Communication-Efficient Distributed Learning [14.867068493072885]
This paper proposes a novel adaptive Top-K in SGD framework that enables an adaptive degree of sparsification for each gradient descent step to optimize the convergence performance.
numerical results on the MNIST and CIFAR-10 datasets demonstrate that the proposed adaptive Top-K algorithm in SGD achieves a significantly better convergence rate compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-10-24T18:33:35Z) - High-Dimensional Inference over Networks: Linear Convergence and
Statistical Guarantees [20.701475313495884]
We study a sparse linear regression over a network of agents, modeled as an undirected graph and no server node.
We analyze the convergence rate and statistical guarantees of a distributed projected gradient tracking-based algorithm.
arXiv Detail & Related papers (2022-01-21T01:26:08Z) - Near-Optimal Sparse Allreduce for Distributed Deep Learning [18.99898181586806]
Communication overhead is one of the major obstacles to train large deep learning models at scale.
This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients.
arXiv Detail & Related papers (2022-01-19T13:56:57Z) - Distributed Sparse Regression via Penalization [5.990069843501885]
We study linear regression over a network of agents, modeled as an undirected graph (with no centralized node)
The estimation problem is formulated as the minimization of the sum of the local LASSO loss functions plus a quadratic penalty of the consensus constraint.
We show that the proximal-gradient algorithm applied to the penalized problem converges linearly up to a tolerance of the order of the centralized statistical error.
arXiv Detail & Related papers (2021-11-12T01:51:50Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent and provide an improved analysis under a more nuanced condition on the noise of gradients.
arXiv Detail & Related papers (2021-08-25T21:30:27Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)