Rethinking gradient sparsification as total error minimization
- URL: http://arxiv.org/abs/2108.00951v1
- Date: Mon, 2 Aug 2021 14:52:42 GMT
- Title: Rethinking gradient sparsification as total error minimization
- Authors: Atal Narayan Sahu (1), Aritra Dutta (1), Ahmed M. Abdelmoniem (1),
Trambak Banerjee (2), Marco Canini (1), Panos Kalnis (1) ((1) KAUST, (2)
University of Kansas)
- Abstract summary: Gradient compression is a widely-established remedy to tackle the communication bottleneck in distributed training of deep neural networks (DNNs).
We argue that to further the benefits of gradient sparsification, especially for DNNs, a different perspective is necessary -- one that moves from per-iteration optimality to consider optimality for the entire training.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient compression is a widely-established remedy to tackle the
communication bottleneck in distributed training of large deep neural networks
(DNNs). Under the error-feedback framework, Top-$k$ sparsification, sometimes
with $k$ as little as $0.1\%$ of the gradient size, enables training to the
same model quality as the uncompressed case for a similar iteration count. From
the optimization perspective, we find that Top-$k$ is the communication-optimal
sparsifier given a per-iteration $k$ element budget. We argue that to further
the benefits of gradient sparsification, especially for DNNs, a different
perspective is necessary -- one that moves from per-iteration optimality to
consider optimality for the entire training.
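To make the per-iteration setup concrete, below is a minimal NumPy sketch of Top-$k$ sparsification inside an error-feedback loop. The function names (top_k_sparsify, error_feedback_step), the flattened single-worker view, and the plain-NumPy style are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def top_k_sparsify(grad, k):
    # Keep only the k largest-magnitude coordinates of the (flattened) gradient.
    sparse = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    sparse[idx] = grad[idx]
    return sparse

def error_feedback_step(grad, memory, k):
    # Error feedback: add the locally accumulated compression error to the
    # fresh gradient, compress the corrected gradient, and carry whatever
    # was not transmitted as the error memory for the next iteration.
    corrected = grad + memory
    transmitted = top_k_sparsify(corrected, k)
    new_memory = corrected - transmitted
    return transmitted, new_memory
```

Under this scheme each worker transmits exactly $k$ elements per iteration, which is the fixed per-iteration budget for which the abstract identifies Top-$k$ as communication-optimal.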
We identify that the total error -- the sum of the compression errors for all
iterations -- encapsulates sparsification throughout training. Then, we propose
a communication complexity model that minimizes the total error under a
communication budget for the entire training. We find that the hard-threshold
sparsifier, a variant of the Top-$k$ sparsifier with $k$ determined by a
constant hard-threshold, is the optimal sparsifier for this model. Motivated by
this, we provide convex and non-convex convergence analyses for the
hard-threshold sparsifier with error-feedback. Unlike the Top-$k$ sparsifier,
we show that hard-threshold has the same asymptotic convergence and linear
speedup property as SGD in the convex case and has no impact on the
data-heterogeneity in the non-convex case. Our diverse experiments on various
DNNs and a logistic regression model demonstrate that the hard-threshold
sparsifier is more communication-efficient than Top-$k$.
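For contrast, here is a hedged sketch of the hard-threshold sparsifier described above: instead of a fixed $k$, every coordinate whose magnitude exceeds a constant threshold $\lambda$ is transmitted, so the number of sent elements adapts over training. The function names and the per-step error bookkeeping are illustrative assumptions under the same simplified single-worker view as before.

```python
import numpy as np

def hard_threshold_sparsify(grad, lam):
    # Transmit every coordinate whose magnitude is at least the constant
    # threshold lam; the number of transmitted elements varies per iteration.
    mask = np.abs(grad) >= lam
    return np.where(mask, grad, 0.0), int(mask.sum())

def hard_threshold_ef_step(grad, memory, lam):
    # Same error-feedback loop as before; the per-iteration compression error
    # is what accumulates into the total error over training.
    corrected = grad + memory
    transmitted, sent = hard_threshold_sparsify(corrected, lam)
    new_memory = corrected - transmitted
    error_sq = float(np.sum(new_memory ** 2))  # this iteration's compression error
    return transmitted, new_memory, sent, error_sq
```

Summing error_sq over all iterations gives the total error that the communication complexity model minimizes, while summing sent gives the communication spent over the entire run rather than per iteration.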
Related papers
- Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback [75.29048190099523]
Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions.
In this paper, we design a fully adaptive OGD algorithm, \textsf{AdaOGD}, that does not require a priori knowledge of these parameters.
arXiv Detail & Related papers (2023-10-21T18:38:13Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach [57.92727189589498]
We propose an online convex optimization approach with two different levels of adaptivity.
We obtain $\mathcal{O}(\log V_T)$, $\mathcal{O}(d \log V_T)$ and $\hat{\mathcal{O}}(\sqrt{V_T})$ regret bounds for strongly convex, exp-concave and convex loss functions.
arXiv Detail & Related papers (2023-07-17T09:55:35Z) - Variance-reduced Clipping for Non-convex Optimization [24.765794811146144]
Gradient clipping is a technique used in deep learning applications such as large-scale language modeling.
Recent experiments suggest that clipped training exhibits a fairly special behavior in that it mitigates the order of complexity.
arXiv Detail & Related papers (2023-03-02T00:57:38Z) - Adaptive Top-K in SGD for Communication-Efficient Distributed Learning [14.867068493072885]
This paper proposes a novel adaptive Top-K in SGD framework that enables an adaptive degree of sparsification for each gradient descent step to optimize the convergence performance.
numerical results on the MNIST and CIFAR-10 datasets demonstrate that the proposed adaptive Top-K algorithm in SGD achieves a significantly better convergence rate compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-10-24T18:33:35Z) - High-Dimensional Inference over Networks: Linear Convergence and
Statistical Guarantees [20.701475313495884]
We study a sparse linear regression over a network of agents, modeled as an undirected graph and no server node.
We analyze the convergence rate and statistical guarantees of a distributed projected gradient tracking-based algorithm.
arXiv Detail & Related papers (2022-01-21T01:26:08Z) - Near-Optimal Sparse Allreduce for Distributed Deep Learning [18.99898181586806]
Communication overhead is one of the major obstacles to train large deep learning models at scale.
This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients.
arXiv Detail & Related papers (2022-01-19T13:56:57Z) - Distributed Sparse Regression via Penalization [5.990069843501885]
We study linear regression over a network of agents, modeled as an undirected graph (with no centralized node)
The estimation problem is formulated as the minimization of the sum of the local LASSO loss functions plus a quadratic penalty of the consensus constraint.
We show that the proximal-gradient algorithm applied to the penalized problem converges linearly up to a tolerance of the order of the centralized statistical error.
arXiv Detail & Related papers (2021-11-12T01:51:50Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent and provide an improved analysis under a more nuanced condition on the noise of gradients.
arXiv Detail & Related papers (2021-08-25T21:30:27Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)