A Communication-Efficient Distributed Gradient Clipping Algorithm for
Training Deep Neural Networks
- URL: http://arxiv.org/abs/2205.05040v1
- Date: Tue, 10 May 2022 16:55:33 GMT
- Title: A Communication-Efficient Distributed Gradient Clipping Algorithm for
Training Deep Neural Networks
- Authors: Mingrui Liu, Zhenxun Zhuang, Yunwei Lei, Chunyang Liao
- Abstract summary: Stochastic Gradient Descent (SGD) might converge slowly in training some deep neural networks.
It remains mysterious whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup.
- Score: 11.461878019780597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In distributed training of deep neural networks or Federated Learning (FL),
people usually run Stochastic Gradient Descent (SGD) or its variants on each
machine and communicate with other machines periodically. However, SGD might
converge slowly in training some deep neural networks (e.g., RNN, LSTM) because
of the exploding gradient issue. Gradient clipping is usually employed to
address this issue in the single machine setting, but exploring this technique
in the FL setting is still in its infancy: it remains mysterious whether the
gradient clipping scheme can take advantage of multiple machines to enjoy
parallel speedup. The main technical difficulty lies in dealing with nonconvex
loss function, non-Lipschitz continuous gradient, and skipping communication
rounds simultaneously. In this paper, we explore a relaxed-smoothness
assumption of the loss landscape which LSTM was shown to satisfy in previous
works and design a communication-efficient gradient clipping algorithm. This
algorithm can be run on multiple machines, where each machine employs a
gradient clipping scheme and communicates with other machines after multiple
steps of gradient-based updates. Our algorithm is proved to have
$O\left(\frac{1}{N\epsilon^4}\right)$ iteration complexity for finding an
$\epsilon$-stationary point, where $N$ is the number of machines. This
indicates that our algorithm enjoys linear speedup. We prove this result by
introducing novel analysis techniques of estimating truncated random variables,
which we believe are of independent interest. Our experiments on several
benchmark datasets and various scenarios demonstrate that our algorithm indeed
exhibits fast convergence in practice and thus validates our theory.
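To make the setting above concrete, the following is a minimal single-process sketch of the general recipe the abstract describes: N simulated workers each take locally clipped stochastic gradient steps and average their parameters only once every several local steps. This is an illustrative sketch, not the paper's exact algorithm or experimental setup; the toy regression model, the clipping threshold gamma, the step size, and the synchronization interval are all assumptions.

```python
# Single-process simulation of communication-efficient clipped SGD:
# each simulated worker runs locally clipped SGD, and the workers
# average their parameters every `sync_interval` local steps.
# Model, data, and hyperparameters are illustrative assumptions.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

N = 4               # number of simulated workers (machines)
sync_interval = 8   # local steps between communication rounds
gamma = 1.0         # gradient-norm clipping threshold
lr = 0.05           # local step size
steps = 200         # total local steps per worker

def make_model():
    return nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))

def local_batch():
    # Toy regression batch; in practice each worker reads its own data shard.
    x = torch.randn(16, 10)
    y = x.sum(dim=1, keepdim=True) + 0.1 * torch.randn(16, 1)
    return x, y

# All workers start from the same initialization.
init = make_model()
workers = [copy.deepcopy(init) for _ in range(N)]

for t in range(steps):
    for model in workers:
        x, y = local_batch()
        loss = nn.functional.mse_loss(model(x), y)
        model.zero_grad()
        loss.backward()
        # Clip the stochastic gradient to norm at most gamma, then step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=gamma)
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad
    # Communication round: parameter averaging across workers, performed
    # only every sync_interval local steps.
    if (t + 1) % sync_interval == 0:
        with torch.no_grad():
            for params in zip(*(m.parameters() for m in workers)):
                avg = torch.stack([p.data for p in params]).mean(dim=0)
                for p in params:
                    p.data.copy_(avg)
```

In an actual multi-machine deployment the averaging step would be an all-reduce over the network, so synchronizing only once every `sync_interval` local steps is what skips communication rounds.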
Related papers
- A Mirror Descent-Based Algorithm for Corruption-Tolerant Distributed Gradient Descent [57.64826450787237]
We show how to analyze the behavior of distributed gradient descent algorithms in the presence of adversarial corruptions.
We show how to use ideas from (lazy) mirror descent to design a corruption-tolerant distributed optimization algorithm.
Experiments based on linear regression, support vector classification, and softmax classification on the MNIST dataset corroborate our theoretical findings.
arXiv Detail & Related papers (2024-07-19T08:29:12Z)
- Distributed Extra-gradient with Optimal Complexity and Communication Guarantees [60.571030754252824]
We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local dual vectors.
Extra-gradient, the de facto algorithm for monotone VI problems, has not been designed to be communication-efficient.
We propose a quantized generalized extra-gradient (Q-GenX), which is an unbiased and adaptive compression method tailored to solve VIs.
arXiv Detail & Related papers (2023-08-17T21:15:04Z)
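For readers unfamiliar with the base method this entry builds on, here is a minimal sketch of the classical extra-gradient update applied to a small bilinear min-max problem, a standard monotone VI. The quantized, adaptive compression that defines Q-GenX is not reproduced, and the problem size and step size are illustrative assumptions.

```python
# Classical extra-gradient on the bilinear game min_x max_y x^T A y,
# whose monotone operator is F(x, y) = (A y, -A^T x). Plain simultaneous
# gradient descent-ascent diverges on this problem; extra-gradient does not.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))

def operator(x, y):
    return A @ y, -A.T @ x

x, y = rng.standard_normal(d), rng.standard_normal(d)
eta = 0.05  # step size, chosen smaller than 1 / ||A||_2
print("initial distance to the saddle point (0, 0):",
      round(float(np.sqrt(x @ x + y @ y)), 3))

for _ in range(5000):
    gx, gy = operator(x, y)
    # Extrapolation step to a half-iterate.
    x_half, y_half = x - eta * gx, y - eta * gy
    # Update step using the operator evaluated at the half-iterate.
    gx_half, gy_half = operator(x_half, y_half)
    x, y = x - eta * gx_half, y - eta * gy_half

print("final distance to the saddle point (0, 0):",
      round(float(np.sqrt(x @ x + y @ y)), 3))
```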
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\left(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\right)$ with the communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Training Overparametrized Neural Networks in Sublinear Time [14.918404733024332]
Deep learning comes at a tremendous computational and energy cost.
We present a new view of binary neural networks as a small subset of search trees.
We believe this view would have further applications in the analysis of deep networks.
arXiv Detail & Related papers (2022-08-09T02:29:42Z)
- Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit [36.17720004582283]
This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits.
We find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time.
arXiv Detail & Related papers (2022-07-18T17:55:05Z)
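The k-sparse parity task referenced in this entry is simple to state, so a small sketch may help: inputs are uniformly random n-bit strings and the label is the XOR of a fixed hidden set of k coordinates, which a small network is then trained to predict with minibatch SGD. The architecture, sizes, and training budget below are illustrative assumptions, not the paper's configuration.

```python
# Learning a k-sparse parity of n bits with a small MLP and minibatch SGD.
# The label of each n-bit input is the XOR of k hidden coordinates.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, k, samples = 30, 3, 20000
support = torch.randperm(n)[:k]             # hidden relevant coordinates

X = torch.randint(0, 2, (samples, n)).float()
y = X[:, support].sum(dim=1).remainder(2)   # parity (XOR) of the k bits

model = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(5000):
    idx = torch.randint(0, samples, (64,))  # minibatch SGD
    loss = loss_fn(model(X[idx]).squeeze(1), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if (step + 1) % 1000 == 0:
        with torch.no_grad():
            acc = ((model(X).squeeze(1) > 0).float() == y).float().mean().item()
        print(f"step {step + 1}: train accuracy {acc:.3f}")
```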
- Combinatorial optimization for low bit-width neural networks [23.466606660363016]
Low bit-width neural networks have been extensively explored for deployment on edge devices to reduce computational resources.
Existing approaches have focused on gradient-based optimization in a two-stage train-and-compress setting.
We show that a combination of greedy coordinate descent and this novel approach can attain competitive accuracy on binary classification tasks.
arXiv Detail & Related papers (2022-06-04T15:02:36Z)
- A quantum algorithm for training wide and deep classical neural networks [72.2614468437919]
We show that conditions amenable to classical trainability via gradient descent coincide with those necessary for efficiently solving quantum linear systems.
We numerically demonstrate that the MNIST image dataset satisfies such conditions.
We provide empirical evidence for $O(\log n)$ training of a convolutional neural network with pooling.
arXiv Detail & Related papers (2021-07-19T23:41:03Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study a distributed algorithm for large-scale AUC maximization with a deep neural network.
Our algorithm requires a much smaller number of communication rounds in theory.
Our experiments on several datasets demonstrate the effectiveness of our algorithm and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)