Efficient-Adam: Communication-Efficient Distributed Adam
- URL: http://arxiv.org/abs/2205.14473v2
- Date: Thu, 24 Aug 2023 09:44:29 GMT
- Title: Efficient-Adam: Communication-Efficient Distributed Adam
- Authors: Congliang Chen, Li Shen, Wei Liu and Zhi-Quan Luo
- Abstract summary: We present a novel communication-efficient distributed Adam, dubbed
Efficient-Adam, in the parameter-server model for stochastic nonconvex optimization.
A two-way quantization scheme reduces the communication cost between the server and workers.
- Score: 28.287237692902476
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed adaptive stochastic gradient methods have been widely used for
large-scale nonconvex optimization, such as training deep learning models.
However, their communication complexity on finding $\varepsilon$-stationary
points has rarely been analyzed in the nonconvex setting. In this work, we
present a novel communication-efficient distributed Adam in the
parameter-server model for stochastic nonconvex optimization, dubbed {\em
Efficient-Adam}. Specifically, we incorporate a two-way quantization scheme
into Efficient-Adam to reduce the communication cost between the workers and
server. Simultaneously, we adopt a two-way error feedback strategy to reduce
the biases caused by the two-way quantization on both the server and workers,
respectively. In addition, we establish the iteration complexity for the
proposed Efficient-Adam with a class of quantization operators, and further
characterize its communication complexity between the server and workers when
an $\varepsilon$-stationary point is achieved. Finally, we apply Efficient-Adam
to solve a toy stochastic convex optimization problem and train deep learning
models on real-world vision and language tasks. Extensive experiments together
with a theoretical guarantee justify the merits of Efficient-Adam.
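As an illustration of the two-way quantization and error-feedback scheme described above, here is a minimal Python sketch, assuming a simple uniform quantizer and NumPy arrays; the function names, the `state` dictionary, and the quantizer are illustrative stand-ins rather than the paper's actual operators or implementation.

```python
import numpy as np

def quantize(v, num_levels=16):
    # Simple uniform quantizer; a stand-in for the general class of
    # quantization operators analyzed in the paper.
    scale = np.max(np.abs(v)) + 1e-12
    return np.round(v / scale * num_levels) / num_levels * scale

def efficient_adam_round(worker_grads, state, lr=1e-3,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    # One communication round: workers quantize their gradients with
    # error feedback, the server aggregates, takes an Adam-style step,
    # and sends back a quantized update, also with error feedback.

    # Worker side: quantize (gradient + accumulated error), keep residual.
    msgs = []
    for i, g in enumerate(worker_grads):
        corrected = g + state["worker_err"][i]
        q = quantize(corrected)
        state["worker_err"][i] = corrected - q
        msgs.append(q)

    # Server side: aggregate and update the Adam moments.
    g_hat = np.mean(msgs, axis=0)
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_hat
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_hat ** 2
    update = lr * state["m"] / (np.sqrt(state["v"]) + eps)

    # Server side: quantize (update + accumulated error), keep residual.
    corrected = update + state["server_err"]
    q_update = quantize(corrected)
    state["server_err"] = corrected - q_update

    # Every worker applies the same quantized update it receives.
    return q_update
```

In this sketch, `state` would be initialized with zero moments and zero error buffers of the model's shape; the two error buffers keep the quantization residuals from being discarded, which is how two-way error feedback reduces the bias introduced by compressing traffic in both directions.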
Related papers
- Towards Communication-efficient Federated Learning via Sparse and Aligned Adaptive Optimization [65.85963235502322]
Federated Adam (FedAdam) algorithms suffer from a threefold increase in uplink communication overhead.
We propose a novel sparse FedAdam algorithm called FedAdam-SSM, wherein distributed devices sparsify the updates to both the local model parameters and the moment estimates.
By minimizing the divergence bound between the model trained by FedAdam-SSM and centralized Adam, we optimize the SSM to mitigate the learning performance degradation caused by the sparsification error (a minimal shared-mask sketch follows).
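As a rough sketch (assuming SSM denotes a shared sparse mask), the following Python snippet applies one top-k magnitude mask to the local model update and both moment estimates so that a single index set is sent uplink; the top-k rule and all names here are placeholders, since the paper instead optimizes the mask via the divergence bound.

```python
import numpy as np

def shared_topk_mask(delta_w, k):
    # Placeholder mask: keep the k largest-magnitude coordinates of the
    # local model update (the paper optimizes the mask instead).
    idx = np.argpartition(np.abs(delta_w), -k)[-k:]
    mask = np.zeros_like(delta_w, dtype=bool)
    mask[idx] = True
    return mask

def sparsify_upload(delta_w, m, v, k):
    # Apply one shared mask to the model update and both Adam moment
    # estimates, so a single index set describes all three uploads.
    mask = shared_topk_mask(delta_w, k)
    return delta_w[mask], m[mask], v[mask], np.flatnonzero(mask)
```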
arXiv Detail & Related papers (2024-05-28T07:56:49Z) - AdamL: A fast adaptive gradient method incorporating loss function [1.6025685183216696]
We propose AdamL, a novel variant of Adam that takes into account loss function information to attain better results.
We show that AdamL achieves either the fastest convergence or the lowest objective function values when compared to Adam, EAdam, and AdaBelief.
In the case of vanilla convolutional neural networks, AdamL stands out from the other Adam variants and does not require manual adjustment of the learning rate during the later stage of training.
arXiv Detail & Related papers (2023-12-23T16:32:29Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\big(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\big)$ with a communication cost of $O(k \log d)$ at each iteration (a toy sketching example follows).
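For intuition only, a single-row count sketch shows how a $d$-dimensional gradient can be hashed into $k$ buckets before transmission; this toy version (with hypothetical names, and a shared seed standing in for a shared hash function) is not the paper's exact sketching operator.

```python
import numpy as np

def sketch_gradient(g, k, seed=0):
    # Hash each coordinate into one of k buckets with a random sign and
    # accumulate; sender and receiver share the seed (i.e., the hashes).
    rng = np.random.default_rng(seed)
    bucket = rng.integers(0, k, size=g.shape[0])
    sign = rng.choice([-1.0, 1.0], size=g.shape[0])
    sketch = np.zeros(k)
    np.add.at(sketch, bucket, sign * g)
    return sketch

def unsketch_gradient(sketch, d, k, seed=0):
    # Rebuild the same hashes and read each coordinate's bucket back,
    # giving an unbiased (but noisy) estimate of the original gradient.
    rng = np.random.default_rng(seed)
    bucket = rng.integers(0, k, size=d)
    sign = rng.choice([-1.0, 1.0], size=d)
    return sign * sketch[bucket]
```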
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Decentralized Gossip-Based Stochastic Bilevel Optimization over
Communication Networks [42.76623191830371]
We propose a gossip-based distributed bilevel optimization algorithm.
Networked agents can solve both the inner and outer optimization problems in a single timescale.
Our algorithm achieves state-of-the-art computational efficiency and test accuracy.
arXiv Detail & Related papers (2022-06-22T06:38:54Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Minimax
Problems [80.46370778277186]
Large scale convex-concave minimax problems arise in numerous applications, including game theory, robust training, and training of generative adversarial networks.
We develop a communication-efficient distributed extragradient algorithm, LocalAdaSient, with an adaptive learning rate suitable for solving convex-concave minimax problems in the Parameter-Server model.
We demonstrate its efficacy through several experiments in both the homogeneous and heterogeneous settings.
arXiv Detail & Related papers (2021-06-18T09:42:05Z) - Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the one-class nature of the problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z) - Towards Practical Adam: Non-Convexity, Convergence Theory, and
Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks.
Existing approaches, such as decreasing the adaptive learning rate or adopting a large batch size, have been used to help Adam-type algorithms converge.
We introduce an alternative, easy-to-check sufficient condition, which merely depends on the parameters of the historical base learning rates.
arXiv Detail & Related papers (2021-01-14T06:42:29Z) - APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD
Algorithm [39.110478306078974]
Adam is an important optimization algorithm for guaranteeing efficiency and accuracy when training models on important tasks such as BERT and ImageNet.
We propose a communication-efficient Adam-preconditioned momentum SGD algorithm, named APMSqueeze, which compresses gradients via an error-compensated method.
arXiv Detail & Related papers (2020-08-26T02:20:23Z) - Straggler-aware Distributed Learning: Communication Computation Latency
Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)