1-bit Adam: Communication Efficient Large-Scale Training with Adam's
Convergence Speed
- URL: http://arxiv.org/abs/2102.02888v1
- Date: Thu, 4 Feb 2021 21:02:19 GMT
- Title: 1-bit Adam: Communication Efficient Large-Scale Training with Adam's
Convergence Speed
- Authors: Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari,
Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He
- Abstract summary: Communication has become a major bottleneck on commodity systems with standard TCP interconnects that offer limited network bandwidth.
One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression.
We propose 1-bit Adam that reduces the communication volume by up to $5\times$, offers much better scalability, and provides the same convergence speed as uncompressed Adam.
- Score: 39.23129626683372
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Scalable training of large models (like BERT and GPT-3) requires careful
optimization rooted in model design, architecture, and system capabilities.
From a system standpoint, communication has become a major bottleneck,
especially on commodity systems with standard TCP interconnects that offer
limited network bandwidth. Communication compression is an important technique
to reduce training time on such systems. One of the most effective methods is
error-compensated compression, which offers robust convergence speed even under
1-bit compression. However, state-of-the-art error compensation techniques only
work with basic optimizers like SGD and momentum SGD, which are linearly
dependent on the gradients. They do not work with non-linear gradient-based
optimizers like Adam, which offer state-of-the-art convergence efficiency and
accuracy for models like BERT. In this paper, we propose 1-bit Adam that
reduces the communication volume by up to $5\times$, offers much better
scalability, and provides the same convergence speed as uncompressed Adam. Our
key finding is that Adam's variance (non-linear term) becomes stable (after a
warmup phase) and can be used as a fixed precondition for the rest of the
training (compression phase). Experiments on up to 256 GPUs show that 1-bit
Adam enables up to $3.3\times$ higher throughput for BERT-Large pre-training
and up to $2.9\times$ higher throughput for SQuAD fine-tuning. In addition, we
provide theoretical analysis for our proposed work.
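To make the two-phase recipe in the abstract concrete (full-precision Adam during a warmup phase, then the frozen variance used as a fixed preconditioner while the momentum update goes through error-compensated 1-bit compression), here is a minimal single-process sketch. This is not the DeepSpeed implementation: the function name, the sign-plus-mean-magnitude compressor, and the omission of the distributed all-reduce and bias correction are simplifying assumptions of this summary; `params` is assumed to be a tensor and `grads` an iterable of same-shaped gradient tensors.

```python
import torch

def one_bit_adam_sketch(params, grads, warmup_steps,
                        lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative two-phase loop (single process, no all-reduce):
    full-precision Adam during warmup, then momentum updates preconditioned
    by the frozen variance, with error-compensated 1-bit compression."""
    m = torch.zeros_like(params)      # first moment (momentum)
    v = torch.zeros_like(params)      # second moment (variance)
    error = torch.zeros_like(params)  # local compression-error feedback

    for step, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        if step <= warmup_steps:
            # Warmup phase: vanilla Adam, uncompressed momentum and variance.
            v = beta2 * v + (1 - beta2) * g * g
        else:
            # Compression phase: v stays frozen as a fixed preconditioner.
            # Error-compensated 1-bit compression: transmit only the sign,
            # scaled so the average magnitude is preserved.
            compensated = m + error
            scale = compensated.abs().mean()
            compressed = scale * compensated.sign()  # 1-bit payload + 1 scalar
            error = compensated - compressed         # carry the loss forward
            m = compressed                           # value all workers would share
        params = params - lr * m / (v.sqrt() + eps)
    return params
```

Freezing v is what makes the update linear in the momentum, which is exactly the property that lets error-compensated compression retain the convergence behavior of the uncompressed optimizer, as the abstract argues.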
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising training quality.
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Maximizing Communication Efficiency for Large-scale Training via 0/1
Adam [49.426602335460295]
1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD.
We propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel methods.
arXiv Detail & Related papers (2022-02-12T08:02:23Z) - ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z) - APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD
Algorithm [39.110478306078974]
Adam is an important optimization algorithm for guaranteeing efficiency and accuracy when training many important tasks such as BERT and ImageNet.
We propose a communication efficient Adam-preconditioned Momentum SGD algorithm, named APMSqueeze, through an error-compensated method that compresses gradients.
arXiv Detail & Related papers (2020-08-26T02:20:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.