1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training
with LAMB's Convergence Speed
- URL: http://arxiv.org/abs/2104.06069v1
- Date: Tue, 13 Apr 2021 10:07:49 GMT
- Title: 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training
with LAMB's Convergence Speed
- Authors: Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari,
Yuxiong He
- Abstract summary: We propose a new communication-efficient algorithm, 1-bit LAMB, which supports adaptive layerwise learning rates even when communication is compressed.
For the BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations demonstrate that 1-bit LAMB with the NCCL-based backend achieves up to a 4.6x communication volume reduction.
- Score: 17.953619054149378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To train large models (like BERT and GPT-3) with hundreds or even thousands
of GPUs, communication has become a major bottleneck, especially on
commodity systems with limited-bandwidth TCP interconnect networks. On one side,
large-batch optimization algorithms such as LAMB were proposed to reduce the
number of communication rounds. On the other side, communication compression
algorithms such as 1-bit SGD and 1-bit Adam help to reduce the volume of each
communication. However, we find that simply using one of the techniques is not
sufficient to solve the communication challenge, especially on low-bandwidth
Ethernet networks. Motivated by this, we aim to combine the power of large-batch
optimization and communication compression, but we find that existing
compression strategies cannot be directly applied to LAMB due to its unique
adaptive layerwise learning rates. To this end, we design a new
communication-efficient algorithm, 1-bit LAMB, which introduces a novel way to
support adaptive layerwise learning rates even when communication is
compressed. In addition, we introduce a new system implementation for
compressed communication using the NCCL backend of PyTorch distributed, which
improves both usability and performance compared to the existing MPI-based
implementation. For the BERT-Large pre-training task with batch sizes from 8K to
64K, our evaluations on up to 256 GPUs demonstrate that 1-bit LAMB with the
NCCL-based backend achieves up to a 4.6x communication volume
reduction, up to 2.8x end-to-end speedup (in terms of number of training
samples per second), and the same convergence speed (in terms of number of
pre-training samples to reach the same accuracy on fine-tuning tasks) compared
to uncompressed LAMB.
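The sketch below illustrates the two ideas the abstract combines: 1-bit (sign) compression with error feedback, and a layerwise scaling ratio that is frozen after a full-precision warmup so that adaptive layerwise learning rates survive compression. It is a minimal, single-process PyTorch illustration; the helper names (one_bit_compress, lamb_like_step), the fixed frozen_ratio value, and the toy usage are assumptions for exposition rather than the paper's DeepSpeed implementation, and in the real system the compressed momentum would be exchanged across GPUs via the NCCL backend of torch.distributed.

# Illustrative sketch only: 1-bit compression with error feedback plus a frozen
# layerwise scaling ratio. Hypothetical helper names; no real NCCL/MPI traffic.
import torch


def one_bit_compress(tensor, error_buffer):
    """Compress to signs plus one scale, keeping the quantization residual locally.

    error_buffer carries the compression error from the previous step so that
    information lost to quantization is re-injected (error feedback).
    """
    corrected = tensor + error_buffer
    scale = corrected.abs().mean()              # one fp32 scalar per tensor
    signs = torch.sign(corrected)               # sign of each element (1 bit on the wire)
    compressed = scale * signs
    error_buffer.copy_(corrected - compressed)  # residual stays on the local worker
    return compressed


def lamb_like_step(param, compressed_momentum, frozen_ratio, lr=1e-3):
    """Apply an update using a *frozen* layerwise scaling ratio.

    LAMB normally recomputes the layerwise ratio every step; under 1-bit
    compression this sketch reuses a ratio captured during a full-precision
    warmup phase, which is the idea the abstract alludes to.
    """
    param.add_(compressed_momentum, alpha=-lr * frozen_ratio)


# Toy usage for a single layer (hypothetical values)
w = torch.randn(1024)
momentum = torch.randn_like(w)   # stand-in for the optimizer's momentum term
err = torch.zeros_like(w)        # persistent error-feedback buffer
frozen_ratio = 0.9               # would be measured during the warmup phase

compressed = one_bit_compress(momentum, err)
lamb_like_step(w, compressed, frozen_ratio)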
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - Accelerating Large Language Model Training with Hybrid GPU-based Compression [3.204387803072905]
Compression-assisted MPI libraries have been proven to reduce message size significantly and better leverage interconnect bandwidth.
We investigate the efficacy of compression-assisted MPI collectives under the context of distributed Large Language Model (LLM) training.
arXiv Detail & Related papers (2024-09-04T04:05:30Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising quality.
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves compressed-communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [10.233937665979694]
DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications.
A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices.
We introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training.
arXiv Detail & Related papers (2024-07-05T05:55:18Z) - Communication-Efficient Federated Learning with Adaptive Compression under Dynamic Bandwidth [6.300376113680886]
Federated learning can train models without directly providing local data to the server.
Recent work has improved the communication efficiency of federated learning mainly through model compression.
We evaluate the performance of the AdapComFL algorithm and compare it with existing algorithms.
arXiv Detail & Related papers (2024-05-06T08:00:43Z) - Accelerating Distributed Deep Learning using Lossless Homomorphic
Compression [17.654138014999326]
We introduce a novel compression algorithm that effectively merges worker-level compression with in-network aggregation.
We show up to a 6.33x improvement in aggregation throughput and a 3.74x increase in per-iteration training speed.
arXiv Detail & Related papers (2024-02-12T09:57:47Z) - DeAR: Accelerating Distributed Deep Learning with Fine-Grained
All-Reduce Pipelining [22.168137965177284]
Communication scheduling has been shown to be effective in accelerating distributed training.
We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations.
We show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions.
arXiv Detail & Related papers (2023-02-24T04:11:18Z) - TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation [53.84175614198885]
In distributed optimization and learning, several machines alternate between local computations in parallel and communication with a distant server.
We propose TAMUNA, the first algorithm for distributed optimization that jointly leverages the two strategies of local training and compression and allows for partial participation.
arXiv Detail & Related papers (2023-02-20T08:37:44Z) - A Linearly Convergent Algorithm for Decentralized Optimization: Sending
Less Bits for Free! [72.31332210635524]
Decentralized optimization methods enable on-device training of machine learning models without a central coordinator.
We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators.
We prove that our method can solve the problems without any increase in the number of communications compared to the baseline.
arXiv Detail & Related papers (2020-11-03T13:35:53Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)