1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training
with LAMB's Convergence Speed
- URL: http://arxiv.org/abs/2104.06069v1
- Date: Tue, 13 Apr 2021 10:07:49 GMT
- Title: 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training
with LAMB's Convergence Speed
- Authors: Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari,
Yuxiong He
- Abstract summary: We propose a new communication-efficient algorithm, 1-bit LAMB, which supports adaptive layerwise learning rates even when communication is compressed.
For the BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations demonstrate that 1-bit LAMB with the NCCL-based backend achieves up to 4.6x communication volume reduction.
- Score: 17.953619054149378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To train large models (like BERT and GPT-3) with hundreds or even thousands
of GPUs, communication has become a major bottleneck, especially on
commodity systems with limited-bandwidth TCP interconnect networks. On one side,
large-batch optimization such as the LAMB algorithm was proposed to reduce the
number of communications. On the other side, communication compression
algorithms such as 1-bit SGD and 1-bit Adam help to reduce the volume of each
communication. However, we find that simply using one of the techniques is not
sufficient to solve the communication challenge, especially on low-bandwidth
Ethernet networks. Motivated by this, we aim to combine the power of large-batch
optimization and communication compression, but we find that existing
compression strategies cannot be directly applied to LAMB due to its unique
adaptive layerwise learning rates. To this end, we design a new
communication-efficient algorithm, 1-bit LAMB, which introduces a novel way to
support adaptive layerwise learning rates even when communication is
compressed. In addition, we introduce a new system implementation for
compressed communication using the NCCL backend of PyTorch distributed, which
improves both usability and performance compared to the existing MPI-based
implementation. For the BERT-Large pre-training task with batch sizes from 8K to
64K, our evaluations on up to 256 GPUs demonstrate that 1-bit LAMB with
NCCL-based backend is able to achieve up to 4.6x communication volume
reduction, up to 2.8x end-to-end speedup (in terms of number of training
samples per second), and the same convergence speed (in terms of number of
pre-training samples to reach the same accuracy on fine-tuning tasks) compared
to uncompressed LAMB.
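The abstract refers to 1-bit communication compression without spelling out the mechanics. The following is a minimal sketch of the general idea behind sign-based (1-bit) compression with error compensation, as commonly described for 1-bit SGD and 1-bit Adam. It is not the 1-bit LAMB algorithm itself and does not reproduce the paper's handling of layerwise learning rates. The function names `one_bit_compress` and `one_bit_decompress` are hypothetical, and NumPy stands in for the distributed backend.

```python
import numpy as np

def one_bit_compress(tensor, error):
    """Reduce a tensor to one sign per element plus a single scale.

    `error` is the residual left over from the previous compression step;
    adding it back before compressing is the error-compensation step.
    (Illustrative helper, not an API from the paper or any library.)
    """
    corrected = tensor + error
    # One shared magnitude: the mean absolute value of the corrected tensor.
    scale = np.mean(np.abs(corrected))
    signs = np.where(corrected >= 0, 1, -1).astype(np.int8)
    # Residual that the 1-bit representation could not capture.
    new_error = corrected - scale * signs
    return signs, scale, new_error

def one_bit_decompress(signs, scale):
    """Rebuild the approximate tensor from signs and the shared scale."""
    return scale * signs.astype(np.float64)

rng = np.random.default_rng(0)
grad = rng.normal(size=8)          # stand-in for a value to be communicated
error = np.zeros_like(grad)        # no residual before the first step

signs, scale, error = one_bit_compress(grad, error)
restored = one_bit_decompress(signs, scale)
```

In this sketch only `signs` (one bit of information per element) and `scale` would need to be sent, while `error` stays local and is folded into the next step's input so that the compression loss is not discarded.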
Related papers
- LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising quality.
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [10.233937665979694]
DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications.
A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices.
We introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training.
arXiv Detail & Related papers (2024-07-05T05:55:18Z) - Communication-Efficient Federated Learning with Adaptive Compression under Dynamic Bandwidth [6.300376113680886]
Federated learning can train models without directly providing local data to the server.
Recent work has improved the communication efficiency of federated learning mainly through model compression.
We evaluate the AdapComFL algorithm and compare it with existing algorithms.
arXiv Detail & Related papers (2024-05-06T08:00:43Z) - Accelerating Distributed Deep Learning using Lossless Homomorphic
Compression [17.654138014999326]
We introduce a novel compression algorithm that effectively merges worker-level compression with in-network aggregation.
We show up to a 6.33$\times$ improvement in aggregation throughput and a 3.74$\times$ increase in per-iteration training speed.
arXiv Detail & Related papers (2024-02-12T09:57:47Z) - DeAR: Accelerating Distributed Deep Learning with Fine-Grained
All-Reduce Pipelining [22.168137965177284]
Communication scheduling has been shown to be effective in accelerating distributed training.
We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations.
We show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions.
arXiv Detail & Related papers (2023-02-24T04:11:18Z) - TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation [53.84175614198885]
In distributed optimization and learning, several machines alternate between local computations in parallel and communication with a distant server.
We propose TAMUNA, the first algorithm for distributed optimization that leverages the two strategies of local training and compression jointly and allows for partial participation.
arXiv Detail & Related papers (2023-02-20T08:37:44Z) - DistGNN: Scalable Distributed Training for Large-Scale Graph Neural
Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z) - A Linearly Convergent Algorithm for Decentralized Optimization: Sending
Less Bits for Free! [72.31332210635524]
Decentralized optimization methods enable on-device training of machine learning models without a central coordinator.
We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators.
We prove that our method can solve the problems without any increase in the number of communications compared to the baseline.
arXiv Detail & Related papers (2020-11-03T13:35:53Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z) - PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by PowerSGD for centralized deep learning, this algorithm uses power steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z) - Is Network the Bottleneck of Distributed Training? [36.925680383195356]
We take a first-principles approach to measure and analyze the network performance of distributed training.
We find that the network is running at low utilization and that if the network can be fully utilized, distributed training can achieve a scaling factor of close to one.
arXiv Detail & Related papers (2020-06-17T19:00:31Z)