1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training
with LAMB's Convergence Speed
- URL: http://arxiv.org/abs/2104.06069v1
- Date: Tue, 13 Apr 2021 10:07:49 GMT
- Title: 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training
with LAMB's Convergence Speed
- Authors: Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari,
Yuxiong He
- Abstract summary: We propose a new communication-efficient algorithm, 1-bit LAMB, which supports adaptive layerwise learning rates even when communication is compressed.
For the BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations demonstrate that 1-bit LAMB with the NCCL-based backend achieves up to a 4.6x communication volume reduction.
- Score: 17.953619054149378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To train large models (like BERT and GPT-3) with hundreds or even thousands
of GPUs, communication has become a major bottleneck, especially on
commodity systems with limited-bandwidth TCP interconnect networks. On one side,
large-batch optimization algorithms such as LAMB were proposed to reduce the
number of communication rounds. On the other side, communication compression
algorithms such as 1-bit SGD and 1-bit Adam help to reduce the volume of each
communication. However, we find that simply using one of the techniques is not
sufficient to solve the communication challenge, especially on low-bandwidth
Ethernet networks. Motivated by this, we aim to combine the power of large-batch
optimization and communication compression, but we find that existing
compression strategies cannot be directly applied to LAMB due to its unique
adaptive layerwise learning rates. To this end, we design a new
communication-efficient algorithm, 1-bit LAMB, which introduces a novel way to
support adaptive layerwise learning rates even when communication is
compressed. In addition, we introduce a new system implementation for
compressed communication using the NCCL backend of PyTorch distributed, which
improves both usability and performance compared to the existing MPI-based
implementation. For the BERT-Large pre-training task with batch sizes from 8K to
64K, our evaluations on up to 256 GPUs demonstrate that 1-bit LAMB with the
NCCL-based backend achieves up to a 4.6x communication volume
reduction, up to 2.8x end-to-end speedup (in terms of number of training
samples per second), and the same convergence speed (in terms of number of
pre-training samples to reach the same accuracy on fine-tuning tasks) compared
to uncompressed LAMB.
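The sketch below illustrates the two ideas the abstract combines: 1-bit (sign) compression with error feedback, and a layerwise scaling ratio that is frozen after a full-precision warmup so that adaptive layerwise learning rates survive compression. It is a minimal, single-process PyTorch illustration; the helper names (one_bit_compress, lamb_like_step), the fixed frozen_ratio value, and the toy usage are assumptions for exposition rather than the paper's DeepSpeed implementation, and in the real system the compressed momentum would be exchanged across GPUs via the NCCL backend of torch.distributed.

# Illustrative sketch only: 1-bit compression with error feedback plus a frozen
# layerwise scaling ratio. Hypothetical helper names; no real NCCL/MPI traffic.
import torch


def one_bit_compress(tensor, error_buffer):
    """Compress to signs plus one scale, keeping the quantization residual locally.

    error_buffer carries the compression error from the previous step so that
    information lost to quantization is re-injected (error feedback).
    """
    corrected = tensor + error_buffer
    scale = corrected.abs().mean()              # one fp32 scalar per tensor
    signs = torch.sign(corrected)               # sign of each element (1 bit on the wire)
    compressed = scale * signs
    error_buffer.copy_(corrected - compressed)  # residual stays on the local worker
    return compressed


def lamb_like_step(param, compressed_momentum, frozen_ratio, lr=1e-3):
    """Apply an update using a *frozen* layerwise scaling ratio.

    LAMB normally recomputes the layerwise ratio every step; under 1-bit
    compression this sketch reuses a ratio captured during a full-precision
    warmup phase, which is the idea the abstract alludes to.
    """
    param.add_(compressed_momentum, alpha=-lr * frozen_ratio)


# Toy usage for a single layer (hypothetical values)
w = torch.randn(1024)
momentum = torch.randn_like(w)   # stand-in for the optimizer's momentum term
err = torch.zeros_like(w)        # persistent error-feedback buffer
frozen_ratio = 0.9               # would be measured during the warmup phase

compressed = one_bit_compress(momentum, err)
lamb_like_step(w, compressed, frozen_ratio)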
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - Accelerating Large Language Model Training with Hybrid GPU-based Compression [3.204387803072905]
Compression-assisted MPI libraries have been proven to reduce message size significantly and better leverage interconnect bandwidth.
We investigate the efficacy of compression-assisted MPI collectives under the context of distributed Large Language Model (LLM) training.
arXiv Detail & Related papers (2024-09-04T04:05:30Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising quality.
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves compressed-communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [10.233937665979694]
DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications.
A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices.
We introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training.
arXiv Detail & Related papers (2024-07-05T05:55:18Z) - Communication-Efficient Federated Learning with Adaptive Compression under Dynamic Bandwidth [6.300376113680886]
Federated learning can train models without directly providing local data to the server.
Recent work has improved the communication efficiency of federated learning mainly through model compression.
We evaluate the performance of the AdapComFL algorithm and compare it with existing algorithms.
arXiv Detail & Related papers (2024-05-06T08:00:43Z) - Accelerating Distributed Deep Learning using Lossless Homomorphic
Compression [17.654138014999326]
We introduce a novel compression algorithm that effectively merges worker-level compression with in-network aggregation.
We show up to a 6.33x improvement in aggregation throughput and a 3.74x increase in per-iteration training speed.
arXiv Detail & Related papers (2024-02-12T09:57:47Z) - DeAR: Accelerating Distributed Deep Learning with Fine-Grained
All-Reduce Pipelining [22.168137965177284]
Communication scheduling has been shown to be effective in accelerating distributed training.
We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations.
We show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions.
arXiv Detail & Related papers (2023-02-24T04:11:18Z) - TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation [53.84175614198885]
In distributed optimization and learning, several machines alternate between local computations in parallel and communication with a distant server.
We propose TAMUNA, the first algorithm for distributed optimization that jointly leverages the two strategies of local training and compression and allows for partial participation.
arXiv Detail & Related papers (2023-02-20T08:37:44Z) - A Linearly Convergent Algorithm for Decentralized Optimization: Sending
Less Bits for Free! [72.31332210635524]
Decentralized optimization methods enable on-device training of machine learning models without a central coordinator.
We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators.
We prove that our method can solve the problems without any increase in the number of communications compared to the baseline.
arXiv Detail & Related papers (2020-11-03T13:35:53Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)