ZeRO++: Extremely Efficient Collective Communication for Giant Model
Training
- URL: http://arxiv.org/abs/2306.10209v1
- Date: Fri, 16 Jun 2023 23:26:19 GMT
- Title: ZeRO++: Extremely Efficient Collective Communication for Giant Model
Training
- Authors: Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam
Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He
- Abstract summary: This paper introduces three communication volume reduction techniques, which we collectively refer to as ZeRO++, targeting each of the communication collectives in ZeRO.
Collectively, ZeRO++ reduces communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at 384 GPU scale.
- Score: 14.608109247317154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large
language models on massive GPU clusters due to its ease of use, efficiency,
and good scalability. However, when training on low-bandwidth clusters, or at
a scale that forces the batch size per GPU to be small, ZeRO's effective
throughput is limited by the high communication volume of gathering weights in
the forward and backward passes and of averaging gradients. This paper
introduces three communication volume reduction techniques, which we
collectively refer to as ZeRO++, targeting each of the communication
collectives in ZeRO. The first is a block-quantization-based all-gather. The
second is a data remapping that trades off communication for more memory. The
third is a novel all-to-all-based quantized gradient averaging paradigm that
replaces the reduce-scatter collective and preserves accuracy despite
communicating low-precision data. Collectively, ZeRO++ reduces the
communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at
384-GPU scale.
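To make the 4x figure concrete, the back-of-the-envelope accounting below compares per-GPU, per-iteration communication volume under ZeRO-3 with the three techniques above applied. The int8 weight all-gather, the node-local secondary weight copy, and the int4 gradient traffic are assumptions consistent with the abstract but not stated in it; this is a sketch of how the numbers can add up, not the paper's own accounting.

```latex
% Cross-node communication per GPU per iteration, in fp16-element equivalents,
% for a model with M parameters. Quantization widths (int8 weights, int4
% gradients) and the intra-node secondary weight copy are assumptions.
\begin{align*}
V_{\text{ZeRO-3}} &= \underbrace{M}_{\text{fwd weight all-gather}}
  + \underbrace{M}_{\text{bwd weight all-gather}}
  + \underbrace{M}_{\text{gradient reduce-scatter}} = 3M \\
V_{\text{ZeRO++}} &\approx \underbrace{0.5M}_{\text{int8 fwd all-gather}}
  + \underbrace{0}_{\text{bwd gather served intra-node}}
  + \underbrace{0.25M}_{\text{int4 gradient all-to-all}} = 0.75M \\
\frac{V_{\text{ZeRO-3}}}{V_{\text{ZeRO++}}} &= \frac{3M}{0.75M} = 4
\end{align*}
```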
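As an illustration of the first technique, the sketch below shows the general shape of a block-quantized all-gather in PyTorch: each rank quantizes its weight shard to int8 with one scale per block, all-gathers the int8 payload plus the small per-block scales, and dequantizes locally. This is a minimal sketch of the idea, not the ZeRO++ implementation; the function names, block size, and symmetric int8 scheme are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def block_quantize(x: torch.Tensor, block_size: int = 256):
    """Quantize a flat tensor to int8 with one symmetric scale per block."""
    blocks = x.float().reshape(-1, block_size)       # assumes numel % block_size == 0
    scales = blocks.abs().amax(dim=1, keepdim=True) / 127.0
    scales = scales.clamp(min=1e-8)                  # guard all-zero blocks
    q = (blocks / scales).round().clamp(-127, 127).to(torch.int8)
    return q, scales

def block_dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Invert block_quantize back to a flat float tensor."""
    return (q.float() * scales).reshape(-1)

def quantized_all_gather(shard: torch.Tensor, world_size: int) -> torch.Tensor:
    """All-gather a parameter shard in int8 rather than fp16, dequantizing locally."""
    q, scales = block_quantize(shard)
    q_out = [torch.empty_like(q) for _ in range(world_size)]
    s_out = [torch.empty_like(scales) for _ in range(world_size)]
    dist.all_gather(q_out, q)        # 1 byte per element on the wire instead of 2
    dist.all_gather(s_out, scales)   # small per-block metadata
    return torch.cat([block_dequantize(qi, si) for qi, si in zip(q_out, s_out)])
```

In a real system the quantization kernels and receive buffers would be fused and pre-allocated; the point of the sketch is only the roughly 2x reduction in bytes on the wire for the weight all-gather.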
Related papers
- Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch [66.84195842685459]
Training of large language models (LLMs) is typically distributed across a large number of co-located accelerators to reduce training time.
Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint.
We show experimentally that we can distribute the training of billion-parameter models and reach quality similar to before.
arXiv Detail & Related papers (2025-01-30T17:23:50Z) - Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning [2.685330831042324]
We propose a collection of communication and optimization strategies for ZeRO++ to reduce communication costs and improve memory utilization.
For a 20B GPT model, we observe a 1.71x increase in TFLOPS per GPU compared with ZeRO++, and a scaling efficiency of 0.94, at up to 384 GCDs.
arXiv Detail & Related papers (2025-01-08T04:19:57Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - Gravity-aligned Rotation Averaging with Circular Regression [53.81374943525774]
We introduce a principled approach that integrates gravity direction into the rotation averaging phase of global pipelines.
We achieve state-of-the-art accuracy on four large-scale datasets.
arXiv Detail & Related papers (2024-10-16T17:37:43Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising quality.
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - T3: Transparent Tracking & Triggering for Fine-grained Overlap of
Compute & Collectives [1.908240145212707]
Large Language Models increasingly rely on distributed techniques for their training and inference.
Such techniques inherently serialize communication with model execution.
One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner.
We propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute.
arXiv Detail & Related papers (2024-01-30T01:55:34Z) - Rethinking Memory and Communication Cost for Efficient Large Language
Model Training [25.640899145028296]
We rethink the impact of memory consumption and communication costs on the training speed of large language models.
Our experiments demonstrate that PaRO significantly improves training throughput by 1.19x-2.50x compared to the SOTA method.
The HO-Ring algorithm improves communication efficiency by 36.5% compared to the traditional Ring algorithm.
arXiv Detail & Related papers (2023-10-09T15:08:32Z) - Communication-Efficient TeraByte-Scale Model Training Framework for
Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in the existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z) - Synthesizing Collective Communication Algorithms for Heterogeneous
Networks with TACCL [1.5528708400965123]
We present TACCL, a synthesizer for collective communication primitives for large-scale multi-GPU systems.
TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms.
Using TACCL's algorithms speeds up the end-to-end training of an internal mixture-of-experts model by 17%.
arXiv Detail & Related papers (2021-11-08T23:20:52Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with the others and updates the parameters using the average of all workers' gradients.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.