ZeRO++: Extremely Efficient Collective Communication for Giant Model
Training
- URL: http://arxiv.org/abs/2306.10209v1
- Date: Fri, 16 Jun 2023 23:26:19 GMT
- Title: ZeRO++: Extremely Efficient Collective Communication for Giant Model
Training
- Authors: Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam
Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He
- Abstract summary: This paper introduces three communication volume reduction techniques, which we collectively refer to as ZeRO++, targeting each of the communication collectives in ZeRO.
Collectively, ZeRO++ reduces communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at 384 GPU scale.
- Score: 14.608109247317154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large
language models on massive GPU clusters due to its ease of use, efficiency,
and good scalability. However, when training on low-bandwidth clusters, or at
a scale that forces the batch size per GPU to be small, ZeRO's effective
throughput is limited by the high communication volume of gathering weights in
the forward and backward passes and of averaging gradients. This paper
introduces three communication volume reduction techniques, which we
collectively refer to as ZeRO++, targeting each of the communication
collectives in ZeRO. The first is a block-quantization-based all-gather. The
second is a data remapping that trades off communication for more memory. The
third is a novel all-to-all-based quantized gradient averaging paradigm that
replaces the reduce-scatter collective and preserves accuracy despite
communicating low-precision data. Collectively, ZeRO++ reduces the
communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at
384-GPU scale.
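To make the 4x figure concrete, the back-of-the-envelope accounting below compares per-GPU, per-iteration communication volume under ZeRO-3 with the three techniques above applied. The int8 weight all-gather, the node-local secondary weight copy, and the int4 gradient traffic are assumptions consistent with the abstract but not stated in it; this is a sketch of how the numbers can add up, not the paper's own accounting.

```latex
% Cross-node communication per GPU per iteration, in fp16-element equivalents,
% for a model with M parameters. Quantization widths (int8 weights, int4
% gradients) and the intra-node secondary weight copy are assumptions.
\begin{align*}
V_{\text{ZeRO-3}} &= \underbrace{M}_{\text{fwd weight all-gather}}
  + \underbrace{M}_{\text{bwd weight all-gather}}
  + \underbrace{M}_{\text{gradient reduce-scatter}} = 3M \\
V_{\text{ZeRO++}} &\approx \underbrace{0.5M}_{\text{int8 fwd all-gather}}
  + \underbrace{0}_{\text{bwd gather served intra-node}}
  + \underbrace{0.25M}_{\text{int4 gradient all-to-all}} = 0.75M \\
\frac{V_{\text{ZeRO-3}}}{V_{\text{ZeRO++}}} &= \frac{3M}{0.75M} = 4
\end{align*}
```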
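As an illustration of the first technique, the sketch below shows the general shape of a block-quantized all-gather in PyTorch: each rank quantizes its weight shard to int8 with one scale per block, all-gathers the int8 payload plus the small per-block scales, and dequantizes locally. This is a minimal sketch of the idea, not the ZeRO++ implementation; the function names, block size, and symmetric int8 scheme are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def block_quantize(x: torch.Tensor, block_size: int = 256):
    """Quantize a flat tensor to int8 with one symmetric scale per block."""
    blocks = x.float().reshape(-1, block_size)       # assumes numel % block_size == 0
    scales = blocks.abs().amax(dim=1, keepdim=True) / 127.0
    scales = scales.clamp(min=1e-8)                  # guard all-zero blocks
    q = (blocks / scales).round().clamp(-127, 127).to(torch.int8)
    return q, scales

def block_dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Invert block_quantize back to a flat float tensor."""
    return (q.float() * scales).reshape(-1)

def quantized_all_gather(shard: torch.Tensor, world_size: int) -> torch.Tensor:
    """All-gather a parameter shard in int8 rather than fp16, dequantizing locally."""
    q, scales = block_quantize(shard)
    q_out = [torch.empty_like(q) for _ in range(world_size)]
    s_out = [torch.empty_like(scales) for _ in range(world_size)]
    dist.all_gather(q_out, q)        # 1 byte per element on the wire instead of 2
    dist.all_gather(s_out, scales)   # small per-block metadata
    return torch.cat([block_dequantize(qi, si) for qi, si in zip(q_out, s_out)])
```

In a real system the quantization kernels and receive buffers would be fused and pre-allocated; the point of the sketch is only the roughly 2x reduction in bytes on the wire for the weight all-gather.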
Related papers
- Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch [66.84195842685459]
Training of large language models (LLMs) is typically distributed across a large number of co-located accelerators to reduce training time.
Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint.
We show experimentally that we can distribute the training of billion-parameter models and reach quality similar to before.
arXiv Detail & Related papers (2025-01-30T17:23:50Z) - Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning [2.685330831042324]
We propose a collection of communication and optimization strategies for ZeRO++ to reduce communication costs and improve memory utilization.
For a 20B GPT model, we observe a 1.71x increase in TFLOPS per GPU compared with ZeRO++, and a scaling efficiency of 0.94, at up to 384 GCDs.
arXiv Detail & Related papers (2025-01-08T04:19:57Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - Gravity-aligned Rotation Averaging with Circular Regression [53.81374943525774]
We introduce a principled approach that integrates gravity direction into the rotation averaging phase of global pipelines.
We achieve state-of-the-art accuracy on four large-scale datasets.
arXiv Detail & Related papers (2024-10-16T17:37:43Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising quality.
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - T3: Transparent Tracking & Triggering for Fine-grained Overlap of
Compute & Collectives [1.908240145212707]
Large Language Models increasingly rely on distributed techniques for their training and inference.
Such techniques inherently serialize communication with model execution.
One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner.
We propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute.
arXiv Detail & Related papers (2024-01-30T01:55:34Z) - Rethinking Memory and Communication Cost for Efficient Large Language
Model Training [25.640899145028296]
We rethink the impact of memory consumption and communication costs on the training speed of large language models.
Our experiments demonstrate that PaRO significantly improves training throughput by 1.19x-2.50x compared to the SOTA method.
The HO-Ring algorithm improves communication efficiency by 36.5% compared to the traditional Ring algorithm.
arXiv Detail & Related papers (2023-10-09T15:08:32Z) - Communication-Efficient TeraByte-Scale Model Training Framework for
Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in the existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z) - Synthesizing Collective Communication Algorithms for Heterogeneous
Networks with TACCL [1.5528708400965123]
We present TACCL, a synthesizer for collective communication primitives for large-scale multi-GPU systems.
TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms.
Using TACCL's algorithms speeds up the end-to-end training of an internal mixture-of-experts model by 17%.
arXiv Detail & Related papers (2021-11-08T23:20:52Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with the others and updates the parameters using the average of all workers' gradients.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.