Synthesizing Collective Communication Algorithms for Heterogeneous
Networks with TACCL
- URL: http://arxiv.org/abs/2111.04867v1
- Date: Mon, 8 Nov 2021 23:20:52 GMT
- Title: Synthesizing Collective Communication Algorithms for Heterogeneous
Networks with TACCL
- Authors: Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan
Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, Rachee Singh
- Abstract summary: We present TACCL, a synthesizer for collective communication primitives for large-scale multi-GPU systems.
TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms.
Using TACCL's algorithms speeds up the end-to-end training of an internal mixture of experts model by $17\%$.
- Score: 1.5528708400965123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large ML models and datasets have necessitated the use of multi-GPU systems
for distributed model training. To harness the power offered by multi-GPU
systems, it is critical to eliminate bottlenecks in inter-GPU communication - a
problem made challenging by the heterogeneous nature of interconnects. In this
work, we present TACCL, a synthesizer for collective communication primitives
for large-scale multi-GPU systems. TACCL encodes a profiled topology and input
size into a synthesis problem to generate optimized communication algorithms.
TACCL is built on top of the standard NVIDIA Collective Communication Library
(NCCL), allowing it to be a drop-in replacement for GPU communication in
frameworks like PyTorch with minimal changes. TACCL generates algorithms for
communication primitives like Allgather, Alltoall, and Allreduce that are up to
$3\times$ faster than NCCL. Using TACCL's algorithms speeds up the end-to-end
training of an internal mixture of experts model by $17\%$. By decomposing the
optimization problem into parts and leveraging the symmetry in multi-GPU
topologies, TACCL synthesizes collectives for up to 80 GPUs in less than 3
minutes, at least two orders of magnitude faster than other synthesis-based
state-of-the-art collective communication libraries.
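To picture what "encoding a profiled topology and input size into a synthesis problem" means, here is a deliberately tiny, hedged sketch: a brute-force search over ring orders under a toy cost model. Every detail in it (the link costs, the cost model, the exhaustive search) is an illustrative assumption, not TACCL's actual solver-based formulation.

```python
# Toy sketch only: choose a ring order for a 3-GPU Allgather from profiled
# per-link costs. The numbers are contrived (and asymmetric per direction)
# purely so the search has something to decide; this is not TACCL's encoding.
import itertools

# "Profiled topology": modeled cost of sending one chunk over each directed link.
link_cost = {
    (0, 1): 1.0, (1, 2): 1.0, (2, 0): 4.0,   # going one way around the ring
    (1, 0): 2.0, (2, 1): 2.0, (0, 2): 2.0,   # going the other way
}
gpus = [0, 1, 2]

def ring_allgather_time(order):
    """Ring Allgather over `order`: in each of n-1 steps every GPU forwards one
    chunk to its successor, so each step lasts as long as the slowest hop."""
    n = len(order)
    bottleneck = max(link_cost[(order[i], order[(i + 1) % n])] for i in range(n))
    return (n - 1) * bottleneck

# "Synthesis" here is brute-force search over ring orders; TACCL instead hands a
# much richer formulation (chunks, links, time steps, input size) to a solver and
# relies on problem decomposition plus topology symmetry to stay tractable.
best = min(itertools.permutations(gpus), key=ring_allgather_time)
print("best ring:", best, "modeled time:", ring_allgather_time(best))
```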
Related papers
- Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning [2.685330831042324]
We propose a collection of communication and optimization strategies for ZeRO++ to reduce communication costs and improve memory utilization.
For a 20B GPT model, we observe a 1.71x increase in TFLOPS per GPU compared with ZeRO++ and a scaling efficiency of 0.94, both at up to 384 GCDs.
arXiv Detail & Related papers (2025-01-08T04:19:57Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion [9.5114389643299]
This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs.
Flux can potentially overlap up to 96% of communication given a fused kernel.
Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects (a generic sketch of the overlap idea follows).
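To clarify the general notion of overlapping communication with computation that Flux pushes further, here is a generic sketch using PyTorch's standard asynchronous collectives. It is not Flux's kernel-fusion method (which can hide communication even behind dependent computation); the function name and shapes are hypothetical.

```python
# Generic communication/computation overlap with PyTorch's async collectives.
# NOT Flux's technique: this only hides the collective behind *independent*
# work, whereas Flux fuses kernels to overlap at a much finer granularity.
import torch
import torch.distributed as dist

def overlapped_step(grad_bucket, x, w):
    # Assumes dist.init_process_group("nccl") has already been called.
    handle = dist.all_reduce(grad_bucket, async_op=True)  # proceeds on NCCL's stream
    y = x @ w            # independent computation runs concurrently
    handle.wait()        # block only when the reduced bucket is actually needed
    return y, grad_bucket
```

The overlap in this generic form only helps when the independent computation is long enough to cover the collective, which is the limitation the fused-kernel approach is aimed at.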
arXiv Detail & Related papers (2024-06-11T00:17:39Z) - CORE: Common Random Reconstruction for Distributed Optimization with
Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up the number of machines.
We propose Common Random Reconstruction (CORE), which can be used to compress the information transmitted between machines.
arXiv Detail & Related papers (2023-09-23T08:45:27Z) - TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning [9.196825913937472]
This paper presents TACOS, an autonomous synthesizer capable of automatically generating topology-aware collective algorithms.
TACOS is highly flexible, synthesizing an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds.
It achieves up to a 4.27x performance improvement over state-of-the-art synthesizers.
arXiv Detail & Related papers (2023-04-11T15:50:54Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\left(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\right)$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds (a toy sketch of the hash-table idea follows).
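For the hash-table idea in this entry, a toy sketch of a multiresolution hash encoding is shown below. The table sizes, the hashing primes, and the omission of corner interpolation are simplifying assumptions, not the paper's exact configuration.

```python
# Toy multiresolution hash encoding sketch (illustrative sizes and primes):
# each level owns a small trainable table; a 3D point indexes each table via
# a spatial hash of its grid cell. Real implementations also interpolate
# between neighboring cells, which is skipped here for brevity.
import torch

LEVELS, TABLE_SIZE, FEAT_DIM, BASE_RES = 4, 2 ** 14, 2, 16
tables = torch.nn.Parameter(1e-4 * torch.randn(LEVELS, TABLE_SIZE, FEAT_DIM))
PRIMES = torch.tensor([1, 2654435761, 805459861], dtype=torch.long)

def encode(xyz):
    """xyz: (N, 3) coordinates in [0, 1); returns (N, LEVELS * FEAT_DIM)."""
    feats = []
    for level in range(LEVELS):
        res = BASE_RES * (2 ** level)        # finer grid at each level
        cell = (xyz * res).long()            # integer grid cell per point
        h = cell * PRIMES                    # per-axis products for the hash
        idx = (h[:, 0] ^ h[:, 1] ^ h[:, 2]) % TABLE_SIZE
        feats.append(tables[level][idx])     # gather trainable features
    return torch.cat(feats, dim=-1)          # concatenate across levels

# The concatenated features feed a small MLP; tables and MLP train end to end.
print(encode(torch.rand(5, 3)).shape)        # torch.Size([5, 8])
```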
arXiv Detail & Related papers (2022-01-16T07:22:47Z) - Communication-Efficient TeraByte-Scale Model Training Framework for
Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - DistGNN: Scalable Distributed Training for Large-Scale Graph Neural
Networks [58.48833325238537]
Full-batch training of Graph Neural Networks (GNNs) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.