Themis: A Network Bandwidth-Aware Collective Scheduling Policy for
Distributed Training of DL Models
- URL: http://arxiv.org/abs/2110.04478v1
- Date: Sat, 9 Oct 2021 06:50:04 GMT
- Title: Themis: A Network Bandwidth-Aware Collective Scheduling Policy for
Distributed Training of DL Models
- Authors: Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan,
Tushar Krishna
- Abstract summary: Distributed training is a solution to reduce training time by splitting the task across multiple NPUs.
Themis is a novel collective scheduling scheme that dynamically schedules collectives to balance the communication loads across all dimensions.
Our results show that on average, Themis can improve the network BW utilization of single All-Reduce by 1.88x (2.92x max)
- Score: 2.6599014990168834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The continuous growth in both size and training data for modern Deep Neural
Networks (DNNs) models has led to training tasks taking days or even months.
Distributed training is a solution to reduce training time by splitting the
task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds
communication overhead between the NPUs in order to synchronize the gradients
and/or activations, depending on the parallelization strategy. In today's
datacenters, for training at scale, NPUs are connected through
multi-dimensional interconnection links with different bandwidth and latency.
Hence, keeping all network dimensions busy and maximizing the network BW is a
challenging task in such a hybrid network environment, as this work identifies.
We propose Themis, a novel collective scheduling scheme that dynamically
schedules collectives (divided into chunks) to balance the communication loads
across all dimensions, further improving the network BW utilization. Our
results show that on average, Themis can improve the network BW utilization of
a single All-Reduce by 1.88x (2.92x max), and improve the end-to-end training
iteration performance of real workloads such as ResNet-50, GNMT, DLRM, and
Transformer-1T by 1.49x (1.96x max), 1.41x (1.81x max), 1.42x (1.80x max), and
1.35x (1.78x max), respectively.
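The abstract describes Themis only at a high level. As a rough illustration of bandwidth-aware chunk scheduling over a multi-dimensional network (a greedy least-loaded heuristic, not the authors' actual algorithm; all names and numbers below are assumptions), consider:

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    """One dimension of a hierarchical NPU interconnect (e.g., intra-package, intra-node, scale-out)."""
    name: str
    bandwidth_gbps: float       # link bandwidth of this dimension
    busy_time_s: float = 0.0    # communication time already committed to this dimension

def schedule_chunks(chunk_sizes_gb, dimensions):
    """Greedy sketch: place each chunk's next communication phase on the dimension
    that would finish it earliest, so no dimension sits idle while another is overloaded."""
    placement = []
    for size_gb in chunk_sizes_gb:
        best = min(dimensions, key=lambda d: d.busy_time_s + size_gb * 8 / d.bandwidth_gbps)
        best.busy_time_s += size_gb * 8 / best.bandwidth_gbps
        placement.append((size_gb, best.name))
    return placement

if __name__ == "__main__":
    dims = [Dimension("intra-node", 300.0), Dimension("scale-out", 100.0)]
    chunks = [0.25] * 8   # a 2 GB All-Reduce divided into 8 chunks
    for size_gb, dim in schedule_chunks(chunks, dims):
        print(f"{size_gb} GB chunk -> {dim}")
```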
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method achieve a 1.45-9.39x speedup over baseline methods while ensuring convergence.
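The summary does not spell out the adaptive compression scheme; below is a generic sketch of bandwidth-adaptive gradient quantization (the threshold, bit-widths, and function names are assumptions, not necessarily FusionLLM's method):

```python
import numpy as np

def adaptive_quantize(grad, link_bandwidth_gbps, slow_link_threshold_gbps=1.0):
    """Pick a coarser quantization for slow (e.g., wide-area) links and a finer one
    for fast links. A generic sketch, not necessarily FusionLLM's actual scheme."""
    bits = 4 if link_bandwidth_gbps < slow_link_threshold_gbps else 8
    max_abs = float(np.abs(grad).max())
    scale = max_abs / (2 ** (bits - 1) - 1) if max_abs > 0 else 1.0
    q = np.round(grad / scale).astype(np.int8)   # 4- and 8-bit levels both fit in int8 here
    return q, scale, bits

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```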
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models [7.605379124802678]
Distributed Deep Neural Network (DNN) training is a technique to reduce the training overhead by distributing the training task across multiple accelerators.
We propose FRED, a wafer-scale interconnect that is tailored for the high-BW requirements of wafer-scale networks.
Our results show that FRED can improve the average end-to-end training time of ResNet-152, Transformer-17B, GPT-3, and Transformer-1T by 1.76x, 1.87x, 1.34x, and 1.4x, respectively.
arXiv Detail & Related papers (2024-06-28T00:05:53Z)
- Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch [72.26822499434446]
Auto-Train-Once (ATO) is an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs.
We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures.
arXiv Detail & Related papers (2024-03-21T02:33:37Z)
- BLoad: Enhancing Neural Network Training with Efficient Sequential Data Handling [8.859850475075238]
We propose a novel training scheme that enables efficient distributed data-parallel training on sequences of different sizes with minimal overhead.
By using this scheme, we were able to reduce the padding amount by more than 100x without deleting a single frame, improving both training time and Recall.
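The summary only states that padding is reduced without dropping frames; a minimal first-fit-decreasing packing sketch of that general idea (not necessarily BLoad's exact scheme) could look like:

```python
def pack_sequences(lengths, block_size):
    """Pack variable-length sequences into fixed-size blocks so that padding is
    minimized and no frame is dropped (first-fit-decreasing heuristic for illustration)."""
    blocks = []   # each block is a list of sequence indices whose lengths fit in block_size
    free = []     # remaining capacity of each block
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        length = lengths[idx]
        for b, cap in enumerate(free):
            if length <= cap:
                blocks[b].append(idx)
                free[b] -= length
                break
        else:
            blocks.append([idx])
            free.append(block_size - length)
    return blocks, sum(free)   # second value = total padded frames

if __name__ == "__main__":
    blocks, pad = pack_sequences([37, 512, 100, 80, 300, 12, 450], block_size=512)
    print(blocks, "padded frames:", pad)
```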
arXiv Detail & Related papers (2023-10-16T23:14:56Z)
- DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining [22.168137965177284]
Communication scheduling has been shown to be effective in accelerating distributed training.
We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations.
We show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions.
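As summarized above, DeAR's starting point is that All-Reduce can be decoupled into Reduce-Scatter followed by All-Gather, whose completions can be awaited separately. A minimal PyTorch sketch of that decomposition (DeAR's actual scheduling and pipelining logic is not reproduced; assumes an initialized process group, e.g. under torchrun):

```python
import torch
import torch.distributed as dist

def decoupled_all_reduce(grad: torch.Tensor):
    """All-Reduce expressed as Reduce-Scatter + All-Gather. Because each half returns
    its own handle, the second half can be overlapped with other work, which is the
    property fine-grained pipelining schemes such as DeAR build on."""
    world = dist.get_world_size()
    assert grad.numel() % world == 0, "pad the tensor so it splits evenly across ranks"
    shards = [s.clone() for s in grad.view(world, -1).unbind(0)]   # one shard per rank
    my_shard = torch.empty_like(shards[0])
    dist.reduce_scatter(my_shard, shards, async_op=True).wait()    # my shard now holds the global sum
    dist.all_gather(shards, my_shard, async_op=True).wait()        # every rank collects all shards
    grad.copy_(torch.cat(shards).view_as(grad))
    return grad
```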
arXiv Detail & Related papers (2023-02-24T04:11:18Z)
- Accelerating Neural Network Training with Distributed Asynchronous and Selective Optimization (DASO) [0.0]
We introduce the Distributed Asynchronous and Selective Optimization (DASO) method to accelerate network training.
DASO uses a hierarchical and asynchronous communication scheme comprised of node-local and global networks.
We show that DASO yields a reduction in training time of up to 34% on classical and state-of-the-art networks.
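As a rough sketch of the node-local vs. global split described above (group construction and the synchronization interval are assumptions; the paper's actual protocol is asynchronous and more involved):

```python
import torch.distributed as dist

def make_groups(gpus_per_node: int):
    """Build one intra-node process group per node plus a global group of one
    representative rank per node (every rank must take part in every new_group call)."""
    world, rank = dist.get_world_size(), dist.get_rank()
    local_group = None
    for node in range(world // gpus_per_node):
        ranks = list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        group = dist.new_group(ranks)
        if rank in ranks:
            local_group = group
    global_group = dist.new_group(list(range(0, world, gpus_per_node)))
    return local_group, global_group

def hierarchical_average(params, step, local_group, global_group, global_every=4):
    for p in params:
        dist.all_reduce(p.data, group=local_group)            # cheap intra-node averaging
        p.data /= dist.get_world_size(group=local_group)
    if step % global_every == 0 and dist.get_rank(group=global_group) >= 0:
        for p in params:
            dist.all_reduce(p.data, group=global_group)       # occasional inter-node averaging
            p.data /= dist.get_world_size(group=global_group)
```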
arXiv Detail & Related papers (2021-04-12T16:02:20Z)
- All at Once Network Quantization via Collaborative Knowledge Transfer [56.95849086170461]
We develop a novel collaborative knowledge transfer approach for efficiently training the all-at-once quantization network.
Specifically, we propose an adaptive selection strategy to choose a high-precision "teacher" for transferring knowledge to the low-precision student.
To effectively transfer knowledge, we develop a dynamic block swapping method by randomly replacing the blocks in the lower-precision student network with the corresponding blocks in the higher-precision teacher network.
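A minimal sketch of the dynamic block swapping idea described above (block granularity and swap probability are assumptions for illustration):

```python
import random
import torch.nn as nn

def swapped_forward(x, student_blocks: nn.ModuleList, teacher_blocks: nn.ModuleList,
                    swap_prob: float = 0.5):
    """Forward pass where each block position is served either by the low-precision
    student or by the matching high-precision teacher block, chosen at random."""
    for s_block, t_block in zip(student_blocks, teacher_blocks):
        use_teacher = random.random() < swap_prob
        # Teacher parameters are assumed frozen (requires_grad=False), so routing
        # through a teacher block still lets gradients reach the other student blocks.
        x = t_block(x) if use_teacher else s_block(x)
    return x
```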
arXiv Detail & Related papers (2021-03-02T03:09:03Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
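The summary gives only the compression ratios; a generic threshold-based sparsifier of the kind DCT's name suggests (the dynamic threshold selection used in the paper is not reproduced) might look like:

```python
import numpy as np

def threshold_compress(tensor, keep_fraction=0.01):
    """Keep only entries whose magnitude exceeds a threshold chosen so that roughly
    keep_fraction of the values (here ~1%, i.e. ~100x less traffic) are transmitted."""
    flat = tensor.ravel()
    k = max(1, int(flat.size * keep_fraction))
    threshold = np.partition(np.abs(flat), -k)[-k]   # k-th largest magnitude
    mask = np.abs(flat) >= threshold
    return np.nonzero(mask)[0], flat[mask]           # indices + surviving values

def threshold_decompress(indices, values, shape):
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    out[indices] = values
    return out.reshape(shape)
```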
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- Is Network the Bottleneck of Distributed Training? [36.925680383195356]
We take a first-principles approach to measure and analyze the network performance of distributed training.
We find that the network is running at low utilization and that if the network can be fully utilized, distributed training can achieve a scaling factor of close to one.
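For reference, the scaling factor mentioned above is conventionally the throughput with N workers divided by N times the single-worker throughput (the paper's exact definition may differ slightly):

```python
def scaling_factor(throughput_n_workers, throughput_one_worker, n_workers):
    """1.0 means perfectly linear scaling; lower values indicate communication
    (or other) overhead."""
    return throughput_n_workers / (n_workers * throughput_one_worker)

# Example with made-up numbers: 8 workers at 2,800 img/s vs. 1 worker at 400 img/s
print(scaling_factor(2800, 400, 8))   # -> 0.875
```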
arXiv Detail & Related papers (2020-06-17T19:00:31Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale problems where the predictive model is a deep neural network.
Our algorithm requires far fewer communication rounds in practice while retaining a theoretical bound on the number of rounds.
Our experiments on several datasets demonstrate the effectiveness of our method and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
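The partitioned edge learning paper above formulates parameter and bandwidth allocation as a joint optimization; as a purely illustrative baseline (a proportional heuristic with made-up units, not the paper's algorithm), one might split load and bandwidth like this:

```python
def proportional_allocation(compute_speeds, total_params, total_bandwidth):
    """Toy heuristic: give each device a parameter block proportional to its compute
    speed, then bandwidth proportional to its block size so upload times are roughly
    balanced. The paper instead solves the joint problem as an optimization."""
    total_speed = sum(compute_speeds)
    params = [total_params * s / total_speed for s in compute_speeds]
    bandwidth = [total_bandwidth * p / total_params for p in params]
    return list(zip(params, bandwidth))

if __name__ == "__main__":
    for p, bw in proportional_allocation([1.0, 2.0, 4.0], total_params=7e6, total_bandwidth=100.0):
        print(f"{p:.0f} parameters, {bw:.1f} units of bandwidth")
```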
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.