FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
- URL: http://arxiv.org/abs/2410.12707v1
- Date: Wed, 16 Oct 2024 16:13:19 GMT
- Title: FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
- Authors: Zhenheng Tang, Xueze Kang, Yiming Yin, Xinglin Pan, Yuxin Wang, Xin He, Qiang Wang, Rongfei Zeng, Kaiyong Zhao, Shaohuai Shi, Amelie Chi Zhou, Bo Li, Bingsheng He, Xiaowen Chu,
- Abstract summary: Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
- Score: 55.992528247880685
- License:
- Abstract: To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents the operator in the DNNs, while the edge represents the data dependency between operators. Based on this design, 1) users are allowed to customize any DNN without caring low-level operator implementation; 2) we enable the task scheduling with the more fine-grained sub-tasks, offering more optimization space; 3) a DAG runtime executor can implement RAD withour requiring the consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler to cluster devices with similar bandwidths together and partition the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected with 8 Mbps~10 Gbps networks. Experimental results demonstrate that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
Related papers
- OTOv3: Automatic Architecture-Agnostic Neural Network Training and
Compression from Structured Pruning to Erasing Operators [57.145175475579315]
This topic spans various techniques, from structured pruning to neural architecture search, encompassing both pruning and erasing operators perspectives.
We introduce the third-generation Only-Train-Once (OTOv3), which first automatically trains and compresses a general DNN through pruning and erasing operations.
Our empirical results demonstrate the efficacy of OTOv3 across various benchmarks in structured pruning and neural architecture search.
arXiv Detail & Related papers (2023-12-15T00:22:55Z) - Accelerating Split Federated Learning over Wireless Communication
Networks [17.97006656280742]
We consider a split federated learning (SFL) framework that combines the parallel model training mechanism of federated learning (FL) and the model splitting structure of split learning (SL)
We formulate a joint problem of split point selection and bandwidth allocation to minimize the system latency.
Experiment results demonstrate the superiority of our work in latency reduction and accuracy improvement.
arXiv Detail & Related papers (2023-10-24T07:49:56Z) - T-GAE: Transferable Graph Autoencoder for Network Alignment [79.89704126746204]
T-GAE is a graph autoencoder framework that leverages transferability and stability of GNNs to achieve efficient network alignment without retraining.
Our experiments demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively.
arXiv Detail & Related papers (2023-10-05T02:58:29Z) - DiviML: A Module-based Heuristic for Mapping Neural Networks onto
Heterogeneous Platforms [5.970091958678456]
We develop an approach for compiler-level partitioning of deep neural networks (DNNs) onto multiple interconnected hardware devices.
Our scheduler integrates both an exact solver, through a mixed integer linear programming (MILP) formulation, and a modularity-based runtime.
We show how we can extend our framework to schedule large language models across multiple heterogeneous servers.
arXiv Detail & Related papers (2023-07-31T19:46:49Z) - Expediting Distributed DNN Training with Device Topology-Aware Graph
Deployment [18.021259939659874]
TAG is an automatic system to derive optimized DNN training graph and its deployment onto any device topology.
We show that it can achieve up to 4.56x training speed-up as compared to existing schemes.
arXiv Detail & Related papers (2023-02-13T06:30:24Z) - FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs)
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Accelerating Distributed K-FAC with Smart Parallelism of Computing and
Communication Tasks [13.552262050816616]
Kronecker-Factored Approximate Curvature (KFAC) is one of the most efficient approximation algorithms for training deep models.
Yet, when leveraging GPU clusters to train models with KFAC, it incurs extensive computation as well as introduces extra communications during each iteration.
We propose D-KFAC with smart parallelism of computing and communication tasks to reduce the iteration time.
arXiv Detail & Related papers (2021-07-14T08:01:07Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study a distributed variable for large-scale AUC for a neural network as with a deep neural network.
Our model requires a much less number of communication rounds and still a number of communication rounds in theory.
Our experiments on several datasets show the effectiveness of our theory and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.