BAGUA: Scaling up Distributed Learning with System Relaxations
- URL: http://arxiv.org/abs/2107.01499v2
- Date: Tue, 6 Jul 2021 08:18:02 GMT
- Title: BAGUA: Scaling up Distributed Learning with System Relaxations
- Authors: Shaoduo Gan, Xiangru Lian, Rui Wang, Jianbin Chang, Chengjun Liu,
Hongmei Shi, Shengzhuo Zhang, Xianghong Li, Tengxu Sun, Jiawei Jiang, Binhang
Yuan, Sen Yang, Ji Liu, Ce Zhang
- Abstract summary: BAGUA is a new communication framework for distributed data-parallel training.
Its system design makes it straightforward to implement and extend a variety of state-of-the-art distributed learning algorithms.
In a production cluster with up to 16 machines, BAGUA outperforms PyTorch-DDP, Horovod, and BytePS in end-to-end training time.
- Score: 31.500494636704598
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent years have witnessed a growing list of systems for distributed
data-parallel training. Existing systems largely fit into two paradigms, i.e.,
parameter server and MPI-style collective operations. On the algorithmic side,
researchers have proposed a wide range of techniques to lower the communication
cost via system relaxations: quantization, decentralization, and communication
delay. However, most, if not all, existing systems rely only on standard
synchronous and asynchronous stochastic gradient (SG) based optimization and
therefore cannot take advantage of all the optimizations that the machine
learning community has been developing recently. Given this emerging gap
between the current landscapes of systems and theory, we build BAGUA, a
communication framework whose design goal is to provide a system abstraction
that is both flexible and modular to support state-of-the-art system relaxation
techniques of distributed training. Powered by this new system design, BAGUA
makes it easy to implement and extend various state-of-the-art distributed
learning algorithms. In a production cluster with up to 16 machines (128 GPUs),
BAGUA can outperform PyTorch-DDP, Horovod and BytePS in the end-to-end training
time by a significant margin (up to 1.95 times) across a diverse range of
tasks. Moreover, we conduct a rigorous tradeoff exploration showing that
different algorithms and system relaxations achieve the best performance under
different network conditions.
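As a concrete illustration of what a "system relaxation" looks like in practice, below is a minimal, self-contained sketch of lossy gradient quantization applied before the cross-worker averaging step of data-parallel SGD. This is not BAGUA's actual API: the function names (quantize_int8, quantized_average) are hypothetical, and the workers are simulated locally with NumPy arrays rather than exchanging tensors over a network.

```python
# A minimal sketch of gradient quantization as a communication relaxation.
# NOT BAGUA's API: quantize_int8 / quantized_average are hypothetical names,
# and the "workers" are simulated locally with NumPy arrays.
import numpy as np


def quantize_int8(grad: np.ndarray):
    """Uniformly quantize a float32 gradient into int8 plus a per-tensor scale."""
    scale = float(np.abs(grad).max()) / 127.0 + 1e-12  # guard against all-zero grads
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


def quantized_average(worker_grads):
    """Average per-worker gradients while 'communicating' only int8 payloads."""
    decoded = [dequantize_int8(*quantize_int8(g)) for g in worker_grads]
    return np.mean(decoded, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(1024).astype(np.float32) for _ in range(8)]
    exact = np.mean(grads, axis=0)        # full-precision averaging
    approx = quantized_average(grads)     # relaxed (quantized) averaging
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"relative error introduced by int8 quantization: {rel_err:.4f}")
```

The point of such a relaxation is the tradeoff it exposes: each worker ships roughly 4x fewer bytes per step (int8 instead of float32) in exchange for a small, bounded approximation error in the averaged gradient.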
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Multi-Job Intelligent Scheduling with Cross-Device Federated Learning [65.69079337653994]
Federated Learning (FL) enables collaborative global machine learning model training without sharing sensitive raw data.
We propose a novel multi-job FL framework, which enables the training process of multiple jobs in parallel.
We propose a novel intelligent scheduling approach based on multiple scheduling methods, including an original reinforcement learning-based method and an original Bayesian optimization-based method.
arXiv Detail & Related papers (2022-11-24T06:17:40Z)
- Supernet Training for Federated Image Classification under System Heterogeneity [15.2292571922932]
In this work, we propose a novel framework to consider both scenarios, namely Federation of Supernet Training (FedSup).
It is inspired by how averaging parameters in the model aggregation stage of Federated Learning (FL) is similar to weight-sharing in supernet training.
Under our framework, we present an efficient algorithm (E-FedSup) by sending the sub-model to clients in the broadcast stage for reducing communication costs and training overhead.
arXiv Detail & Related papers (2022-06-03T02:21:01Z)
- Decentralized Training of Foundation Models in Heterogeneous Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z)
- Applications of Deep Learning to the Design of Enhanced Wireless Communication Systems [0.0]
Deep learning (DL)-based systems are able to handle increasingly complex tasks for which no tractable models are available.
This thesis aims at comparing different approaches to unlock the full potential of DL in the physical layer.
arXiv Detail & Related papers (2022-05-02T21:02:14Z)
- FLoBC: A Decentralized Blockchain-Based Federated Learning Framework [0.0]
In this work, we demonstrate our solution for building a generic decentralized federated learning system using blockchain technology.
We present our system design comprising the two decentralized actors: trainer and validator, alongside our methodology for ensuring reliable and efficient operation.
Finally, we utilize FLoBC as an experimental sandbox to compare and contrast the effects of trainer-to-validator ratio, reward-penalty policy, and model synchronization schemes on the overall system performance.
arXiv Detail & Related papers (2021-12-22T13:36:49Z)
- Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z)
- Deep Multi-Task Learning for Cooperative NOMA: System Design and Principles [52.79089414630366]
We develop a novel deep cooperative NOMA scheme, drawing upon recent advances in deep learning (DL).
We develop a novel hybrid-cascaded deep neural network (DNN) architecture such that the entire system can be optimized in a holistic manner.
arXiv Detail & Related papers (2020-07-27T12:38:37Z)