Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices
- URL: http://arxiv.org/abs/2408.08015v1
- Date: Thu, 15 Aug 2024 08:25:50 GMT
- Title: Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices
- Authors: Shengyuan Ye, Liekang Zeng, Xiaowen Chu, Guoliang Xing, Xu Chen
- Abstract summary: On-device Deep Neural Network (DNN) training has been recognized as crucial for privacy-preserving machine learning at the edge.
We propose Asteroid, a distributed edge training system that breaks the resource walls across heterogeneous edge devices for efficient model training acceleration.
- Score: 13.24437638911459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-device Deep Neural Network (DNN) training has been recognized as crucial for privacy-preserving machine learning at the edge. However, the intensive training workload and limited onboard computing resources pose significant challenges to the availability and efficiency of model training. While existing works address these challenges through native resource management optimization, we instead leverage our observation that edge environments usually comprise a rich set of accompanying trusted edge devices with idle resources beyond a single terminal. We propose Asteroid, a distributed edge training system that breaks the resource walls across heterogeneous edge devices for efficient model training acceleration. Asteroid adopts a hybrid pipeline parallelism to orchestrate distributed training, along with a judicious parallelism planning for maximizing throughput under certain resource constraints. Furthermore, a fault-tolerant yet lightweight pipeline replay mechanism is developed to tame the device-level dynamics for training robustness and performance stability. We implement Asteroid on heterogeneous edge devices with both vision and language models; evaluations demonstrate training up to 12.2x faster than conventional parallelism methods and 2.1x faster than state-of-the-art hybrid parallelism methods. Furthermore, Asteroid can recover the training pipeline 14x faster than baseline methods while preserving comparable throughput despite unexpected device exits and failures.
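The abstract mentions hybrid pipeline parallelism with a throughput-maximizing parallelism plan but does not spell out the planning algorithm. The sketch below is a minimal illustration of one core sub-problem such a planner faces, not Asteroid's actual method: partitioning a model's layers into contiguous pipeline stages across edge devices with different compute speeds so that the slowest stage, which bounds steady-state pipeline throughput, is as fast as possible. All names (`layer_flops`, `device_speeds`, `plan_stages`) are hypothetical, and communication cost and memory limits are ignored for brevity.

```python
"""Illustrative sketch (not Asteroid's planner): split a model's layers into
contiguous pipeline stages over heterogeneous devices, minimizing the time of
the slowest stage (the pipeline bottleneck)."""

from functools import lru_cache
from typing import List, Tuple


def plan_stages(layer_flops: List[float],
                device_speeds: List[float]) -> Tuple[float, List[List[int]]]:
    """Assign layers (in order) to len(device_speeds) stages, one per device.

    Stage i's time is sum(flops of its layers) / device_speeds[i].
    Returns the minimized bottleneck time and the per-stage layer indices.
    """
    n, k = len(layer_flops), len(device_speeds)
    prefix = [0.0]
    for f in layer_flops:
        prefix.append(prefix[-1] + f)

    @lru_cache(maxsize=None)
    def best(start: int, dev: int) -> Tuple[float, Tuple[int, ...]]:
        # Assign layers[start:] to devices[dev:]; return (bottleneck, cut points).
        if dev == k - 1:  # last device takes every remaining layer
            return (prefix[n] - prefix[start]) / device_speeds[dev], (n,)
        best_bottleneck, best_cuts = float("inf"), ()
        # Leave at least one layer for each of the remaining devices.
        for end in range(start + 1, n - (k - dev - 1) + 1):
            stage_time = (prefix[end] - prefix[start]) / device_speeds[dev]
            rest_bottleneck, rest_cuts = best(end, dev + 1)
            bottleneck = max(stage_time, rest_bottleneck)
            if bottleneck < best_bottleneck:
                best_bottleneck, best_cuts = bottleneck, (end,) + rest_cuts
        return best_bottleneck, best_cuts

    bottleneck, cuts = best(0, 0)
    stages, start = [], 0
    for end in cuts:
        stages.append(list(range(start, end)))
        start = end
    return bottleneck, stages


if __name__ == "__main__":
    # Hypothetical example: 8 layers, 3 devices (one fast, two slower).
    layer_flops = [4, 2, 6, 3, 5, 2, 4, 2]
    device_speeds = [8.0, 2.0, 4.0]  # relative compute throughput
    bottleneck, stages = plan_stages(layer_flops, device_speeds)
    print(f"bottleneck stage time: {bottleneck:.2f}, stages: {stages}")
```

A real planner for a system like Asteroid would also weigh intra-stage data parallelism, per-device memory budgets, and inter-device bandwidth, none of which are modeled here.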
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Heterogeneity-Aware Resource Allocation and Topology Design for Hierarchical Federated Edge Learning [9.900317349372383]
Federated Learning (FL) provides a privacy-preserving framework for training machine learning models on mobile edge devices.
Traditional FL algorithms, e.g., FedAvg, impose a heavy communication workload on these devices.
We propose a two-tier HFEL system, where edge devices are connected to edge servers and edge servers are interconnected through peer-to-peer (P2P) edge backhauls.
Our goal is to enhance the training efficiency of the HFEL system through strategic resource allocation and topology design.
arXiv Detail & Related papers (2024-09-29T01:48:04Z)
- Online Pseudo-Zeroth-Order Training of Neuromorphic Spiking Neural Networks [69.2642802272367]
Brain-inspired neuromorphic computing with spiking neural networks (SNNs) is a promising energy-efficient computational approach.
Most recent methods leverage spatial and temporal backpropagation (BP), not adhering to neuromorphic properties.
We propose a novel method, online pseudo-zeroth-order (OPZO) training.
arXiv Detail & Related papers (2024-07-17T12:09:00Z)
- Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference [19.60655813679882]
Transformer-based models have unlocked a plethora of powerful intelligent applications at the edge.
Traditional deployment approaches offload the inference workloads to the remote cloud server.
We propose Galaxy, a collaborative edge AI system that breaks the resource walls across heterogeneous edge devices.
arXiv Detail & Related papers (2024-05-27T15:01:04Z)
- Implementation of Big AI Models for Wireless Networks with Collaborative Edge Computing [10.524645516703643]
Training big AI models poses significant challenges to edge devices.
Traditional approaches usually resort to aggregating training data and sending it to a remote cloud for centralized training.
We propose collaborative edge training, a novel training mechanism that orchestrates a group of trusted edge devices as a resource pool.
arXiv Detail & Related papers (2024-04-27T03:09:39Z)
- Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting more and more attention to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices during the whole process of the model training.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- Decentralized Training of Foundation Models in Heterogeneous Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline that aims to reduce the huge training overhead by squeezing the complex training-time block into a single convolution (a generic sketch of this folding idea follows this entry).
Compared with the state-of-the-art re-param models, OREPA reduces training-time memory cost by about 70% and accelerates training by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
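OREPA's full method is an online, two-stage re-parameterization; purely as a generic illustration of the underlying idea (not OREPA's implementation), the sketch below shows why a multi-branch block can be squeezed into a single convolution: convolution is linear in its kernel, so parallel branches fold into one merged kernel that reproduces the multi-branch output. PyTorch is used only for demonstration, and all shapes are made up.

```python
"""Generic structural re-parameterization sketch: fold a parallel
3x3 + 1x1 convolution block into a single 3x3 convolution."""

import torch
import torch.nn.functional as F

torch.manual_seed(0)
in_ch, out_ch = 4, 8

# Multi-branch training-time block: a 3x3 conv in parallel with a 1x1 conv (bias-free).
w3x3 = torch.randn(out_ch, in_ch, 3, 3)
w1x1 = torch.randn(out_ch, in_ch, 1, 1)

x = torch.randn(2, in_ch, 16, 16)
multi_branch = F.conv2d(x, w3x3, padding=1) + F.conv2d(x, w1x1, padding=0)

# Fold: zero-pad the 1x1 kernel to 3x3 (value at the center) and sum the kernels.
w_merged = w3x3 + F.pad(w1x1, [1, 1, 1, 1])
single_conv = F.conv2d(x, w_merged, padding=1)

# The single convolution reproduces the multi-branch output.
print(torch.allclose(multi_branch, single_conv, atol=1e-5))  # True
```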
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z)
- FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices [21.513786638743234]
FTPipeHD is a novel framework that trains deep learning models across heterogeneous devices.
It is shown that FTPipeHD is 6.8x faster in training than the state-of-the-art method when the computing capacity of the best device is 10x greater than that of the worst one.
arXiv Detail & Related papers (2021-10-06T14:00:22Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)