AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel
Training
- URL: http://arxiv.org/abs/2311.05827v1
- Date: Fri, 10 Nov 2023 02:18:33 GMT
- Title: AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel
Training
- Authors: Yuhao Chen, Yuxuan Yan, Qianqian Yang, Yuanchao Shu, Shibo He, Zhiguo
Shi, Jiming Chen
- Abstract summary: We propose AccEPT, an acceleration scheme for edge collaborative pipeline-parallel training.
In particular, we propose a lightweight adaptive latency predictor to accurately estimate the latency of each layer on different devices.
Our numerical results demonstrate that the proposed acceleration approach speeds up edge pipeline-parallel training by up to 3 times.
- Score: 22.107070114339038
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: It is usually infeasible to fit and train an entire large deep neural network
(DNN) model using a single edge device due to the limited resources. To
facilitate intelligent applications across edge devices, researchers have
proposed partitioning a large model into several sub-models, and deploying each
of them to a different edge device to collaboratively train a DNN model.
However, the communication overhead caused by the large amount of data
transmitted from one device to another during training, as well as the
sub-optimal partition point due to inaccurate prediction of the computation
latency at each edge device, can significantly slow down training. In this
paper, we propose AccEPT, an acceleration scheme for edge collaborative
pipeline-parallel training. In particular, we propose a lightweight adaptive
latency predictor to accurately estimate the computation latency of each layer
on different devices, which also adapts to unseen devices through continuous
learning. The proposed latency predictor therefore leads to better model
partitioning that balances the computation loads across participating devices.
Moreover, we propose a bit-level, computation-efficient data compression scheme
to compress the data transmitted between devices during training. Our numerical
results demonstrate that the proposed acceleration approach speeds up edge
pipeline-parallel training by up to 3 times in the considered experimental
settings.
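The abstract does not spell out how the latency predictor or the partition search works, so the following is a minimal sketch of one way such components could be built: a lightweight per-layer regression model updated online as new latency measurements arrive (continuous learning), whose predictions feed a simple search for partition points that balances compute load across pipeline stages. The layer features, the SGDRegressor stand-in, and the greedy split are assumptions for illustration, not the paper's actual design.

```python
# Minimal sketch (not AccEPT's implementation): an online per-layer latency
# predictor plus a greedy load-balancing partition over its predictions.
import numpy as np
from sklearn.linear_model import SGDRegressor


class AdaptiveLatencyPredictor:
    """Predicts per-layer compute latency on a device from simple layer features
    (e.g., FLOPs, parameter count, activation size) -- feature choice is assumed."""

    def __init__(self):
        # Online linear model; partial_fit allows continuous adaptation
        # as measurements from new (unseen) devices arrive.
        self.model = SGDRegressor(alpha=1e-4)
        self.fitted = False

    def predict(self, layer_features):
        if not self.fitted:
            return 1.0  # uninformed prior before any profiling data
        return float(self.model.predict(np.asarray([layer_features]))[0])

    def update(self, layer_features, measured_latency):
        # Continuous learning: refine the model with each measured layer latency.
        self.model.partial_fit(np.asarray([layer_features]), [measured_latency])
        self.fitted = True


def balanced_partition(per_layer_latency, num_devices):
    """Greedily split consecutive layers into pipeline stages with
    roughly equal predicted compute load; returns the cut indices."""
    target = sum(per_layer_latency) / num_devices
    cuts, acc = [], 0.0
    for i, t in enumerate(per_layer_latency[:-1]):
        acc += t
        if acc >= target and len(cuts) < num_devices - 1:
            cuts.append(i + 1)  # next stage starts at layer i + 1
            acc = 0.0
    return cuts
```

In this sketch each participating device would profile a few layers, call `update`, and the partition search would be re-run whenever the predictions change noticeably; the paper's predictor architecture and partitioning algorithm may differ.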
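The bit-level data compression scheme is likewise only described at a high level. The sketch below shows one plausible realization consistent with that description: uniform low-bit quantization with bitwise packing of the tensors exchanged between pipeline stages. The 4-bit default, the per-tensor scaling, and the packing layout are illustrative assumptions, not AccEPT's actual scheme.

```python
# Minimal sketch (assumed, not AccEPT's exact scheme): uniform low-bit
# quantization and bit packing of tensors sent between pipeline stages.
import numpy as np


def compress(tensor, bits=4):
    """Quantize a float tensor to `bits`-bit integers and pack the bits into bytes."""
    levels = (1 << bits) - 1
    t_min, t_max = float(tensor.min()), float(tensor.max())
    scale = (t_max - t_min) / levels if t_max > t_min else 1.0
    q = np.round((tensor - t_min) / scale).astype(np.uint8)   # integers in [0, levels]
    # Keep only the low `bits` bits of every value and pack them contiguously.
    bit_matrix = np.unpackbits(q[..., None], axis=-1)[..., -bits:]
    payload = np.packbits(bit_matrix)
    return payload, tensor.shape, t_min, scale


def decompress(payload, shape, t_min, scale, bits=4):
    """Unpack the bit stream and map the integers back to approximate float values."""
    n = int(np.prod(shape))
    bit_matrix = np.unpackbits(payload)[: n * bits].reshape(n, bits)
    q = np.packbits(bit_matrix, axis=-1) >> (8 - bits)        # recover the integers
    return q.reshape(shape).astype(np.float32) * scale + t_min
```

For float32 activations or gradients this gives roughly a 32/bits compression ratio plus a few bytes of metadata per tensor; error feedback or finer-grained (e.g., per-channel) scaling could be layered on top, and the actual AccEPT scheme is likely more elaborate.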
Related papers
- Federated Split Learning with Model Pruning and Gradient Quantization in Wireless Networks [7.439160287320074]
Federated split learning (FedSL) implements collaborative training across the edge devices and the server through model splitting.
We propose a lightweight FedSL scheme that further alleviates the training burden on resource-constrained edge devices.
We conduct theoretical analysis to quantify the convergence performance of the proposed scheme.
arXiv Detail & Related papers (2024-12-09T11:43:03Z) - DeMo: Decoupled Momentum Optimization [6.169574689318864]
Training large neural networks typically requires sharing gradients and optimizer states between accelerators through specialized high-speed interconnects.
We introduce Decoupled Momentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements.
Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW.
arXiv Detail & Related papers (2024-11-29T17:31:47Z) - Efficient Asynchronous Federated Learning with Sparsification and
Quantization [55.6801207905772]
Federated Learning (FL) is attracting increasing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices throughout the model training process.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z) - Design and Prototyping Distributed CNN Inference Acceleration in Edge
Computing [85.74517957717363]
HALP accelerates inference by orchestrating seamless collaboration among edge devices (EDs) in edge computing.
Experiments show that the distributed inference HALP achieves 1.7x inference acceleration for VGG-16.
It is shown that the model selection with distributed inference HALP can significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z) - Efficient Graph Neural Network Inference at Large Scale [54.89457550773165]
Graph neural networks (GNNs) have demonstrated excellent performance in a wide range of applications.
Existing scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure.
We propose a novel adaptive propagation order approach that generates the personalized propagation order for each node based on its topological information.
arXiv Detail & Related papers (2022-11-01T14:38:18Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-based Soft Actor-Critic for discrete (SAC-d), which generates the exit point, partition point, and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a scheme can well adapt to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Fast-Convergent Federated Learning [82.32029953209542]
Federated learning is a promising solution for distributing machine learning tasks through modern networks of mobile devices.
We propose a fast-convergent federated learning algorithm, called FOLB, which performs intelligent sampling of devices in each round of model training.
arXiv Detail & Related papers (2020-07-26T14:37:51Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)