AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel
Training
- URL: http://arxiv.org/abs/2311.05827v1
- Date: Fri, 10 Nov 2023 02:18:33 GMT
- Title: AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel
Training
- Authors: Yuhao Chen, Yuxuan Yan, Qianqian Yang, Yuanchao Shu, Shibo He, Zhiguo
Shi, Jiming Chen
- Abstract summary: We propose AccEPT, an acceleration scheme for edge collaborative pipeline-parallel training.
In particular, we propose a lightweight adaptive latency predictor to accurately estimate the latency of each layer on different devices.
Our numerical results demonstrate that the proposed acceleration approach speeds up edge pipeline-parallel training by up to 3 times.
- Score: 22.107070114339038
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: It is usually infeasible to fit and train an entire large deep neural network
(DNN) model using a single edge device due to the limited resources. To
facilitate intelligent applications across edge devices, researchers have
proposed partitioning a large model into several sub-models, and deploying each
of them to a different edge device to collaboratively train a DNN model.
However, the communication overhead caused by the large amount of data
transmitted from one device to another during training, as well as the
sub-optimal partition point due to inaccurate prediction of the computation
latency at each edge device, can significantly slow down training. In this
paper, we propose AccEPT, an acceleration scheme for edge collaborative
pipeline-parallel training. In particular, we propose a lightweight adaptive
latency predictor to accurately estimate the computation latency of each layer
on different devices, which also adapts to unseen devices through continuous
learning. The proposed latency predictor therefore leads to better model
partitioning that balances the computation loads across participating devices.
Moreover, we propose a bit-level, computation-efficient data compression scheme
to compress the data transmitted between devices during training. Our numerical
results demonstrate that the proposed acceleration approach speeds up edge
pipeline-parallel training by up to 3 times in the considered experimental
settings.
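The abstract does not spell out how the latency predictor or the partition search works, so the following is a minimal sketch of one way such components could be built: a lightweight per-layer regression model updated online as new latency measurements arrive (continuous learning), whose predictions feed a simple search for partition points that balances compute load across pipeline stages. The layer features, the SGDRegressor stand-in, and the greedy split are assumptions for illustration, not the paper's actual design.

```python
# Minimal sketch (not AccEPT's implementation): an online per-layer latency
# predictor plus a greedy load-balancing partition over its predictions.
import numpy as np
from sklearn.linear_model import SGDRegressor


class AdaptiveLatencyPredictor:
    """Predicts per-layer compute latency on a device from simple layer features
    (e.g., FLOPs, parameter count, activation size) -- feature choice is assumed."""

    def __init__(self):
        # Online linear model; partial_fit allows continuous adaptation
        # as measurements from new (unseen) devices arrive.
        self.model = SGDRegressor(alpha=1e-4)
        self.fitted = False

    def predict(self, layer_features):
        if not self.fitted:
            return 1.0  # uninformed prior before any profiling data
        return float(self.model.predict(np.asarray([layer_features]))[0])

    def update(self, layer_features, measured_latency):
        # Continuous learning: refine the model with each measured layer latency.
        self.model.partial_fit(np.asarray([layer_features]), [measured_latency])
        self.fitted = True


def balanced_partition(per_layer_latency, num_devices):
    """Greedily split consecutive layers into pipeline stages with
    roughly equal predicted compute load; returns the cut indices."""
    target = sum(per_layer_latency) / num_devices
    cuts, acc = [], 0.0
    for i, t in enumerate(per_layer_latency[:-1]):
        acc += t
        if acc >= target and len(cuts) < num_devices - 1:
            cuts.append(i + 1)  # next stage starts at layer i + 1
            acc = 0.0
    return cuts
```

In this sketch each participating device would profile a few layers, call `update`, and the partition search would be re-run whenever the predictions change noticeably; the paper's predictor architecture and partitioning algorithm may differ.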
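The bit-level data compression scheme is likewise only described at a high level. The sketch below shows one plausible realization consistent with that description: uniform low-bit quantization with bitwise packing of the tensors exchanged between pipeline stages. The 4-bit default, the per-tensor scaling, and the packing layout are illustrative assumptions, not AccEPT's actual scheme.

```python
# Minimal sketch (assumed, not AccEPT's exact scheme): uniform low-bit
# quantization and bit packing of tensors sent between pipeline stages.
import numpy as np


def compress(tensor, bits=4):
    """Quantize a float tensor to `bits`-bit integers and pack the bits into bytes."""
    levels = (1 << bits) - 1
    t_min, t_max = float(tensor.min()), float(tensor.max())
    scale = (t_max - t_min) / levels if t_max > t_min else 1.0
    q = np.round((tensor - t_min) / scale).astype(np.uint8)   # integers in [0, levels]
    # Keep only the low `bits` bits of every value and pack them contiguously.
    bit_matrix = np.unpackbits(q[..., None], axis=-1)[..., -bits:]
    payload = np.packbits(bit_matrix)
    return payload, tensor.shape, t_min, scale


def decompress(payload, shape, t_min, scale, bits=4):
    """Unpack the bit stream and map the integers back to approximate float values."""
    n = int(np.prod(shape))
    bit_matrix = np.unpackbits(payload)[: n * bits].reshape(n, bits)
    q = np.packbits(bit_matrix, axis=-1) >> (8 - bits)        # recover the integers
    return q.reshape(shape).astype(np.float32) * scale + t_min
```

For float32 activations or gradients this gives roughly a 32/bits compression ratio plus a few bytes of metadata per tensor; error feedback or finer-grained (e.g., per-channel) scaling could be layered on top, and the actual AccEPT scheme is likely more elaborate.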
Related papers
- Federated Split Learning with Model Pruning and Gradient Quantization in Wireless Networks [7.439160287320074]
Federated split learning (FedSL) implements collaborative training across the edge devices and the server through model splitting.
We propose a lightweight FedSL scheme that further alleviates the training burden on resource-constrained edge devices.
We conduct theoretical analysis to quantify the convergence performance of the proposed scheme.
arXiv Detail & Related papers (2024-12-09T11:43:03Z) - DeMo: Decoupled Momentum Optimization [6.169574689318864]
Training large neural networks typically requires sharing gradients and optimizer states between accelerators through specialized high-speed interconnects.
We introduce Decoupled Momentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements.
Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW.
arXiv Detail & Related papers (2024-11-29T17:31:47Z) - Efficient Asynchronous Federated Learning with Sparsification and
Quantization [55.6801207905772]
Federated Learning (FL) is attracting increasing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices throughout the model training process.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z) - Design and Prototyping Distributed CNN Inference Acceleration in Edge
Computing [85.74517957717363]
HALP accelerates inference by orchestrating seamless collaboration among edge devices (EDs) in edge computing.
Experiments show that the distributed inference HALP achieves 1.7x inference acceleration for VGG-16.
It is shown that the model selection with distributed inference HALP can significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z) - Efficient Graph Neural Network Inference at Large Scale [54.89457550773165]
Graph neural networks (GNNs) have demonstrated excellent performance in a wide range of applications.
Existing scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure.
We propose a novel adaptive propagation order approach that generates the personalized propagation order for each node based on its topological information.
arXiv Detail & Related papers (2022-11-01T14:38:18Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-based Soft Actor-Critic for discrete (SAC-d), which generates the exit point, partition point, and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a scheme can well adapt to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Fast-Convergent Federated Learning [82.32029953209542]
Federated learning is a promising solution for distributing machine learning tasks through modern networks of mobile devices.
We propose a fast-convergent federated learning algorithm, called FOLB, which performs intelligent sampling of devices in each round of model training.
arXiv Detail & Related papers (2020-07-26T14:37:51Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)