FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training
Framework for Heterogeneous Edge Devices
- URL: http://arxiv.org/abs/2110.02781v1
- Date: Wed, 6 Oct 2021 14:00:22 GMT
- Title: FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training
Framework for Heterogeneous Edge Devices
- Authors: Yuhao Chen, Qianqian Yang, Shibo He, Zhiguo Shi, Jiming Chen
- Abstract summary: FTPipeHD is a novel framework that trains deep learning models across heterogeneous devices.
It is shown that FTPipeHD is 6.8x faster in training than the state-of-the-art method when the computing capacity of the best device is 10x greater than the worst one.
- Score: 21.513786638743234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increased penetration and proliferation of Internet of Things (IoT)
devices, there is a growing trend towards distributing the power of deep
learning (DL) across edge devices rather than centralizing it in the cloud.
This development enables better privacy preservation, real-time responses, and
user-specific models. To deploy deep and complex models to edge devices with
limited resources, partitioning of the deep neural network (DNN) model is
necessary and has been widely studied. However, most of the existing
literature only considers distributing the inference model while still relying
on centralized cloud infrastructure to generate this model through training. In
this paper, we propose FTPipeHD, a novel DNN training framework that trains DNN
models across distributed heterogeneous devices with a fault tolerance mechanism.
To accelerate training under the time-varying computing power of each device, we
optimize the partition points dynamically according to real-time computing
capacities. We also propose a novel weight redistribution approach that
replicates the weights to both the neighboring nodes and the central node
periodically, which combats the failure of multiple devices during training
while incurring limited communication cost. Our numerical results demonstrate
that FTPipeHD is 6.8x faster in training than the state-of-the-art method when
the computing capacity of the best device is 10x greater than the worst one. It
is also shown that the proposed method accelerates training even
in the presence of device failures.
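
The abstract describes two mechanisms: choosing pipeline partition points dynamically from real-time computing capacities, and periodically replicating each worker's partition weights to a neighboring node and to the central node. The Python sketch below is a minimal illustration of both ideas, not the authors' implementation; the greedy proportional-split heuristic, the function names, and the `send` transport stub are assumptions made for illustration.

```python
from typing import Callable, Dict, List

def choose_partition_points(layer_costs: List[float],
                            capacities: List[float]) -> List[int]:
    """Pick pipeline cut points so each worker's share of the per-layer cost
    is roughly proportional to its measured computing capacity.
    A simple greedy heuristic for illustration, not the paper's optimizer."""
    total = sum(layer_costs)
    cum_cost = [0.0]
    for c in layer_costs:
        cum_cost.append(cum_cost[-1] + c)
    points: List[int] = []
    cum_target = 0.0
    for cap in capacities[:-1]:
        cum_target += total * cap / sum(capacities)
        # cut after the layer whose cumulative cost is closest to the target
        cut = min(range(1, len(layer_costs) + 1),
                  key=lambda i: abs(cum_cost[i] - cum_target))
        if points:                          # keep cuts strictly increasing
            cut = max(cut, points[-1] + 1)
        points.append(cut)
    return points

def replicate_weights(weights: Dict[str, bytes],
                      neighbor_id: int,
                      send: Callable[[int, Dict[str, bytes]], None],
                      central_id: int = 0) -> None:
    """Periodically copy this worker's partition weights to a neighboring
    node and to the central node so training can resume after a failure.
    `send` stands in for whatever transport the framework actually uses."""
    send(neighbor_id, weights)
    send(central_id, weights)

if __name__ == "__main__":
    # 12 equally expensive layers split across three devices whose fastest is
    # 10x the slowest (the middle capacity is chosen arbitrarily).
    print(choose_partition_points([1.0] * 12, [10.0, 5.0, 1.0]))  # e.g. [7, 11]
```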
Related papers
- Efficient Asynchronous Federated Learning with Sparsification and
Quantization [55.6801207905772]
Federated Learning (FL) is attracting more and more attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices throughout the model training process.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- Slimmable Encoders for Flexible Split DNNs in Bandwidth and Resource
Constrained IoT Systems [12.427821850039448]
We propose a novel split computing approach based on slimmable ensemble encoders.
The key advantage of our design is the ability to adapt the computational load and the transmitted data size in real time with minimal overhead and delay.
Our model outperforms existing solutions in terms of compression efficacy and execution time, especially in the context of weak mobile devices.
arXiv Detail & Related papers (2023-06-22T06:33:12Z)
- EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data
Reshaping for Online Adaptation or Personalization [11.44696439060875]
EF-Train is an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel.
It can achieve end-to-end training on resource-limited low-power edge-level FPGAs.
Our design achieves a throughput of 46.99 GFLOPS and an energy efficiency of 6.09 GFLOPS/W.
arXiv Detail & Related papers (2022-02-18T18:27:42Z)
- Parallel Successive Learning for Dynamic Distributed Model Training over
Heterogeneous Wireless Networks [50.68446003616802]
Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices.
We develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions.
Our analysis sheds light on the notion of cold vs. warmed-up models and on model inertia in distributed machine learning.
arXiv Detail & Related papers (2022-02-07T05:11:01Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete action spaces (SAC-d), which generates the exit point and the compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, this approach adapts well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Computational Intelligence and Deep Learning for Next-Generation
Edge-Enabled Industrial IoT [51.68933585002123]
We investigate how to deploy computational intelligence and deep learning (DL) in edge-enabled industrial IoT networks.
In this paper, we propose a novel multi-exit-based federated edge learning (ME-FEEL) framework.
In particular, the proposed ME-FEEL can achieve an accuracy gain of up to 32.7% in industrial IoT networks with severely limited resources.
arXiv Detail & Related papers (2021-10-28T08:14:57Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model with better performance than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- Sparse-Push: Communication- & Energy-Efficient Decentralized Distributed
Learning over Directed & Time-Varying Graphs with non-IID Datasets [2.518955020930418]
We propose Sparse-Push, a communication efficient decentralized distributed training algorithm.
The proposed algorithm enables a 466x reduction in communication with only a 1% degradation in performance.
We also demonstrate how communication compression can lead to significant performance degradation in the case of non-IID datasets.
arXiv Detail & Related papers (2021-02-10T19:41:11Z)
- Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network
Training [0.5219568203653523]
We develop a sparse DNN training accelerator that produces pruned models with the same accuracy as dense models, without first training, then pruning, and finally retraining a dense model.
Compared to training the equivalent unpruned models using a state-of-the-art DNN accelerator without sparse training support, Procrustes consumes up to 3.26x less energy and offers up to 4x speedup across a range of models, while pruning weights by an order of magnitude and maintaining unpruned accuracy.
arXiv Detail & Related papers (2020-09-23T07:39:55Z)
- Fast-Convergent Federated Learning [82.32029953209542]
Federated learning is a promising solution for distributing machine learning tasks through modern networks of mobile devices.
We propose a fast-convergent federated learning algorithm, called FOLB, which performs intelligent sampling of devices in each round of model training.
arXiv Detail & Related papers (2020-07-26T14:37:51Z)
- Deep Generative Models that Solve PDEs: Distributed Computing for
Training Large Data-Free Models [25.33147292369218]
Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs).
Here we report on a software framework for data-parallel distributed deep learning that resolves the twin challenges of training these large SciML models.
Our framework provides several out-of-the-box functionalities, including (a) loss integrity independent of the number of processes, (b) synchronized batch normalization, and (c) distributed higher-order optimization methods; a brief illustrative sketch follows this entry.
arXiv Detail & Related papers (2020-07-24T22:42:35Z)
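
The entry above lists synchronized batch normalization and a loss that is independent of the number of processes among the framework's out-of-the-box features. The PyTorch-style sketch below shows one common way such features are wired up in generic data-parallel training; it is an illustration under assumptions (an already-initialized process group, CUDA devices), not this paper's actual framework.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_data_parallel_model(model: torch.nn.Module) -> torch.nn.Module:
    """Swap BatchNorm layers for their synchronized counterpart and wrap the
    model for data-parallel training; assumes the default process group has
    already been initialized (e.g. via torchrun) and a GPU is available."""
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return DDP(model.cuda())

def global_mean_loss(local_loss: torch.Tensor) -> torch.Tensor:
    """Average the loss across all processes so the reported value does not
    depend on how many workers the job was launched with."""
    loss = local_loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    return loss / dist.get_world_size()
```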