Pipeline Parallelism for Inference on Heterogeneous Edge Computing
- URL: http://arxiv.org/abs/2110.14895v1
- Date: Thu, 28 Oct 2021 05:20:51 GMT
- Title: Pipeline Parallelism for Inference on Heterogeneous Edge Computing
- Authors: Yang Hu, Connor Imes, Xuanang Zhao, Souvik Kundu, Peter A. Beerel,
Stephen P. Crago, John Paul N. Walters
- Abstract summary: Deep neural networks with large model sizes achieve state-of-the-art results for tasks in computer vision (CV) and natural language processing (NLP).
These large-scale models are too compute- or memory-intensive for resource-constrained edge devices.
We propose EdgePipe, a distributed framework for edge systems that uses pipeline parallelism to both speed up inference and enable running larger models that otherwise cannot fit on single edge devices.
- Score: 9.745025902229882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks with large model sizes achieve state-of-the-art results
for tasks in computer vision (CV) and natural language processing (NLP).
However, these large-scale models are too compute- or memory-intensive for
resource-constrained edge devices. Prior works on parallel and distributed
execution primarily focus on training -- rather than inference -- using
homogeneous accelerators in data centers. We propose EdgePipe, a distributed
framework for edge systems that uses pipeline parallelism to both speed up
inference and enable running larger (and more accurate) models that otherwise
cannot fit on single edge devices. EdgePipe achieves these results by using an
optimal partition strategy that considers heterogeneity in compute, memory, and
network bandwidth. Our empirical evaluation demonstrates that EdgePipe achieves
$10.59\times$ and $11.88\times$ speedup using 16 edge devices for the ViT-Large
and ViT-Huge models, respectively, with no accuracy loss. Similarly, EdgePipe
improves ViT-Huge throughput by $3.93\times$ over a 4-node baseline using 16
edge devices, which independently cannot fit the model in memory. Finally, we
show up to $4.16\times$ throughput improvement over the state-of-the-art
PipeDream when using a heterogeneous set of devices.
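The partition strategy sketched below is only an illustration of the idea in the abstract, not EdgePipe's published algorithm: given per-layer compute and memory costs and per-device speed, memory, and bandwidth (all values here are hypothetical), choose stage boundaries that fit each device's memory and minimize the bottleneck stage time, which bounds pipeline throughput.

```python
"""Minimal sketch of pipeline partitioning over heterogeneous edge devices.

This is NOT EdgePipe's published algorithm. It only illustrates the idea the
abstract describes: pick layer-to-stage boundaries that respect each device's
memory limit and minimize the bottleneck (slowest) stage, since the slowest
stage bounds pipeline throughput. All numbers are made-up placeholders, and
the device order is assumed fixed.
"""
from functools import lru_cache

# Hypothetical per-layer costs: (compute in GFLOPs, weights in MB, activation out in MB).
LAYERS = [(4.0, 30.0, 6.0), (4.0, 30.0, 6.0), (8.0, 60.0, 6.0),
          (8.0, 60.0, 6.0), (2.0, 15.0, 3.0), (2.0, 15.0, 3.0)]

# Hypothetical devices in pipeline order: (GFLOP/s, memory in MB, uplink in MB/s).
DEVICES = [(20.0, 150.0, 10.0), (10.0, 100.0, 12.0), (30.0, 200.0, 8.0)]


def stage_time(first, last, dev):
    """Compute + transmit time for device `dev` running layers [first, last].
    For simplicity the output transfer is charged to every stage, including the last."""
    flops = sum(LAYERS[i][0] for i in range(first, last + 1))
    return flops / DEVICES[dev][0] + LAYERS[last][2] / DEVICES[dev][2]


def stage_fits(first, last, dev):
    """Check the stage's weight memory against the device's budget."""
    return sum(LAYERS[i][1] for i in range(first, last + 1)) <= DEVICES[dev][1]


@lru_cache(maxsize=None)
def best_partition(layer, dev):
    """Minimal bottleneck time for LAYERS[layer:] placed on DEVICES[dev:]."""
    if layer == len(LAYERS):
        return 0.0, ()                      # all layers placed
    if dev == len(DEVICES):
        return float("inf"), ()             # layers left but no devices left
    best_time, best_cuts = float("inf"), ()
    for last in range(layer, len(LAYERS)):
        if not stage_fits(layer, last, dev):
            break                           # adding layers only increases memory use
        rest_time, rest_cuts = best_partition(last + 1, dev + 1)
        bottleneck = max(stage_time(layer, last, dev), rest_time)
        if bottleneck < best_time:
            best_time, best_cuts = bottleneck, ((layer, last),) + rest_cuts
    return best_time, best_cuts


if __name__ == "__main__":
    time, stages = best_partition(0, 0)
    print(f"bottleneck stage time: {time:.3f}s  stages (first, last): {stages}")
```

With more layers and devices this search is typically done with dynamic programming or heuristics; the memoized recursion above is only meant to show the trade-off such a partitioner balances.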
Related papers
- Iterative Filter Pruning for Concatenation-based CNN Architectures [9.651318927588934]
Modern object detectors have highly interconnected convolutional layers with concatenations.
We propose a method to handle concatenation layers, based on the connectivity graph of convolutional layers.
We deploy pruned models to FPGA and NVIDIA Jetson Xavier AGX.
arXiv Detail & Related papers (2024-05-04T19:40:42Z)
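A minimal sketch of the bookkeeping that pruning through a concatenation requires, which is the kind of dependency a connectivity graph of convolutional layers captures: removed output filters of each branch must be mapped to input-channel indices of the layer that consumes the concatenation. The layer names and widths below are hypothetical, and this is not the paper's pruning criterion or its FPGA/Jetson deployment flow.

```python
# Hypothetical concatenation: conv_a (64 filters) and conv_b (128 filters) are
# concatenated channel-wise and consumed by conv_c.
branch_widths = {"conv_a": 64, "conv_b": 128}
concat_order = ["conv_a", "conv_b"]

def pruned_input_channels(pruned_filters):
    """Map pruned output filters of each branch to the input-channel indices that
    must also be removed from the layer consuming the concatenation."""
    removed, offset = [], 0
    for name in concat_order:
        removed += [offset + f for f in pruned_filters.get(name, [])]
        offset += branch_widths[name]
    return sorted(removed)

# Pruning filter 3 of conv_a and filters 0 and 5 of conv_b also removes
# input channels 3, 64, and 69 of conv_c.
print(pruned_input_channels({"conv_a": [3], "conv_b": [0, 5]}))
```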
- ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is $1.38\times$ higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z)
- AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training [22.107070114339038]
We propose AccEPT, an acceleration scheme for edge collaborative pipeline-parallel training.
In particular, we propose a lightweight adaptive latency predictor to accurately estimate the latency of each layer on different devices.
Our numerical results demonstrate that the proposed acceleration approach can speed up edge pipeline-parallel training by up to 3 times.
arXiv Detail & Related papers (2023-11-10T02:18:33Z)
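The adaptive latency predictor mentioned above is not reproduced here; the sketch below only shows the general shape of such a component, assuming a simple linear model fit by least squares from hypothetical per-layer features (FLOPs, parameter count) to measured latency on a single device.

```python
import numpy as np

# Hypothetical profiling samples for ONE device: [GFLOPs, parameters in M] -> measured latency (ms).
features = np.array([[0.5, 1.2], [1.0, 2.5], [2.0, 4.8], [4.0, 9.5]])
latency_ms = np.array([3.1, 5.9, 11.2, 22.5])

# Fit latency ~= w . features + b with ordinary least squares.
X = np.hstack([features, np.ones((len(features), 1))])
coef, *_ = np.linalg.lstsq(X, latency_ms, rcond=None)

def predict_latency(gflops, params_m):
    """Predict per-layer latency on this device from simple layer features."""
    return float(np.array([gflops, params_m, 1.0]) @ coef)

# Example query for a hypothetical layer.
print(f"predicted latency for a 3 GFLOP, 7M-parameter layer: {predict_latency(3.0, 7.0):.1f} ms")
```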
- Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing [85.74517957717363]
HALP accelerates inference by orchestrating seamless collaboration among edge devices (EDs) in edge computing.
Experiments show that distributed inference with HALP achieves a $1.7\times$ inference acceleration for VGG-16.
It is shown that model selection with distributed inference HALP can significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
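Fused-layer partitioning of the kind summarized above has to map each device's output slice back to the input region (plus halo) it depends on, which follows from the receptive field of the fused block. The sketch below shows that calculation for a 1-D slice through a stack of convolutions with hypothetical kernel/stride/padding values; it is not the paper's segmentation algorithm.

```python
def input_rows_for_output_slice(out_start, out_end, layers):
    """Walk a block of (fused) conv layers in reverse to find which input rows an
    output-row slice [out_start, out_end) depends on.  `layers` holds
    (kernel, stride, padding) per layer; the values below are hypothetical."""
    start, end = out_start, out_end - 1
    for kernel, stride, padding in reversed(layers):
        start = start * stride - padding
        end = end * stride - padding + (kernel - 1)
    # Clamp the lower bound; the upper bound would be clamped to the real input height.
    return max(start, 0), end + 1

# Hypothetical fused block: three 3x3 convolutions with stride 1 and padding 1.
block = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]

# Each device computes a horizontal slice of the output and needs a small "halo"
# of extra input rows determined by the block's receptive field.
print(input_rows_for_output_slice(0, 16, block))    # device A -> (0, 19)
print(input_rows_for_output_slice(16, 32, block))   # device B -> (13, 35)
```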
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation framework can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
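As a generic illustration of the neighborhood sampling referred to above (not the paper's performance-engineered sampler), the sketch below draws a fixed fan-out of neighbors per hop from a toy adjacency list.

```python
import random

def sample_neighborhood(adj, seeds, fanouts, rng=None):
    """Uniform neighborhood sampling for a mini-batch: for each hop, keep at most
    `fanout` randomly chosen neighbors per frontier node."""
    rng = rng or random.Random(0)
    sampled_per_hop, frontier = [], list(seeds)
    for fanout in fanouts:
        hop = {}
        for node in frontier:
            neighbors = adj.get(node, [])
            hop[node] = rng.sample(neighbors, min(fanout, len(neighbors)))
        sampled_per_hop.append(hop)
        frontier = sorted({v for nbrs in hop.values() for v in nbrs})
    return sampled_per_hop

# Toy adjacency lists and a 2-hop sample with fan-outs (2, 2) from seed node 0.
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0, 5, 6], 3: [0], 4: [1], 5: [2], 6: [2]}
print(sample_neighborhood(adj, seeds=[0], fanouts=[2, 2]))
```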
- 1$\times$N Block Pattern for Network Sparsity [90.43191747596491]
We propose a novel concept of a $1\times N$ block sparsity pattern (block pruning) to break this limitation.
Our pattern achieves about a 3.0% improvement over filter pruning in the top-1 accuracy of MobileNet-V2.
It also achieves 56.04 ms of inference-time savings over weight pruning on a Cortex-A7 CPU.
arXiv Detail & Related papers (2021-05-31T05:50:33Z)
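The sketch below illustrates magnitude pruning at $1\times N$ granularity, grouping N consecutive output channels per input channel; treat the grouping, scoring, and threshold rule as illustrative assumptions rather than the paper's exact scheme.

```python
import numpy as np

def one_by_n_prune(weight, n=4, sparsity=0.5):
    """Block-magnitude pruning at 1xN granularity: group N consecutive output
    channels within each input channel, rank blocks by L1 norm, and zero the
    weakest fraction.  The grouping and threshold rule are illustrative, not
    the paper's exact training recipe."""
    c_out, c_in = weight.shape
    assert c_out % n == 0, "output channels must be divisible by the block size"
    blocks = weight.reshape(c_out // n, n, c_in)      # (num_blocks, n, c_in)
    scores = np.abs(blocks).sum(axis=1)               # L1 norm of each 1xN block
    threshold = np.sort(scores, axis=None)[int(scores.size * sparsity)]
    mask = (scores >= threshold)[:, None, :]          # broadcast over the N dimension
    return (blocks * mask).reshape(c_out, c_in)

# Made-up weight matrix: 8 output channels x 6 input channels, pruned at 50%.
w = np.random.default_rng(0).normal(size=(8, 6))
pruned = one_by_n_prune(w, n=4, sparsity=0.5)
print(f"nonzero weights: {np.count_nonzero(pruned)} of {w.size}")
```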
- Pipelined Training with Stale Weights of Deep Convolutional Neural Networks [0.1921787217122713]
We explore the impact of stale weights on the statistical efficiency and performance in a pipelined backpropagation scheme.
We show that when pipelining is limited to early layers in a network, training with stale weights converges and results in models with comparable inference accuracies.
Pipelining deeper layers, however, degrades accuracy; we propose combining pipelined and non-pipelined training in a hybrid scheme to address this drop.
arXiv Detail & Related papers (2019-12-29T15:28:13Z)
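The toy simulation below is not the paper's pipelined training scheme; it only illustrates the stale-weight effect being studied, by running SGD on a made-up least-squares problem where each gradient is computed from weights that are several steps old.

```python
import numpy as np

def sgd_with_stale_gradients(staleness, steps=200, lr=0.05, seed=0):
    """Toy model of the stale-weight effect in pipelined backpropagation: the
    gradient applied at step t was computed from the weights of step
    t - staleness, as happens for early pipeline stages without weight stashing."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(256, 4))
    true_w = np.array([1.0, -2.0, 0.5, 3.0])
    y = X @ true_w
    w, history = np.zeros(4), [np.zeros(4)]
    for t in range(steps):
        w_stale = history[max(0, t - staleness)]   # weight version used for the gradient
        grad = X.T @ (X @ w_stale - y) / len(X)
        w = w - lr * grad                          # update applied to the current weights
        history.append(w.copy())
    return float(np.linalg.norm(w - true_w))

for s in (0, 1, 4):
    print(f"staleness={s}: final weight error {sgd_with_stale_gradients(s):.4f}")
```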
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.