LCP: A Low-Communication Parallelization Method for Fast Neural Network
Inference in Image Recognition
- URL: http://arxiv.org/abs/2003.06464v2
- Date: Tue, 17 Nov 2020 17:15:43 GMT
- Authors: Ramyad Hadidi, Bahar Asgari, Jiashen Cao, Younmin Bae, Da Eun Shim,
Hyojong Kim, Sung-Kyu Lim, Michael S. Ryoo, Hyesoon Kim
- Abstract summary: We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches.
We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards.
LCP models achieve maximum and average speedups of 56x and 7x over the original models, which can be raised to an average speedup of 33x with common optimizations.
- Score: 33.581285906182075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) have inspired new studies in myriad edge
applications with robots, autonomous agents, and Internet-of-things (IoT)
devices. However, performing DNN inference at the edge remains a severe challenge, mainly because of the mismatch between the intensive resource requirements of DNNs and the tight resource availability in many edge domains. Further, because communication is costly, taking advantage of other
available edge devices by using data- or model-parallelism methods is not an
effective solution. To benefit from available compute resources with low
communication overhead, we propose the first DNN parallelization method for
reducing the communication overhead in a distributed system. We propose a
low-communication parallelization (LCP) method in which models consist of
several almost-independent and narrow branches. LCP offers close-to-minimum
communication overhead with better distribution and parallelization
opportunities while significantly reducing memory footprint and computation
compared to data- and model-parallelism methods. We deploy LCP models on three
distributed systems: AWS instances, Raspberry Pis, and PYNQ boards. We also
evaluate the performance of LCP models on customized hardware (tailored for low latency) implemented on a small edge FPGA and as a 16 mW, 0.107 mm² ASIC at 7 nm
chip. LCP models achieve maximum and average speedups of 56x and 7x compared to the originals, which can be improved up to an average speedup of 33x by incorporating common optimizations such as pruning and quantization.
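To make the branch structure concrete, here is a minimal PyTorch sketch of a model built from several narrow, almost-independent branches in the spirit of LCP. The layer choices, widths, and logit-averaging combiner are illustrative assumptions, not the authors' implementation; the point is that no activations cross branches, so a distributed deployment needs only an input broadcast and a final gather.

```python
import torch
import torch.nn as nn

class NarrowBranch(nn.Module):
    """One narrow branch: a thin stack of conv layers ending in a classifier."""
    def __init__(self, in_ch: int, width: int, num_classes: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(width, num_classes),
        )

    def forward(self, x):
        return self.body(x)

class LCPStyleModel(nn.Module):
    """Several almost-independent branches. No activations cross branches,
    so a distributed deployment only broadcasts the input and gathers logits."""
    def __init__(self, num_branches=4, in_ch=3, width=16, num_classes=10):
        super().__init__()
        self.branches = nn.ModuleList(
            NarrowBranch(in_ch, width, num_classes) for _ in range(num_branches)
        )

    def forward(self, x):
        # Each branch could run on its own device; here they run sequentially.
        logits = [branch(x) for branch in self.branches]
        return torch.stack(logits).mean(dim=0)  # final gather: average logits

model = LCPStyleModel()
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```

Because each branch is narrow, per-device memory and compute shrink with the number of branches, which is consistent with the footprint reductions the abstract reports.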
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
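The summary does not specify FusionLLM's compression scheme; as a generic illustration of the kind of communication-compression primitive such systems adapt, here is a hypothetical top-k gradient sparsification sketch:

```python
import numpy as np

def topk_compress(grad: np.ndarray, ratio: float):
    """Keep only the largest-magnitude entries of a gradient (sparsification),
    a common communication-compression primitive in distributed training."""
    k = max(1, int(grad.size * ratio))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def decompress(idx: np.ndarray, vals: np.ndarray, size: int) -> np.ndarray:
    out = np.zeros(size)
    out[idx] = vals
    return out

g = np.random.default_rng(0).normal(size=1000)
idx, vals = topk_compress(g, ratio=0.05)        # transmit ~5% of the entries
g_hat = decompress(idx, vals, g.size)
print(f"sent {idx.size}/{g.size} values, "
      f"kept {np.linalg.norm(g_hat) / np.linalg.norm(g):.0%} of gradient norm")
```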
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Accelerating Split Federated Learning over Wireless Communication Networks [17.97006656280742]
We consider a split federated learning (SFL) framework that combines the parallel model training mechanism of federated learning (FL) and the model splitting structure of split learning (SL).
We formulate a joint problem of split point selection and bandwidth allocation to minimize the system latency.
Experimental results demonstrate the superiority of our approach in latency reduction and accuracy improvement.
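As a toy illustration of that joint problem, the sketch below brute-forces the split point for one client under made-up per-layer compute times, activation sizes, and an assumed equal bandwidth share; the paper's actual multi-client formulation is more involved:

```python
# Hypothetical profile: per-layer client/server compute times (ms) and the
# size of the activation sent if we split after that layer (MB).
client_ms = [4.0, 6.0, 8.0, 10.0, 12.0]
server_ms = [1.0, 1.5, 2.0, 2.5, 3.0]
act_mb    = [3.0, 1.5, 0.8, 0.4, 0.1]

TOTAL_BW_MBPS = 10.0   # shared uplink bandwidth (MB/s)
NUM_CLIENTS = 2        # assume an equal bandwidth share per client

def latency(split: int, bw_mbps: float) -> float:
    """End-to-end latency (ms) if layers [0, split] run on the device."""
    on_device = sum(client_ms[: split + 1])
    uplink = act_mb[split] / bw_mbps * 1000.0
    on_server = sum(server_ms[split + 1 :])
    return on_device + uplink + on_server

best_split, best_ms = min(
    ((s, latency(s, TOTAL_BW_MBPS / NUM_CLIENTS)) for s in range(len(client_ms))),
    key=lambda t: t[1],
)
print(f"best split point: after layer {best_split} ({best_ms:.1f} ms)")
```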
arXiv Detail & Related papers (2023-10-24T07:49:56Z)
- Combining Multi-Objective Bayesian Optimization with Reinforcement Learning for TinyML [4.2019872499238256]
We propose a novel strategy for deploying Deep Neural Networks on microcontrollers (TinyML) based on Multi-Objective Bayesian optimization (MOBOpt).
Our methodology aims at efficiently finding tradeoffs between a DNN's predictive accuracy, memory consumption on a given target system, and computational complexity.
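A full MOBOpt loop also needs a surrogate model and an acquisition function; the sketch below shows only the Pareto-dominance test that defines those tradeoffs, over made-up (accuracy, memory, compute) tuples:

```python
# Made-up candidate deployments: (accuracy %, memory KB, MMACs).
candidates = {
    "cfg_a": (91.2, 310, 58),
    "cfg_b": (89.5, 140, 31),
    "cfg_c": (90.8, 400, 77),
    "cfg_d": (88.1, 120, 29),
    "cfg_e": (91.0, 150, 60),
}

def dominates(p, q):
    """p dominates q: no worse on every objective (higher accuracy,
    lower memory, lower compute) and strictly better on at least one."""
    no_worse = p[0] >= q[0] and p[1] <= q[1] and p[2] <= q[2]
    strictly = p[0] > q[0] or p[1] < q[1] or p[2] < q[2]
    return no_worse and strictly

pareto = [
    name for name, p in candidates.items()
    if not any(dominates(q, p) for q in candidates.values())
]
print("Pareto-optimal configs:", pareto)  # ['cfg_a', 'cfg_b', 'cfg_d', 'cfg_e']
```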
arXiv Detail & Related papers (2023-05-23T14:31:52Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
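The core computation behind such receptive-field-based segmentation is mapping each device's slice of the output back to the input region its block of layers needs. A minimal 1-D sketch with hypothetical layer shapes:

```python
# Hypothetical fused block: (kernel, stride, padding) for each conv layer.
layers = [(3, 1, 1), (3, 2, 1), (3, 1, 1)]

def input_range(out_lo: int, out_hi: int):
    """Back-propagate an output row range through the block to find the input
    rows it depends on (1-D receptive-field computation). Negative indices
    fall in the zero-padding region and can be clamped to 0."""
    lo, hi = out_lo, out_hi
    for k, s, p in reversed(layers):
        lo = lo * s - p
        hi = hi * s - p + (k - 1)
    return lo, hi

# Split 56 output rows across 2 devices; each needs an overlapping input slice.
out_rows, devices = 56, 2
per_dev = out_rows // devices
for d in range(devices):
    lo, hi = input_range(d * per_dev, (d + 1) * per_dev - 1)
    print(f"device {d}: output rows [{d * per_dev}, {(d + 1) * per_dev - 1}] "
          f"need input rows [{max(lo, 0)}, {hi}]")
```

The overlap between the devices' input slices is exactly the redundant data that fused-layer partitioning trades for avoiding per-layer communication.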
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete (SAC-d), which generates the exit point, partition point, and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Communication-Efficient Separable Neural Network for Distributed Inference on Edge Devices [2.28438857884398]
We propose a novel method of exploiting model parallelism to separate a neural network for distributed inference.
Under proper specifications of devices and configurations of models, our experiments show that the inference of large neural networks on edge clusters can be distributed and accelerated.
arXiv Detail & Related papers (2021-11-03T19:30:28Z)
- Computational Intelligence and Deep Learning for Next-Generation Edge-Enabled Industrial IoT [51.68933585002123]
We investigate how to deploy computational intelligence and deep learning (DL) in edge-enabled industrial IoT networks.
In this paper, we propose a novel multi-exit-based federated edge learning (ME-FEEL) framework.
In particular, the proposed ME-FEEL can achieve an accuracy gain of up to 32.7% in industrial IoT networks with severely limited resources.
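The sketch below shows only the generic early-exit rule that multi-exit models rely on at inference time, with made-up logits and threshold; ME-FEEL's federated training and exit aggregation are not modeled:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up logits from three exit heads of increasing depth (and cost);
# deeper heads tend to be more confident, modeled here by larger scales.
exit_logits = [rng.normal(size=10) * scale for scale in (1.0, 2.0, 4.0)]
CONF_THRESHOLD = 0.6

for head, logits in enumerate(exit_logits):
    probs = softmax(logits)
    # Stop at the first sufficiently confident head; always exit at the last.
    if probs.max() >= CONF_THRESHOLD or head == len(exit_logits) - 1:
        print(f"exited at head {head}: class {probs.argmax()}, "
              f"confidence {probs.max():.2f}")
        break
```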
arXiv Detail & Related papers (2021-10-28T08:14:57Z)
- Deep Learning-based Resource Allocation For Device-to-Device Communication [66.74874646973593]
We propose a framework for the optimization of the resource allocation in multi-channel cellular systems with device-to-device (D2D) communication.
A deep learning (DL) framework is proposed, where the optimal resource allocation strategy for arbitrary channel conditions is approximated by deep neural network (DNN) models.
Our simulation results confirm that near-optimal performance can be attained with low computation time, which underlines the real-time capability of the proposed scheme.
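A minimal sketch of that general pattern (not the paper's model or loss): train a small DNN offline to imitate an allocation rule, so that at run time allocation is a single cheap forward pass; the stand-in rule and network are assumptions:

```python
import torch
import torch.nn as nn

def target_alloc(h: torch.Tensor) -> torch.Tensor:
    # Stand-in "optimal" rule (power proportional to channel gain), used only
    # to illustrate imitating an allocation policy with a DNN.
    return h / h.sum(dim=1, keepdim=True)

net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 4),
                    nn.Softmax(dim=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(500):                  # offline: learn the allocation rule
    h = torch.rand(64, 4) + 0.1          # random channel conditions
    loss = nn.functional.mse_loss(net(h), target_alloc(h))
    opt.zero_grad()
    loss.backward()
    opt.step()

h_new = torch.rand(1, 4) + 0.1           # online: one cheap forward pass
print("allocation:", net(h_new).detach().numpy().round(3))
```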
arXiv Detail & Related papers (2020-11-25T14:19:23Z)
- Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
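As a toy illustration of partitioned parameter-server training, the sketch below lets each worker update one block of the global parameter vector per round; the quadratic loss and step size are stand-ins, and PARTEL's wireless resource allocation is not modeled:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, num_workers = 8, 4
theta = rng.normal(size=dim)                       # global model at the server
blocks = np.array_split(np.arange(dim), num_workers)  # one block per worker

def block_gradient(block: np.ndarray, theta: np.ndarray) -> np.ndarray:
    # Stand-in gradient of a toy quadratic loss ||theta||^2 for this block.
    return 2.0 * theta[block]

for rnd in range(100):                 # one communication round per iteration
    for block in blocks:               # each worker updates only its partition
        theta[block] -= 0.05 * block_gradient(block, theta)

print("parameters after 100 rounds:", np.round(theta, 3))  # -> ~all zeros
```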
arXiv Detail & Related papers (2020-10-08T15:27:50Z)