LCP: A Low-Communication Parallelization Method for Fast Neural Network
Inference in Image Recognition
- URL: http://arxiv.org/abs/2003.06464v2
- Date: Tue, 17 Nov 2020 17:15:43 GMT
- Authors: Ramyad Hadidi, Bahar Asgari, Jiashen Cao, Younmin Bae, Da Eun Shim,
Hyojong Kim, Sung-Kyu Lim, Michael S. Ryoo, Hyesoon Kim
- Abstract summary: We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches.
We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards.
LCP models achieve maximum and average speedups of 56x and 7x over the original models, which can be raised to an average speedup of 33x with common optimizations.
- Score: 33.581285906182075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) have inspired new studies in myriad edge
applications with robots, autonomous agents, and Internet-of-things (IoT)
devices. However, performing DNN inference at the edge remains a severe challenge, mainly because of the mismatch between the intensive resource requirements of DNNs and the tight resource availability in many edge domains. Further, because communication is costly, taking advantage of other
available edge devices by using data- or model-parallelism methods is not an
effective solution. To benefit from available compute resources with low
communication overhead, we propose the first DNN parallelization method for
reducing the communication overhead in a distributed system. We propose a
low-communication parallelization (LCP) method in which models consist of
several almost-independent and narrow branches. LCP offers close-to-minimum
communication overhead with better distribution and parallelization
opportunities while significantly reducing memory footprint and computation
compared to data- and model-parallelism methods. We deploy LCP models on three
distributed systems: AWS instances, Raspberry Pis, and PYNQ boards. We also
evaluate the performance of LCP models on customized hardware (tailored for low latency) implemented on a small edge FPGA and as a 16 mW, 0.107 mm² ASIC at 7 nm
chip. LCP models achieve maximum and average speedups of 56x and 7x compared to the originals, which can be improved up to an average speedup of 33x by incorporating common optimizations such as pruning and quantization.
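To make the branch structure concrete, here is a minimal PyTorch sketch of a model built from several narrow, almost-independent branches in the spirit of LCP. The layer choices, widths, and logit-averaging combiner are illustrative assumptions, not the authors' implementation; the point is that no activations cross branches, so a distributed deployment needs only an input broadcast and a final gather.

```python
import torch
import torch.nn as nn

class NarrowBranch(nn.Module):
    """One narrow branch: a thin stack of conv layers ending in a classifier."""
    def __init__(self, in_ch: int, width: int, num_classes: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(width, num_classes),
        )

    def forward(self, x):
        return self.body(x)

class LCPStyleModel(nn.Module):
    """Several almost-independent branches. No activations cross branches,
    so a distributed deployment only broadcasts the input and gathers logits."""
    def __init__(self, num_branches=4, in_ch=3, width=16, num_classes=10):
        super().__init__()
        self.branches = nn.ModuleList(
            NarrowBranch(in_ch, width, num_classes) for _ in range(num_branches)
        )

    def forward(self, x):
        # Each branch could run on its own device; here they run sequentially.
        logits = [branch(x) for branch in self.branches]
        return torch.stack(logits).mean(dim=0)  # final gather: average logits

model = LCPStyleModel()
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```

Because each branch is narrow, per-device memory and compute shrink with the number of branches, which is consistent with the footprint reductions the abstract reports.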
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
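The summary does not specify FusionLLM's compression scheme; as a generic illustration of the kind of communication-compression primitive such systems adapt, here is a hypothetical top-k gradient sparsification sketch:

```python
import numpy as np

def topk_compress(grad: np.ndarray, ratio: float):
    """Keep only the largest-magnitude entries of a gradient (sparsification),
    a common communication-compression primitive in distributed training."""
    k = max(1, int(grad.size * ratio))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def decompress(idx: np.ndarray, vals: np.ndarray, size: int) -> np.ndarray:
    out = np.zeros(size)
    out[idx] = vals
    return out

g = np.random.default_rng(0).normal(size=1000)
idx, vals = topk_compress(g, ratio=0.05)        # transmit ~5% of the entries
g_hat = decompress(idx, vals, g.size)
print(f"sent {idx.size}/{g.size} values, "
      f"kept {np.linalg.norm(g_hat) / np.linalg.norm(g):.0%} of gradient norm")
```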
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Accelerating Split Federated Learning over Wireless Communication Networks [17.97006656280742]
We consider a split federated learning (SFL) framework that combines the parallel model training mechanism of federated learning (FL) and the model splitting structure of split learning (SL).
We formulate a joint problem of split point selection and bandwidth allocation to minimize the system latency.
Experimental results demonstrate the superiority of our approach in latency reduction and accuracy improvement.
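As a toy illustration of that joint problem, the sketch below brute-forces the split point for one client under made-up per-layer compute times, activation sizes, and an assumed equal bandwidth share; the paper's actual multi-client formulation is more involved:

```python
# Hypothetical profile: per-layer client/server compute times (ms) and the
# size of the activation sent if we split after that layer (MB).
client_ms = [4.0, 6.0, 8.0, 10.0, 12.0]
server_ms = [1.0, 1.5, 2.0, 2.5, 3.0]
act_mb    = [3.0, 1.5, 0.8, 0.4, 0.1]

TOTAL_BW_MBPS = 10.0   # shared uplink bandwidth (MB/s)
NUM_CLIENTS = 2        # assume an equal bandwidth share per client

def latency(split: int, bw_mbps: float) -> float:
    """End-to-end latency (ms) if layers [0, split] run on the device."""
    on_device = sum(client_ms[: split + 1])
    uplink = act_mb[split] / bw_mbps * 1000.0
    on_server = sum(server_ms[split + 1 :])
    return on_device + uplink + on_server

best_split, best_ms = min(
    ((s, latency(s, TOTAL_BW_MBPS / NUM_CLIENTS)) for s in range(len(client_ms))),
    key=lambda t: t[1],
)
print(f"best split point: after layer {best_split} ({best_ms:.1f} ms)")
```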
arXiv Detail & Related papers (2023-10-24T07:49:56Z)
- Combining Multi-Objective Bayesian Optimization with Reinforcement Learning for TinyML [4.2019872499238256]
We propose a novel strategy for deploying Deep Neural Networks on microcontrollers (TinyML) based on Multi-Objective Bayesian optimization (MOBOpt).
Our methodology aims at efficiently finding tradeoffs between a DNN's predictive accuracy, memory consumption on a given target system, and computational complexity.
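A full MOBOpt loop also needs a surrogate model and an acquisition function; the sketch below shows only the Pareto-dominance test that defines those tradeoffs, over made-up (accuracy, memory, compute) tuples:

```python
# Made-up candidate deployments: (accuracy %, memory KB, MMACs).
candidates = {
    "cfg_a": (91.2, 310, 58),
    "cfg_b": (89.5, 140, 31),
    "cfg_c": (90.8, 400, 77),
    "cfg_d": (88.1, 120, 29),
    "cfg_e": (91.0, 150, 60),
}

def dominates(p, q):
    """p dominates q: no worse on every objective (higher accuracy,
    lower memory, lower compute) and strictly better on at least one."""
    no_worse = p[0] >= q[0] and p[1] <= q[1] and p[2] <= q[2]
    strictly = p[0] > q[0] or p[1] < q[1] or p[2] < q[2]
    return no_worse and strictly

pareto = [
    name for name, p in candidates.items()
    if not any(dominates(q, p) for q in candidates.values())
]
print("Pareto-optimal configs:", pareto)  # ['cfg_a', 'cfg_b', 'cfg_d', 'cfg_e']
```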
arXiv Detail & Related papers (2023-05-23T14:31:52Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
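The core computation behind such receptive-field-based segmentation is mapping each device's slice of the output back to the input region its block of layers needs. A minimal 1-D sketch with hypothetical layer shapes:

```python
# Hypothetical fused block: (kernel, stride, padding) for each conv layer.
layers = [(3, 1, 1), (3, 2, 1), (3, 1, 1)]

def input_range(out_lo: int, out_hi: int):
    """Back-propagate an output row range through the block to find the input
    rows it depends on (1-D receptive-field computation). Negative indices
    fall in the zero-padding region and can be clamped to 0."""
    lo, hi = out_lo, out_hi
    for k, s, p in reversed(layers):
        lo = lo * s - p
        hi = hi * s - p + (k - 1)
    return lo, hi

# Split 56 output rows across 2 devices; each needs an overlapping input slice.
out_rows, devices = 56, 2
per_dev = out_rows // devices
for d in range(devices):
    lo, hi = input_range(d * per_dev, (d + 1) * per_dev - 1)
    print(f"device {d}: output rows [{d * per_dev}, {(d + 1) * per_dev - 1}] "
          f"need input rows [{max(lo, 0)}, {hi}]")
```

The overlap between the devices' input slices is exactly the redundant data that fused-layer partitioning trades for avoiding per-layer communication.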
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete (SAC-d), which generates the exit point, partition point, and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Communication-Efficient Separable Neural Network for Distributed Inference on Edge Devices [2.28438857884398]
We propose a novel method of exploiting model parallelism to separate a neural network for distributed inference.
Under proper specifications of devices and configurations of models, our experiments show that the inference of large neural networks on edge clusters can be distributed and accelerated.
arXiv Detail & Related papers (2021-11-03T19:30:28Z)
- Computational Intelligence and Deep Learning for Next-Generation Edge-Enabled Industrial IoT [51.68933585002123]
We investigate how to deploy computational intelligence and deep learning (DL) in edge-enabled industrial IoT networks.
In this paper, we propose a novel multi-exit-based federated edge learning (ME-FEEL) framework.
In particular, the proposed ME-FEEL can achieve an accuracy gain of up to 32.7% in industrial IoT networks with severely limited resources.
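The sketch below shows only the generic early-exit rule that multi-exit models rely on at inference time, with made-up logits and threshold; ME-FEEL's federated training and exit aggregation are not modeled:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up logits from three exit heads of increasing depth (and cost);
# deeper heads tend to be more confident, modeled here by larger scales.
exit_logits = [rng.normal(size=10) * scale for scale in (1.0, 2.0, 4.0)]
CONF_THRESHOLD = 0.6

for head, logits in enumerate(exit_logits):
    probs = softmax(logits)
    # Stop at the first sufficiently confident head; always exit at the last.
    if probs.max() >= CONF_THRESHOLD or head == len(exit_logits) - 1:
        print(f"exited at head {head}: class {probs.argmax()}, "
              f"confidence {probs.max():.2f}")
        break
```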
arXiv Detail & Related papers (2021-10-28T08:14:57Z)
- Deep Learning-based Resource Allocation For Device-to-Device Communication [66.74874646973593]
We propose a framework for the optimization of the resource allocation in multi-channel cellular systems with device-to-device (D2D) communication.
A deep learning (DL) framework is proposed, where the optimal resource allocation strategy for arbitrary channel conditions is approximated by deep neural network (DNN) models.
Our simulation results confirm that near-optimal performance can be attained with low computation time, which underlines the real-time capability of the proposed scheme.
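A minimal sketch of that general pattern (not the paper's model or loss): train a small DNN offline to imitate an allocation rule, so that at run time allocation is a single cheap forward pass; the stand-in rule and network are assumptions:

```python
import torch
import torch.nn as nn

def target_alloc(h: torch.Tensor) -> torch.Tensor:
    # Stand-in "optimal" rule (power proportional to channel gain), used only
    # to illustrate imitating an allocation policy with a DNN.
    return h / h.sum(dim=1, keepdim=True)

net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 4),
                    nn.Softmax(dim=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(500):                  # offline: learn the allocation rule
    h = torch.rand(64, 4) + 0.1          # random channel conditions
    loss = nn.functional.mse_loss(net(h), target_alloc(h))
    opt.zero_grad()
    loss.backward()
    opt.step()

h_new = torch.rand(1, 4) + 0.1           # online: one cheap forward pass
print("allocation:", net(h_new).detach().numpy().round(3))
```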
arXiv Detail & Related papers (2020-11-25T14:19:23Z)
- Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
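As a toy illustration of partitioned parameter-server training, the sketch below lets each worker update one block of the global parameter vector per round; the quadratic loss and step size are stand-ins, and PARTEL's wireless resource allocation is not modeled:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, num_workers = 8, 4
theta = rng.normal(size=dim)                       # global model at the server
blocks = np.array_split(np.arange(dim), num_workers)  # one block per worker

def block_gradient(block: np.ndarray, theta: np.ndarray) -> np.ndarray:
    # Stand-in gradient of a toy quadratic loss ||theta||^2 for this block.
    return 2.0 * theta[block]

for rnd in range(100):                 # one communication round per iteration
    for block in blocks:               # each worker updates only its partition
        theta[block] -= 0.05 * block_gradient(block, theta)

print("parameters after 100 rounds:", np.round(theta, 3))  # -> ~all zeros
```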
arXiv Detail & Related papers (2020-10-08T15:27:50Z)