SwapNet: Efficient Swapping for DNN Inference on Edge AI Devices Beyond
the Memory Budget
- URL: http://arxiv.org/abs/2401.16757v1
- Date: Tue, 30 Jan 2024 05:29:49 GMT
- Title: SwapNet: Efficient Swapping for DNN Inference on Edge AI Devices Beyond
the Memory Budget
- Authors: Kun Wang, Jiani Cao, Zimu Zhou and Zhenjiang Li
- Abstract summary: Deep neural networks (DNNs) on edge artificial intelligence (AI) devices enable various autonomous mobile computing applications.
Existing solutions, such as model compression or cloud offloading, reduce the memory footprint of DNN inference.
We develop SwapNet, an efficient block swapping ecosystem for edge AI devices.
- Score: 18.63754969602021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Executing deep neural networks (DNNs) on edge artificial intelligence (AI)
devices enables various autonomous mobile computing applications. However, the
memory budget of edge AI devices restricts the number and complexity of DNNs
allowed in such applications. Existing solutions, such as model compression or
cloud offloading, reduce the memory footprint of DNN inference at the cost of
decreased model accuracy or autonomy. To avoid these drawbacks, we divide DNN
into blocks and swap them in and out in order, such that large DNNs can execute
within a small memory budget. Nevertheless, naive swapping on edge AI devices
induces significant delays due to the redundant memory operations in the DNN
development ecosystem for edge AI devices. To this end, we develop SwapNet, an
efficient DNN block swapping middleware for edge AI devices. We systematically
eliminate the unnecessary memory operations during block swapping while
retaining compatible with the deep learning frameworks, GPU backends, and
hardware architectures of edge AI devices. We further showcase the utility of
SwapNet via a multi-DNN scheduling scheme. Evaluations on eleven DNN inference
tasks in three applications demonstrate that SwapNet achieves almost the same
latency as the case with sufficient memory even when DNNs demand 2.32x to 5.81x
memory beyond the available budget. The design of SwapNet also provides novel
and feasible insights for deploying large language models (LLMs) on edge AI
devices in the future.
Related papers
- MatchNAS: Optimizing Edge AI in Sparse-Label Data Contexts via
Automating Deep Neural Network Porting for Mobile Deployment [54.77943671991863]
MatchNAS is a novel scheme for porting Deep Neural Networks to mobile devices.
We optimise a large network family using both labelled and unlabelled data.
We then automatically search for tailored networks for different hardware platforms.
arXiv Detail & Related papers (2024-02-21T04:43:12Z) - Enabling Deep Learning on Edge Devices [2.741266294612776]
Deep neural networks (DNNs) have succeeded in many different perception tasks, e.g., computer vision, natural language processing, reinforcement learning, etc.
The high-performed DNNs heavily rely on intensive resource consumption.
Recently, some new emerging intelligent applications, e.g., AR/VR, mobile assistants, Internet of Things, require us to deploy DNNs on resource-constrained edge devices.
In this dissertation, we studied four edge intelligence scenarios, i.e., Inference on Edge Devices, Adaptation on Edge Devices, Learning on Edge Devices, and Edge-Server Systems
arXiv Detail & Related papers (2022-10-06T20:52:57Z) - An efficient and flexible inference system for serving heterogeneous
ensembles of deep neural networks [0.0]
Ensembles of Deep Neural Networks (DNNs) have achieved qualitative predictions but they are computing and memory intensive.
We propose a new software layer to serve with flexibility and efficiency ensembles of DNNs.
arXiv Detail & Related papers (2022-08-30T08:05:43Z) - Sparsifying Binary Networks [3.8350038566047426]
Binary neural networks (BNNs) have demonstrated their ability to solve complex tasks with comparable accuracy as full-precision deep neural networks (DNNs)
Despite the recent improvements, they suffer from a fixed and limited compression factor that may result insufficient for certain devices with very limited resources.
We propose sparse binary neural networks (SBNNs), a novel model and training scheme which introduces sparsity in BNNs and a new quantization function for binarizing the network's weights.
arXiv Detail & Related papers (2022-07-11T15:54:41Z) - Dynamic Split Computing for Efficient Deep Edge Intelligence [78.4233915447056]
We introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel.
We show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time.
arXiv Detail & Related papers (2022-05-23T12:35:18Z) - Training High-Performance Low-Latency Spiking Neural Networks by
Differentiation on Spike Representation [70.75043144299168]
Spiking Neural Network (SNN) is a promising energy-efficient AI model when implemented on neuromorphic hardware.
It is a challenge to efficiently train SNNs due to their non-differentiability.
We propose the Differentiation on Spike Representation (DSR) method, which could achieve high performance.
arXiv Detail & Related papers (2022-05-01T12:44:49Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using -1, +1 to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - Dynamic DNN Decomposition for Lossless Synergistic Inference [0.9549013615433989]
Deep neural networks (DNNs) sustain high performance in today's data processing applications.
We propose D3, a dynamic DNN decomposition system for synergistic inference without precision loss.
D3 outperforms the state-of-the-art counterparts up to 3.4 times in end-to-end DNN inference time and reduces backbone network communication overhead up to 3.68 times.
arXiv Detail & Related papers (2021-01-15T03:18:53Z) - Towards Real-Time DNN Inference on Mobile Platforms with Model Pruning
and Compiler Optimization [56.3111706960878]
High-end mobile platforms serve as primary computing devices for a wide range of Deep Neural Network (DNN) applications.
constrained computation and storage resources on these devices pose significant challenges for real-time inference executions.
We propose a set of hardware-friendly structured model pruning and compiler optimization techniques to accelerate DNN executions on mobile devices.
arXiv Detail & Related papers (2020-04-22T03:18:23Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.