Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural
Networks on Edge NPUs
- URL: http://arxiv.org/abs/2209.13443v2
- Date: Fri, 4 Aug 2023 21:29:07 GMT
- Title: Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural
Networks on Edge NPUs
- Authors: Alexandros Kouris, Stylianos I. Venieris, Stefanos Laskaridis,
Nicholas D. Lane
- Abstract summary: "smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
- Score: 74.83613252825754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With deep neural networks (DNNs) emerging as the backbone in a multitude of
computer vision tasks, their adoption in real-world applications broadens
continuously. Given the abundance and omnipresence of smart devices in the
consumer landscape, "smart ecosystems'' are being formed where sensing happens
concurrently rather than standalone. This is shifting the on-device inference
paradigm towards deploying centralised neural processing units (NPUs) at the
edge, where multiple devices (e.g. in smart homes or autonomous vehicles) can
stream their data for processing with dynamic rates. While this provides
enhanced potential for input batching, naive solutions can lead to subpar
performance and quality of experience, especially under spiking loads. At the
same time, the deployment of dynamic DNNs, comprising stochastic computation
graphs (e.g. early-exit (EE) models), introduces a new dimension of dynamic
behaviour in such systems. In this work, we propose a novel early-exit-aware
scheduling algorithm that allows sample preemption at run time, to account for
the dynamicity introduced both by the arrival and early-exiting processes. At
the same time, we introduce two novel dimensions to the design space of the NPU
hardware architecture, namely Fluid Batching and Stackable Processing Elements,
that enable run-time adaptability to different batch sizes and significantly
improve the NPU utilisation even at small batches. Our evaluation shows that
the proposed system achieves an average 1.97x and 6.7x improvement over
state-of-the-art DNN streaming systems in terms of average latency and tail
latency service-level objective (SLO) satisfaction, respectively.
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - EvSegSNN: Neuromorphic Semantic Segmentation for Event Data [0.6138671548064356]
EvSegSNN is a biologically plausible encoder-decoder U-shaped architecture relying on Parametric Leaky Integrate and Fire neurons.
We introduce an end-to-end biologically inspired semantic segmentation approach by combining Spiking Neural Networks with event cameras.
Experiments conducted on DDD17 demonstrate that EvSegSNN outperforms the closest state-of-the-art model in terms of MIoU.
arXiv Detail & Related papers (2024-06-20T10:36:24Z) - Adaptive Robotic Arm Control with a Spiking Recurrent Neural Network on a Digital Accelerator [41.60361484397962]
We present an overview of the system, and a Python framework to use it on a Pynq ZU platform.
We show how the simulated accuracy is preserved with a peak performance of 3.8M events processed per second.
arXiv Detail & Related papers (2024-05-21T14:59:39Z) - DYNAP-SE2: a scalable multi-core dynamic neuromorphic asynchronous
spiking neural network processor [2.9175555050594975]
We present a brain-inspired platform for prototyping real-time event-based Spiking Neural Networks (SNNs)
The system proposed supports the direct emulation of dynamic and realistic neural processing phenomena such as short-term plasticity, NMDA gating, AMPA diffusion, homeostasis, spike frequency adaptation, conductance-based dendritic compartments and spike transmission delays.
The flexibility to emulate different biologically plausible neural networks, and the chip's ability to monitor both population and single neuron signals in real-time, allow to develop and validate complex models of neural processing for both basic research and edge-computing applications.
arXiv Detail & Related papers (2023-10-01T03:48:16Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks.
specially trained CNNs that employ parametrised early exits along their depth to save during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z) - HAPI: Hardware-Aware Progressive Inference [18.214367595727037]
Convolutional neural networks (CNNs) have recently become the state-of-the-art in a diversity of AI tasks.
Despite their popularity, CNN inference still comes at a high computational cost.
This work presents HAPI, a novel methodology for generating high-performance early-exit networks.
arXiv Detail & Related papers (2020-08-10T09:55:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.