HAPI: Hardware-Aware Progressive Inference
- URL: http://arxiv.org/abs/2008.03997v1
- Date: Mon, 10 Aug 2020 09:55:18 GMT
- Title: HAPI: Hardware-Aware Progressive Inference
- Authors: Stefanos Laskaridis, Stylianos I. Venieris, Hyeji Kim and Nicholas D. Lane
- Abstract summary: Convolutional neural networks (CNNs) have recently become the state-of-the-art in a diversity of AI tasks.
Despite their popularity, CNN inference still comes at a high computational cost.
This work presents HAPI, a novel methodology for generating high-performance early-exit networks.
- Score: 18.214367595727037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks (CNNs) have recently become the
state-of-the-art in a diversity of AI tasks. Despite their popularity, CNN
inference still comes at a high computational cost. A growing body of work aims
to alleviate this by exploiting the difference in the classification difficulty
among samples and early-exiting at different stages of the network.
Nevertheless, existing studies on early exiting have primarily focused on the
training scheme, without considering the use-case requirements or the
deployment platform. This work presents HAPI, a novel methodology for
generating high-performance early-exit networks by co-optimising the placement
of intermediate exits together with the early-exit strategy at inference time.
Furthermore, we propose an efficient design space exploration algorithm which
enables the faster traversal of a large number of alternative architectures and
generates the highest-performing design, tailored to the use-case requirements
and target hardware. Quantitative evaluation shows that our system consistently
outperforms alternative search mechanisms and state-of-the-art early-exit
schemes across various latency budgets. Moreover, it pushes further the
performance of highly optimised hand-crafted early-exit CNNs, delivering up to
5.11x speedup over lightweight models on imposed latency-driven SLAs for
embedded devices.
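The early-exit idea described in the abstract can be sketched as follows. This is a minimal illustration, not HAPI's actual method: the abstract does not specify the exit policy, so the sketch assumes a common one (exit as soon as the top-1 softmax confidence reaches a threshold). The names `early_exit_predict`, `stages`, `heads` and the toy backbone are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    sample: Vector,
    stages: Sequence[Callable[[Vector], Vector]],
    exit_heads: Sequence[Callable[[Vector], Vector]],
    threshold: float,
) -> Tuple[int, int]:
    """Return (predicted_class, exit_index).

    Assumed policy: after each backbone stage, the attached exit head
    produces logits; inference stops once the top-1 softmax probability
    reaches `threshold`. The last exit always returns a prediction.
    """
    features = sample
    for index, (stage, head) in enumerate(zip(stages, exit_heads)):
        features = stage(features)
        probs = softmax(head(features))
        confidence = max(probs)
        is_last = index == len(stages) - 1
        if confidence >= threshold or is_last:
            return probs.index(confidence), index
    raise ValueError("at least one stage is required")


# Toy backbone (hypothetical): each stage doubles the features, so the
# logits become more separated, and more confident, with depth.
stages = [lambda x: [2.0 * v for v in x]] * 3
heads = [lambda x: list(x)] * 3

easy = early_exit_predict([2.0, 0.0], stages, heads, threshold=0.9)
hard = early_exit_predict([0.0, 0.5], stages, heads, threshold=0.9)
print(easy)  # (0, 0): confident at the first exit
print(hard)  # (1, 2): needs the full network
```

In this toy setting the "easy" sample leaves at the first exit while the "hard" one traverses all three stages, which is the per-sample saving that early-exit networks exploit. Choosing where to place the exits and what threshold to use for a given device and latency budget is the co-optimisation problem the abstract describes.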
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs [74.83613252825754]
"Smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
- FreeREA: Training-Free Evolution-based Architecture Search [17.202375422110553]
FreeREA is a custom cell-based evolution NAS algorithm that exploits an optimised combination of training-free metrics to rank architectures.
Our experiments, carried out on the common benchmarks NAS-Bench-101 and NATS-Bench, demonstrate that FreeREA is a fast, efficient, and effective search method for automatic model design.
arXiv Detail & Related papers (2022-06-17T11:16:28Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- Scaled-Time-Attention Robust Edge Network [2.4417312983418014]
This paper describes a systematic approach towards building a new family of neural networks based on a delay-loop version of a reservoir neural network.
The resulting architecture, called Scaled-Time-Attention Robust Edge (STARE) network, exploits hyper dimensional space and non-multiply-and-add computation.
We demonstrate that STARE is applicable to a variety of applications with improved performance and lower implementation complexity.
arXiv Detail & Related papers (2021-07-09T21:24:49Z)
- How to Reach Real-Time AI on Consumer Devices? Solutions for Programmable and Custom Architectures [7.085772863979686]
Deep neural networks (DNNs) have led to large strides in various Artificial Intelligence (AI) inference tasks, such as object and speech recognition.
Deploying such AI models across commodity devices faces significant challenges.
We present techniques for achieving real-time performance following a cross-stack approach.
arXiv Detail & Related papers (2021-06-21T11:23:12Z)
- Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions [80.78077900288868]
We decompose the design methodology of early-exit networks to its key components and survey the recent advances in each one of them.
We position early-exiting against other efficient inference solutions and provide our insights on the current challenges and most promising future directions for research in the field.
arXiv Detail & Related papers (2021-06-09T12:33:02Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search [94.80212602202518]
We propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS).
We employ a one-shot architecture search approach in order to obtain a reduced search cost.
We achieve state-of-the-art results in terms of accuracy-speed trade-off.
arXiv Detail & Related papers (2020-09-29T11:56:01Z)
- Automated Design Space Exploration for optimised Deployment of DNN on Arm Cortex-A CPUs [13.628734116014819]
Deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNNs).
There is a lack of research on cross-level optimisation as the space of approaches becomes too large to test and obtain a globally optimised solution.
We present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms achieving up to 4x improvement in performance and over 2x reduction in memory.
arXiv Detail & Related papers (2020-06-09T11:00:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.