Dynamic DNNs and Runtime Management for Efficient Inference on
Mobile/Embedded Devices
- URL: http://arxiv.org/abs/2401.08965v1
- Date: Wed, 17 Jan 2024 04:40:30 GMT
- Title: Dynamic DNNs and Runtime Management for Efficient Inference on
Mobile/Embedded Devices
- Authors: Lei Xun, Jonathon Hare, Geoff V. Merrett
- Abstract summary: Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms.
We co-designed novel Dynamic Super-Networks to maximise system-level performance and energy efficiency.
Compared with SOTA, our experimental results using ImageNet on the GPU of Jetson Xavier NX show our model is 2.4x faster for similar ImageNet Top-1 accuracy, or 5.1% higher accuracy at similar latency.
- Score: 2.8851756275902476
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural network (DNN) inference is increasingly being executed on mobile
and embedded platforms due to several key advantages in latency, privacy and
always-on availability. However, due to limited computing resources, efficient
DNN deployment on mobile and embedded platforms is challenging. Although many
hardware accelerators and static model compression methods have been proposed
in previous work, multiple applications typically execute concurrently at
system runtime and compete for hardware resources. This raises two main
challenges: Runtime Hardware Availability and Runtime Application Variability.
Previous works have addressed these challenges through either dynamic neural
networks that contain sub-networks with different performance trade-offs or
runtime hardware resource management. In this thesis, we propose a combined
method: a system for DNN performance trade-off management that exploits the
runtime trade-off opportunities in both algorithms and hardware to meet
dynamically changing application performance targets and hardware
constraints in real time. We co-designed novel Dynamic Super-Networks to
maximise runtime system-level performance and energy efficiency on
heterogeneous hardware platforms. Compared with SOTA, our experimental results
using ImageNet on the GPU of Jetson Xavier NX show our model is 2.4x faster for
similar ImageNet Top-1 accuracy, or 5.1% higher accuracy at similar latency. We
also designed a hierarchical runtime resource manager that tunes both dynamic
neural networks and DVFS at runtime. Compared with the Linux DVFS governor
schedutil, our runtime approach achieves up to a 19% energy reduction and a 9%
latency reduction in a single-model deployment scenario, and an 89% energy
reduction and a 23% latency reduction in a two-concurrent-model deployment
scenario.
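To make the combined algorithm/hardware search space concrete, the following is a minimal Python sketch of the decision such a runtime manager has to make: among profiled (sub-network, DVFS frequency) operating points, pick the pair that meets the current latency target at the highest accuracy and lowest energy. All sub-network names, frequencies, and measurements are illustrative placeholders, not the thesis's implementation.

```python
# Hedged sketch of runtime trade-off management: all names and
# numbers below are hypothetical placeholders, not measured results.

# Profiled operating points: (sub-network, GPU frequency in Hz)
#   -> (latency in ms, energy per inference in mJ, Top-1 accuracy in %).
PROFILE = {
    ("subnet_small", 510_000_000):   (9.5,  38.0, 73.1),
    ("subnet_small", 1_109_250_000): (4.8,  61.0, 73.1),
    ("subnet_large", 510_000_000):   (21.0, 85.0, 79.8),
    ("subnet_large", 1_109_250_000): (10.2, 131.0, 79.8),
}

def pick_operating_point(latency_target_ms: float):
    """Pick the (sub-network, frequency) pair that meets the latency
    target with the highest accuracy, breaking ties by lower energy."""
    feasible = [(cfg, m) for cfg, m in PROFILE.items() if m[0] <= latency_target_ms]
    if not feasible:
        # Target unreachable under current contention: fall back to the fastest point.
        return min(PROFILE.items(), key=lambda kv: kv[1][0])[0]
    return max(feasible, key=lambda kv: (kv[1][2], -kv[1][1]))[0]

subnet, freq = pick_operating_point(latency_target_ms=12.0)
print(f"switch to {subnet} at {freq / 1e6:.0f} MHz")
```

A real manager would re-run this selection whenever the application target or hardware availability changes, and apply the chosen frequency through the platform's DVFS interface.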
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload on both edge devices and data centers.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
- HADAS: Hardware-Aware Dynamic Neural Architecture Search for Edge Performance Scaling [8.29394286023338]
Dynamic neural networks (DyNNs) have become viable techniques to enable intelligence on resource-constrained edge devices.
In many cases, DyNN implementations are sub-optimal because the underlying backbone architecture was developed independently at the design stage.
We present HADAS, a novel Hardware-Aware Dynamic Neural Architecture Search framework that realizes DyNN architectures whose backbone, early exiting features, and DVFS settings have been jointly optimized.
arXiv Detail & Related papers (2022-12-06T22:27:00Z)
- Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs [74.83613252825754]
"smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling approach that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes; a simplified scheduling sketch appears after this list.
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices part of the network parameters for inputs of diverse difficulty levels.
We present the dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++), which input-dependently adjust the filter numbers of CNNs and multiple dimensions in both CNNs and transformers; a minimal weight-slicing sketch appears after this list.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Incremental Training and Group Convolution Pruning for Runtime DNN Performance Scaling on Heterogeneous Embedded Platforms [23.00896228073755]
Inference for Deep Neural Networks is increasingly being executed locally on mobile and embedded platforms.
In this paper, we present a dynamic DNN using incremental training and group convolution pruning.
Combined with task mapping and DVFS, it achieved a 10.6x (energy) and 41.6x (time) wider dynamic range.
arXiv Detail & Related papers (2021-05-08T05:38:01Z)
- Dynamic-OFA: Runtime DNN Architecture Switching for Performance Scaling on Heterogeneous Embedded Platforms [3.3197851873862385]
This paper proposes Dynamic-OFA, a novel dynamic DNN approach for state-of-the-art platform-aware NAS models (i.e., the Once-for-all (OFA) network).
Compared to the state-of-the-art, our experimental results using ImageNet on a Jetson Xavier NX show that the approach is up to 3.5x faster for similar ImageNet Top-1 accuracy.
arXiv Detail & Related papers (2021-05-08T05:10:53Z)
- Dynamic Slimmable Network [105.74546828182834]
We develop a dynamic network slimming regime named Dynamic Slimmable Network (DS-Net).
Our DS-Net is empowered with the ability of dynamic inference by the proposed double-headed dynamic gate.
It consistently outperforms its static counterparts as well as state-of-the-art static and dynamic model compression methods.
arXiv Detail & Related papers (2021-03-24T15:25:20Z)
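The DS-Net, DS-Net++, and Dynamic-OFA entries above all hinge on the same mechanism: executing only a slice of a layer's weights, so that one set of parameters serves many sub-network widths. Below is a minimal PyTorch sketch of that idea for a single convolution; the class name and slicing interface are assumptions for illustration, not the papers' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceableConv2d(nn.Conv2d):
    """Convolution whose output width can be sliced at run time.

    Running only the first `out_ch` filters realises a narrower
    sub-network that shares weights with the full model -- the core
    idea behind slimmable/sliceable networks.
    """

    def forward(self, x, out_ch=None):
        out_ch = out_ch or self.out_channels
        in_ch = x.shape[1]  # accept inputs already sliced by an earlier layer
        weight = self.weight[:out_ch, :in_ch]
        bias = self.bias[:out_ch] if self.bias is not None else None
        return F.conv2d(x, weight, bias, self.stride, self.padding,
                        self.dilation, self.groups)

conv = SliceableConv2d(64, 128, kernel_size=3, padding=1)
x = torch.randn(1, 64, 32, 32)
print(conv(x).shape)              # full width: torch.Size([1, 128, 32, 32])
print(conv(x, out_ch=64).shape)   # half width: torch.Size([1, 64, 32, 32])
```

In DS-Net the slice width is chosen per input by a learned gate; in Dynamic-OFA it is chosen by a runtime manager tracking latency and accuracy targets.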
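The Fluid Batching entry, in turn, concerns scheduling requests through an early-exit network whose execution can be preempted at exit points. Here is a deliberately simplified sketch of what exit-aware preemptive scheduling can look like, assuming earliest-deadline-first at stage boundaries with made-up stage latencies and request parameters; the paper's actual policy is more elaborate.

```python
import heapq

# Hypothetical per-stage latencies (ms) of an early-exit network:
# a request may leave at any exit, and the scheduler may preempt
# a running request at every stage boundary.
STAGE_MS = [2.0, 3.0, 4.0]  # stage i ends at exit i

def schedule(requests):
    """Exit-aware preemptive scheduling sketch: at each stage boundary,
    run the next stage of the request with the earliest deadline.
    `requests` maps name -> (deadline_ms, exit_index)."""
    now = 0.0
    progress = {r: 0 for r in requests}          # stages completed so far
    ready = [(d, r) for r, (d, _) in requests.items()]
    heapq.heapify(ready)
    while ready:
        deadline, r = heapq.heappop(ready)
        stage = progress[r]
        now += STAGE_MS[stage]
        progress[r] += 1
        print(f"{now:5.1f} ms: {r} finished stage {stage}")
        if progress[r] <= requests[r][1]:        # not yet at its exit
            heapq.heappush(ready, (deadline, r)) # preemptible: requeue
        else:
            status = "late" if now > deadline else "on time"
            print(f"        {r} exits ({status})")

schedule({"req_a": (6.0, 0), "req_b": (20.0, 2)})
```

Requeueing at every stage boundary is what makes preemption possible: a newly arrived urgent request with a tighter deadline would be served before the next stage of an in-flight one.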