CacheNet: A Model Caching Framework for Deep Learning Inference on the Edge
- URL: http://arxiv.org/abs/2007.01793v1
- Date: Fri, 3 Jul 2020 16:32:14 GMT
- Title: CacheNet: A Model Caching Framework for Deep Learning Inference on the Edge
- Authors: Yihao Fang, Shervin Manzuri Shalmani, and Rong Zheng
- Abstract summary: CacheNet is a model caching framework for machine perception applications.
It caches low-complexity models on end devices and high-complexity (or full) models on edge or cloud servers.
It is 58-217% faster than baseline approaches that run inference tasks on end devices or edge servers alone.
- Score: 3.398008512297358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of deep neural networks (DNN) in machine perception applications such as image classification and speech recognition comes at the cost of high computation and storage complexity. Inference with uncompressed large-scale DNN models can only run in the cloud, with extra communication latency back and forth between the cloud and end devices, while compressed DNN models achieve real-time inference on end devices at the price of lower predictive accuracy. In order to have the best of both worlds (latency and accuracy), we propose CacheNet, a model caching framework. CacheNet caches low-complexity models on end devices and high-complexity (or full) models on edge or cloud servers. By exploiting temporal locality in streaming data, a high cache hit rate, and consequently shorter latency, can be achieved with no or only a marginal decrease in prediction accuracy. Experiments on CIFAR-10 and FVG show that CacheNet is 58-217% faster than baseline approaches that run inference tasks on end devices or edge servers alone.
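The abstract describes the serving pattern but not the hit/miss test. A minimal sketch in Python, assuming a confidence threshold as a stand-in for the cache-miss detector (the function names and threshold are illustrative, not CacheNet's actual interface):

```python
import numpy as np

CONF_THRESHOLD = 0.9  # assumed miss criterion; CacheNet's actual test may differ

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cached_inference(x, cached_model, full_model_rpc):
    """Serve from the low-complexity model cached on the end device;
    on a miss (low confidence, used here as a stand-in), fall back to
    the high-complexity model on the edge/cloud server."""
    probs = softmax(cached_model(x))
    if probs.max() >= CONF_THRESHOLD:
        return int(probs.argmax())      # cache hit: answered on-device
    return full_model_rpc(x)            # cache miss: offload to full model
```

Because consecutive frames in a stream tend to look alike (temporal locality), most requests hit the on-device model and skip the network round-trip entirely.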
Related papers
- A Converting Autoencoder Toward Low-latency and Energy-efficient DNN Inference at the Edge [4.11949030493552]
We present CBNet, a low-latency and energy-efficient deep neural network (DNN) inference framework tailored for edge devices.
It utilizes a "converting" autoencoder to efficiently transform hard images into easy ones.
CBNet achieves up to 4.8x speedup in inference latency and 79% reduction in energy usage compared to competing techniques.
arXiv Detail & Related papers (2024-03-11T08:13:42Z)
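CBNet's summarized pipeline, a "converting" autoencoder in front of a lightweight classifier, can be sketched roughly as follows; the module sizes and toy input are assumptions, not CBNet's architecture:

```python
import torch
import torch.nn as nn

class ConvertingAutoencoder(nn.Module):
    """Toy stand-in: maps a 'hard' image to an 'easy' reconstruction
    that a small downstream classifier handles more reliably."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

converter = ConvertingAutoencoder()
classifier = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # lightweight head

x = torch.rand(1, 1, 28, 28)   # dummy 28x28 grayscale image
easy = converter(x)            # "convert" the hard input into an easy one
logits = classifier(easy)      # cheap inference on the converted image
```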
- DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS.
arXiv Detail & Related papers (2023-12-01T17:01:06Z)
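DeepCache's reuse of temporally redundant features across denoising steps can be sketched schematically; the shallow/deep split and refresh interval below are illustrative assumptions rather than DeepCache's exact schedule:

```python
def denoise_with_cache(x, shallow_step, deep_step, n_steps=50, refresh_every=5):
    """Recompute the expensive deep features only every few denoising
    steps, reusing the cached ones in between (temporal redundancy)."""
    cached_deep = None
    for t in reversed(range(n_steps)):
        h = shallow_step(x, t)                    # cheap shallow pass every step
        if cached_deep is None or t % refresh_every == 0:
            cached_deep = deep_step(h, t)         # full (expensive) deep pass
        x = h + cached_deep                       # toy combination of features
    return x
```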
- Deep Learning for Day Forecasts from Sparse Observations [60.041805328514876]
Deep neural networks offer an alternative paradigm for modeling weather conditions.
MetNet-3 learns from both dense and sparse data sensors and makes predictions up to 24 hours ahead for precipitation, wind, temperature and dew point.
MetNet-3 has high temporal and spatial resolution, up to 2 minutes and 1 km respectively, as well as low operational latency.
arXiv Detail & Related papers (2023-06-06T07:07:54Z)
- Attention-based Feature Compression for CNN Inference Offloading in Edge Computing [93.67044879636093]
This paper studies the computational offloading of CNN inference in device-edge co-inference systems.
We propose a novel autoencoder-based CNN architecture (AECNN) for effective feature extraction at the end device.
Experiments show that AECNN can compress the intermediate data by more than 256x with only about 4% accuracy loss.
arXiv Detail & Related papers (2022-11-24T18:10:01Z)
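A toy sketch of the feature-compression idea at the device-edge split, with mean absolute activation standing in for AECNN's learned attention scores (the 256x figure is not reproduced; everything below is an assumed simplification):

```python
import numpy as np

def compress_features(feats, keep=4):
    """feats: (C, H, W) activations at the device-edge split point.
    Transmit only the `keep` most important channels (mean |activation|
    stands in for a learned attention score), quantized to int8."""
    scores = np.abs(feats).mean(axis=(1, 2))       # per-channel importance
    top = np.argsort(scores)[-keep:]               # channel indices to send
    kept = feats[top]
    scale = max(np.abs(kept).max() / 127.0, 1e-8)  # int8 quantization scale
    return top, scale, np.round(kept / scale).astype(np.int8)

feats = np.random.randn(64, 16, 16).astype(np.float32)
idx, scale, payload = compress_features(feats)
print(feats.nbytes / payload.nbytes)   # ~64x here from pruning + int8
```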
- Streaming Video Analytics On The Edge With Asynchronous Cloud Support [2.7456483236562437]
We propose a novel edge-cloud fusion algorithm that fuses edge and cloud predictions, achieving low latency and high accuracy.
We focus on object detection in videos (applicable in many video analytics scenarios) and show that the fused edge-cloud predictions can outperform the accuracy of edge-only and cloud-only scenarios by as much as 50%.
arXiv Detail & Related papers (2022-10-04T06:22:13Z)
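The edge-cloud fusion idea, keep every fresh edge detection and add stale cloud detections the edge model missed, can be sketched as follows (the data structures and IoU threshold are assumptions; the paper's algorithm is more involved):

```python
def iou(a, b):
    """a, b: (x1, y1, x2, y2) boxes; intersection-over-union overlap."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse(edge_dets, cloud_dets, thr=0.5):
    """edge_dets/cloud_dets: lists of (box, label, score). Keep every
    fresh edge detection; add cloud detections that no edge box overlaps."""
    fused = list(edge_dets)
    for box, label, score in cloud_dets:
        if all(iou(box, e[0]) < thr for e in edge_dets):
            fused.append((box, label, score))
    return fused
```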
- Improving the Performance of DNN-based Software Services using Automated Layer Caching [3.804240190982695]
Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services.
The computational complexity of such large models can still be significant, hindering low-latency inference.
In this paper, we propose an end-to-end automated solution to improve the performance of DNN-based services.
arXiv Detail & Related papers (2022-09-18T18:21:20Z)
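One way automated layer caching can work is to probe a cache with hashes of intermediate activations and exit early on a hit; the exact-match scheme below is a simplified assumption, not the paper's method:

```python
import hashlib

def cached_forward(x, layers, cache):
    """layers: callables mapping bytes -> bytes (a toy stand-in for DNN
    layers). After each layer, probe the cache with the activation's
    hash; a hit short-circuits the remaining, deeper layers."""
    h, keys = x, []
    for i, layer in enumerate(layers):
        h = layer(h)
        key = (i, hashlib.sha256(h).hexdigest())
        if key in cache:
            return cache[key]        # early exit: deeper layers skipped
        keys.append(key)
    for key in keys:                 # memoize the final result at every depth
        cache[key] = h
    return h

cache = {}
layers = [lambda b: b.upper(), lambda b: b[::-1]]
print(cached_forward(b"hello", layers, cache))   # computes and fills the cache
print(cached_forward(b"hello", layers, cache))   # hits after the first layer
```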
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
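The latency- and accuracy-aware reward driving such a policy can be sketched as a scalar trade-off; the budget and weighting below are assumed, not the paper's reward function:

```python
def reward(latency_ms, accuracy, latency_budget_ms=50.0, weight=2.0):
    """Reward accuracy, penalize exceeding the latency budget. A policy
    (e.g., soft actor-critic over discrete exit-point/compression
    actions) is trained to maximize this signal."""
    latency_penalty = max(0.0, latency_ms - latency_budget_ms) / latency_budget_ms
    return accuracy - weight * latency_penalty
```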
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models and statistical models with those of the roofline model and a refined roofline model.
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
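The stacked-model idea, fit per-layer latency models from micro-kernel benchmarks and sum them over a network, might look like this (the linear latency models and coefficients are made up for illustration):

```python
# Assumed per-layer latency models fitted from micro-kernel benchmarks:
# latency_ms ~ a * MACs + b, a simple linear stand-in for ANNETTE's
# mixed/statistical models.
LAYER_MODELS = {
    "conv": lambda macs: 1.2e-7 * macs + 0.05,
    "fc":   lambda macs: 0.8e-7 * macs + 0.02,
}

def estimate_network_latency(layers):
    """layers: list of (kind, MAC count). Stack per-layer estimates
    into a whole-network execution time estimate."""
    return sum(LAYER_MODELS[kind](macs) for kind, macs in layers)

net = [("conv", 90_000_000), ("conv", 45_000_000), ("fc", 4_000_000)]
print(f"estimated latency: {estimate_network_latency(net):.2f} ms")
```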
- Accelerating Deep Learning Inference via Learned Caches [11.617579969991294]
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems.
Current low-latency solutions trade off accuracy or fail to exploit the inherent temporal locality in prediction serving workloads.
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency inference.
arXiv Detail & Related papers (2021-01-18T22:13:08Z)
- Calibration-Aided Edge Inference Offloading via Adaptive Model Partitioning of Deep Neural Networks [30.800324092046793]
Mobile devices can offload deep neural network (DNN)-based inference to the cloud, overcoming local hardware and energy limitations.
This work shows that the employment of a miscalibrated early-exit DNN for offloading via model partitioning can significantly decrease inference accuracy.
In contrast, we argue that implementing a calibration algorithm prior to deployment can solve this problem, allowing for more reliable offloading decisions.
arXiv Detail & Related papers (2020-10-30T15:50:12Z)
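Temperature scaling is a common calibration choice that fits this setting: rescale the early exit's logits before its confidence drives the offloading decision. A minimal sketch (the paper may use a different calibration algorithm):

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

def offload_decision(logits, temperature, conf_threshold=0.8):
    """Calibrated confidence at the early exit decides whether to stop
    locally or offload the remaining layers to the cloud."""
    conf = softmax(logits, T=temperature).max()
    return "stay_local" if conf >= conf_threshold else "offload"

# An overconfident (miscalibrated) exit offloads too rarely; with a fitted
# temperature T > 1, confidence drops and borderline inputs get offloaded.
logits = np.array([3.0, 0.5, 0.1])
print(offload_decision(logits, temperature=1.0))   # stay_local
print(offload_decision(logits, temperature=2.5))   # offload
```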
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
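Pattern-based pruning constrains each 3x3 kernel to one of a few sparsity patterns so the compiler can specialize code per pattern. A toy sketch with an assumed two-pattern set (PatDNN's actual pattern library and selection rule differ):

```python
import numpy as np

# Assumed candidate patterns: boolean 3x3 masks, each keeping 4 weights.
PATTERNS = [
    np.array([[0, 1, 0], [1, 1, 1], [0, 0, 0]], dtype=bool),
    np.array([[0, 0, 0], [1, 1, 1], [0, 1, 0]], dtype=bool),
]

def prune_kernel(k):
    """Pick the pattern preserving the most weight magnitude, then zero
    the rest; every kernel ends up in a compiler-friendly shape."""
    best = max(PATTERNS, key=lambda m: np.abs(k[m]).sum())
    return k * best

kernel = np.random.randn(3, 3)
print(prune_kernel(kernel))
```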