Accelerating Deep Learning Inference via Freezing
- URL: http://arxiv.org/abs/2002.02645v1
- Date: Fri, 7 Feb 2020 07:03:58 GMT
- Title: Accelerating Deep Learning Inference via Freezing
- Authors: Adarsh Kumar, Arjun Balasubramanian, Shivaram Venkataraman, Aditya
Akella
- Abstract summary: We present Freeze Inference, a system that introduces approximate caching at each intermediate layer.
We find that this can potentially reduce the number of effective layers by half for 91.58% of CIFAR-10 requests run on ResNet-18.
- Score: 8.521443408415868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over the last few years, Deep Neural Networks (DNNs) have become ubiquitous
owing to their high accuracy on real-world tasks. However, this increase in
accuracy comes at the cost of computationally expensive models leading to
higher prediction latencies. Prior efforts to reduce this latency, such as
quantization, model distillation, and any-time prediction models, typically
trade off accuracy for performance. In this work, we observe that caching
intermediate layer outputs can help us avoid running all the layers of a DNN
for a sizeable fraction of inference requests. We find that this can
potentially reduce the number of effective layers by half for 91.58% of
CIFAR-10 requests run on ResNet-18. We present Freeze Inference, a system that
introduces approximate caching at each intermediate layer and we discuss
techniques to reduce the cache size and improve the cache hit rate. Finally, we
discuss some of the open research challenges in realizing such a design.
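To make the caching idea concrete, here is a minimal sketch (not the authors' implementation) of approximate caching at intermediate layers; the sign-pattern bucketing and per-layer dictionaries are illustrative assumptions:

```python
import numpy as np

class FreezeInferenceSketch:
    """Toy model runner with an approximate cache at every layer."""

    def __init__(self, layers, num_buckets=2**16):
        self.layers = layers                      # list of callables, one per layer
        self.caches = [dict() for _ in layers]    # one approximate cache per layer
        self.num_buckets = num_buckets

    def _key(self, activation):
        # Bucket activations by their sign pattern so that similar
        # activations collide (a stand-in for approximate matching).
        bits = (activation.ravel() > 0).astype(np.uint8)
        return hash(bits.tobytes()) % self.num_buckets

    def infer(self, x):
        h, keys = x, []
        for i, layer in enumerate(self.layers):
            h = layer(h)
            key = self._key(h)
            if key in self.caches[i]:
                return self.caches[i][key]        # hit: skip the remaining layers
            keys.append(key)
        label = int(np.argmax(h))                 # miss: ran the full network
        for i, key in enumerate(keys):            # populate caches for later requests
            self.caches[i][key] = label
        return label

# Example with two toy layers: the second identical request exits after layer 1.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((16, 8)), rng.standard_normal((10, 16))
model = FreezeInferenceSketch([lambda v: np.maximum(W1 @ v, 0), lambda v: W2 @ v])
x = rng.standard_normal(8)
print(model.infer(x), model.infer(x))
```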
Related papers
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
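A rough illustration of the dynamic caching decision follows (the per-layer scores, threshold, and direct output reuse are assumptions; the paper learns its caching policy, and real transformer layers are residual):

```python
def cached_forward(layers, x, step_cache, recompute_score, threshold=0.5):
    """Run one denoising step, reusing a layer's output from the previous
    step whenever its learned recompute score falls below the threshold."""
    h, new_cache = x, []
    for i, layer in enumerate(layers):
        if step_cache is not None and recompute_score[i] < threshold:
            h = step_cache[i]      # reuse: this layer's compute is skipped
        else:
            h = layer(h)           # recompute this layer at this step
        new_cache.append(h)
    return h, new_cache

# Two toy "layers": step 1 recomputes layer 0 but reuses layer 1's output.
layers = [lambda v: v + 1, lambda v: v * 2]
out, cache = cached_forward(layers, 0, None, [0.9, 0.1])
out, cache = cached_forward(layers, 0, cache, [0.9, 0.1])
```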
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache achieves comparable or even marginally improved results when combined with DDIM or PLMS.
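The temporal-redundancy idea can be sketched as follows (the split into shallow/deep stages and the fixed refresh interval are simplifying assumptions; names are placeholders):

```python
def denoise(shallow_in, deep, shallow_out, x_t, total_steps, cache_interval=3):
    """Reuse expensive deep U-Net features across adjacent denoising steps,
    recomputing only the cheap shallow layers at every step."""
    deep_cache = None
    for t in range(total_steps):
        h = shallow_in(x_t, t)                 # always recomputed (cheap)
        if deep_cache is None or t % cache_interval == 0:
            deep_cache = deep(h, t)            # periodic full pass refreshes the cache
        x_t = shallow_out(h, deep_cache, t)    # combine via the skip connection
    return x_t
```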
arXiv Detail & Related papers (2023-12-01T17:01:06Z)
- Fast Exploration of the Impact of Precision Reduction on Spiking Neural Networks [63.614519238823206]
Spiking Neural Networks (SNNs) are a practical choice when the target hardware sits at the edge of computing.
We employ an Interval Arithmetic (IA) model to develop an exploration methodology that takes advantage of the capability of such a model to propagate the approximation error.
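As a sketch of how interval arithmetic propagates an approximation error through one layer (a generic affine layer, not the paper's SNN-specific model):

```python
import numpy as np

def interval_affine(W, b, lo, hi):
    """Given elementwise input bounds [lo, hi], return sound output bounds
    for y = W @ x + b using interval arithmetic."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

# Bound the effect of rounding inputs to a fixed-point step q.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 8)), np.zeros(4)
x, q = rng.standard_normal(8), 1 / 2**4
lo, hi = interval_affine(W, b, x - q / 2, x + q / 2)
print(hi - lo)   # interval width = worst-case spread due to quantization
```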
arXiv Detail & Related papers (2022-11-22T15:08:05Z)
- Improving the Performance of DNN-based Software Services using Automated Layer Caching [3.804240190982695]
Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services.
The computational complexity of such large models remains significant, hindering low-latency inference.
In this paper, we propose an end-to-end automated solution to improve the performance of DNN-based services.
arXiv Detail & Related papers (2022-09-18T18:21:20Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate DL inference workload and increase system throughput, they also introduce an approximation error.
We analytically model the performance of our caching system for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our approach with state-of-the-art similarity caching.
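A minimal sketch of an approximate-key LRU cache (the grid quantizer deciding when two feature vectors share a key is an illustrative assumption; the paper controls the resulting error analytically):

```python
from collections import OrderedDict
import numpy as np

class ApproxKeyLRU:
    """LRU cache whose keys are coarsely quantized feature vectors, so
    similar inputs hit the same entry (at the cost of approximation error)."""

    def __init__(self, capacity=1024, cell=0.25):
        self.capacity, self.cell, self.store = capacity, cell, OrderedDict()

    def _key(self, feat):
        return tuple(np.round(np.asarray(feat) / self.cell).astype(int))

    def get(self, feat):
        k = self._key(feat)
        if k in self.store:
            self.store.move_to_end(k)       # refresh LRU recency
            return self.store[k]            # approximate hit
        return None

    def put(self, feat, label):
        self.store[self._key(feat)] = label
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
```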
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
- Accelerating Deep Learning Inference via Learned Caches [11.617579969991294]
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems.
Current low-latency solutions trade off accuracy or fail to exploit the inherent temporal locality in prediction serving workloads.
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency inference.
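A sketch of the learned-cache pattern (the linear probe and confidence threshold are assumptions; GATI's learned caches and serving pipeline are more elaborate):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def infer_with_learned_cache(front, rest, probe_W, x, confidence=0.9):
    """Run the first layers, let a small learned predictor guess the label,
    and skip the remaining layers when the guess is confident enough."""
    h = front(x)                              # partial forward pass
    p = softmax(probe_W @ h)                  # cheap learned "cache lookup"
    if p.max() >= confidence:
        return int(np.argmax(p))              # hit: serve the early prediction
    return int(np.argmax(rest(h)))            # miss: finish the full model
```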
arXiv Detail & Related papers (2021-01-18T22:13:08Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
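For reference, the kind of uniform fake-quantization step that fully-quantized pipelines build on (bit-width and symmetric scaling here are generic choices, not AQD's specific scheme):

```python
import numpy as np

def fake_quantize(x, bits=4):
    """Round a tensor to a symmetric uniform grid and dequantize it,
    simulating low-bit storage and compute in floating point."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(x).max(), 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale
```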
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- CacheNet: A Model Caching Framework for Deep Learning Inference on the Edge [3.398008512297358]
CacheNet is a model caching framework for machine perception applications.
It caches low-complexity models on end devices and high-complexity (or full) models on edge or cloud servers.
It is 58-217% faster than baseline approaches that run inference tasks on end devices or edge servers alone.
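The serving pattern can be sketched as a confidence-gated cascade (the threshold gate is an assumption about how the cached model defers to the full one):

```python
import numpy as np

def serve(small_model, full_model_rpc, x, threshold=0.8):
    """Answer from the low-complexity on-device model when it is confident;
    otherwise defer to the full model on the edge/cloud server."""
    probs = small_model(x)                # cheap on-device pass
    if probs.max() >= threshold:
        return int(np.argmax(probs))      # served locally from the cached model
    return full_model_rpc(x)              # fall back to the remote full model
```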
arXiv Detail & Related papers (2020-07-03T16:32:14Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
- Compression of descriptor models for mobile applications [26.498907514590165]
We evaluate the computational cost, model size, and matching accuracy tradeoffs for deep neural networks.
We observe a significant redundancy in the learned weights, which we exploit through the use of depthwise separable layers.
We propose the Convolution-Depthwise-Pointwise (CDP) layer, which provides a means of interpolating between standard and depthwise separable convolutions.
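One plausible reading of such a layer in PyTorch, where the groups count interpolates between a standard convolution (groups=1) and a depthwise separable one (groups=in_ch); this is an assumed formulation, not the paper's exact layer:

```python
import torch.nn as nn

def cdp_layer(in_ch, out_ch, kernel_size=3, groups=1):
    """Grouped spatial convolution followed by a 1x1 pointwise convolution;
    varying `groups` trades accuracy against compute and model size."""
    assert in_ch % groups == 0
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size, padding=kernel_size // 2,
                  groups=groups, bias=False),    # grouped spatial filtering
        nn.Conv2d(in_ch, out_ch, 1, bias=False), # pointwise channel mixing
    )

block = cdp_layer(32, 64, groups=4)   # between standard (1) and depthwise (32)
```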
arXiv Detail & Related papers (2020-01-09T17:00:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.