FusedInf: Efficient Swapping of DNN Models for On-Demand Serverless Inference Services on the Edge
- URL: http://arxiv.org/abs/2410.21120v1
- Date: Mon, 28 Oct 2024 15:21:23 GMT
- Title: FusedInf: Efficient Swapping of DNN Models for On-Demand Serverless Inference Services on the Edge
- Authors: Sifat Ut Taki, Arthi Padmanabhan, Spyridon Mastorakis
- Abstract summary: We introduce FusedInf to efficiently swap DNN models for on-demand serverless inference services on the edge.
Our evaluation of popular DNN models showed that creating a single DAG can make the execution of the models up to 14% faster.
- Abstract: Edge AI computing boxes are a new class of computing devices that aim to revolutionize the AI industry. These compact and robust hardware units bring the power of AI processing directly to the source of data, at the edge of the network. At the same time, on-demand serverless inference services are becoming increasingly popular because they minimize the infrastructure cost of hosting and running DNN models for small and medium-sized businesses. However, these computing devices remain constrained in terms of resource availability. As such, service providers need to load and unload models efficiently in order to meet growing demand. In this paper, we introduce FusedInf to efficiently swap DNN models for on-demand serverless inference services on the edge. FusedInf combines multiple models into a single Directed Acyclic Graph (DAG) to efficiently load the models into GPU memory and make execution faster. Our evaluation of popular DNN models showed that creating a single DAG can make the execution of the models up to 14% faster while reducing the memory requirement by up to 17%. The prototype implementation is available at https://github.com/SifatTaj/FusedInf.
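As a rough illustration of the fused-DAG idea, here is a minimal sketch assuming PyTorch; the wrapper class and toy models are illustrative assumptions, not FusedInf's actual code. Wrapping several models in one module means a single transfer places all weights in GPU memory, and one forward call evaluates every model:

```python
# Minimal sketch of the model-fusion idea, assuming PyTorch.
# Illustration only; FusedInf's actual DAG construction may differ.
import torch
import torch.nn as nn

class FusedModels(nn.Module):
    """Wrap several models so they load and run as one graph."""
    def __init__(self, models):
        super().__init__()
        self.models = nn.ModuleList(models)

    def forward(self, x):
        # One forward pass evaluates every member model.
        return [m(x) for m in self.models]

device = "cuda" if torch.cuda.is_available() else "cpu"
members = [nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                         nn.Flatten(), nn.Linear(8 * 30 * 30, 10))
           for _ in range(3)]
fused = FusedModels(members).eval().to(device)  # one load into GPU memory

with torch.no_grad():
    outputs = fused(torch.randn(1, 3, 32, 32, device=device))
```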
Related papers
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference
This paper shows how to reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
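A toy sketch of the splitting idea, assuming PyTorch; the importance score and helper below are assumptions, not FFSplit's algorithm. The hidden neurons of a feed-forward block are ranked by weight magnitude and the block is sliced into an "important" half and a remainder whose outputs sum back to the original:

```python
# Toy sketch of splitting a feed-forward block by neuron importance,
# assuming PyTorch; an illustration, not FFSplit's actual algorithm.
import torch
import torch.nn as nn

def split_ffn(w_in: nn.Linear, w_out: nn.Linear, keep: int):
    # Score each hidden neuron by the norm of its fan-in weights.
    scores = w_in.weight.norm(dim=1)
    top = torch.topk(scores, keep).indices
    mask = torch.ones(w_in.out_features, dtype=torch.bool)
    mask[top] = False
    rest = mask.nonzero().squeeze(1)

    def sub_block(idx):
        a = nn.Linear(w_in.in_features, len(idx))
        b = nn.Linear(len(idx), w_out.out_features, bias=False)
        a.weight.data = w_in.weight.data[idx].clone()
        a.bias.data = w_in.bias.data[idx].clone()
        b.weight.data = w_out.weight.data[:, idx].clone()
        return nn.Sequential(a, nn.ReLU(), b)

    # Summing both halves' outputs (plus w_out.bias, added once)
    # reproduces the original block exactly.
    return sub_block(top), sub_block(rest)
```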
- Edge AI as a Service with Coordinated Deep Neural Networks
CoDE aims to find the optimal path, that is, the path with the highest possible reward, by creating multi-task DNNs from individual models.
Experiments show that CoDE improves inference throughput and achieves higher precision than an existing state-of-the-art method.
arXiv Detail & Related papers (2024-01-01T01:54:53Z)
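A minimal sketch of reward-driven path selection; the reward shape and the accuracy/latency numbers are invented for illustration and the paper's actual formulation may differ:

```python
# Toy path selection in the spirit of CoDE: pick the candidate
# execution path with the highest reward (numbers are made up).
def reward(acc, latency_ms, lam=0.01):
    # Assumed reward shape: accuracy minus a latency penalty.
    return acc - lam * latency_ms

paths = {
    "backbone_only":     (0.71, 12.0),
    "backbone+head_a":   (0.78, 21.0),
    "backbone+head_a+b": (0.81, 37.0),
}
best = max(paths, key=lambda p: reward(*paths[p]))
print(best)  # the path with the highest reward
```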
- DNNShifter: An Efficient DNN Pruning System for Edge Computing
Deep neural networks (DNNs) underpin many machine learning applications.
Production-quality DNN models achieve high inference accuracy by training millions of DNN parameters, which carries a significant resource footprint.
This presents a challenge for devices operating at the extreme edge of the network, such as mobile and embedded devices with limited computational and memory resources.
Existing pruning methods either cannot match the quality of their unpruned counterparts without significant time costs and overheads, or are limited to offline use cases.
Our work rapidly derives suitable model variants while maintaining the accuracy of the original model. The model variants can be swapped quickly when system conditions change.
arXiv Detail & Related papers (2023-09-13T14:05:50Z)
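A hedged sketch of the variant-swapping idea using PyTorch's structured pruning utilities; the sparsity levels and the L2 pruning criterion are assumptions, and a real system like DNNShifter would also physically remove the zeroed filters rather than just masking them:

```python
# Sketch of keeping pruned model variants ready to swap at runtime.
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

def make_variant(model, amount):
    variant = copy.deepcopy(model)
    for m in variant.modules():
        if isinstance(m, nn.Conv2d):
            # Zero out whole filters by L2 norm, then bake in the mask.
            prune.ln_structured(m, "weight", amount=amount, n=2, dim=0)
            prune.remove(m, "weight")
    return variant

base = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 16, 3))
variants = {s: make_variant(base, s) for s in (0.25, 0.5, 0.75)}
active = variants[0.5]  # swap by picking another key when load changes
```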
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- Dynamic Graph Message Passing Networks for Visual Recognition
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
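A generic sketch of the underlying trick: aggregate messages from a sampled subset of nodes instead of the full graph. This is illustrative only; the paper's dynamic sampling is learned, whereas here it is random:

```python
# Message passing over k sampled neighbours per node: O(n*k)
# messages instead of O(n^2) for a fully-connected graph.
import torch

def sampled_message_passing(feats, k=8):
    # feats: (n, d) node features; sample k source nodes per node.
    n, d = feats.shape
    idx = torch.randint(n, (n, k))       # sampled neighbour indices
    messages = feats[idx]                # (n, k, d)
    return feats + messages.mean(dim=1)  # aggregate and update

out = sampled_message_passing(torch.randn(100, 64))
```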
- Improving the Performance of DNN-based Software Services using Automated Layer Caching
Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services.
The computational cost of such large models can still be significant, preventing low inference latency.
In this paper, we propose an end-to-end automated solution to improve the performance of DNN-based services.
arXiv Detail & Related papers (2022-09-18T18:21:20Z)
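A minimal memoising wrapper sketches the caching idea, assuming PyTorch; keying on exact input bytes is a simplification of the paper's automated approach, which caching only repeated identical inputs can benefit from:

```python
# Memoise a sub-network's outputs so repeated inputs skip recompute.
import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    def __init__(self, block, capacity=1024):
        super().__init__()
        self.block, self.cache, self.capacity = block, {}, capacity

    def forward(self, x):
        # Simplification: key on the raw input bytes.
        key = hash(x.detach().cpu().numpy().tobytes())
        if key not in self.cache:
            if len(self.cache) >= self.capacity:
                self.cache.pop(next(iter(self.cache)))  # evict oldest
            self.cache[key] = self.block(x)
        return self.cache[key]

layer = CachedBlock(nn.Linear(16, 16))
x = torch.randn(1, 16)
y1, y2 = layer(x), layer(x)  # second call is served from the cache
```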
- An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks
Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive.
We propose a new software layer to serve ensembles of DNNs with flexibility and efficiency.
arXiv Detail & Related papers (2022-08-30T08:05:43Z)
- Update Compression for Deep Neural Networks on the Edge
An increasing number of AI applications involve the execution of deep neural networks (DNNs) on edge devices.
Many practical reasons motivate the need to update the DNN model on the edge device post-deployment.
We develop a simple approach based on matrix factorisation to compress the model update.
arXiv Detail & Related papers (2022-03-09T04:20:43Z)
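The idea can be sketched with a truncated SVD of the weight delta, a minimal illustration of matrix-factorisation update compression; the paper's exact factorisation scheme may differ:

```python
# Compress a weight update by shipping two thin low-rank factors
# instead of the full delta matrix.
import torch

def compress_update(w_old, w_new, rank):
    delta = w_new - w_old
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    return U[:, :rank] * S[:rank], Vh[:rank]  # A (m,r) and B (r,n)

def apply_update(w_old, A, B):
    return w_old + A @ B  # low-rank reconstruction of the delta

w_old = torch.randn(256, 256)
w_new = w_old + 0.01 * torch.randn(256, 256)
A, B = compress_update(w_old, w_new, rank=16)  # 2*256*16 vs 256*256 values
w_approx = apply_update(w_old, A, B)
```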
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models and statistical models against a roofline model and a refined roofline model.
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
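For context, a roofline estimate bounds a layer's execution time by its compute and memory traffic; a toy version with assumed peak hardware numbers, illustrating the kind of baseline ANNETTE is compared against:

```python
# Toy roofline-style layer time estimate (peak numbers are assumed).
def roofline_time(flops, bytes_moved, peak_flops=1e12, bandwidth=50e9):
    # Time is bounded by whichever resource saturates first.
    return max(flops / peak_flops, bytes_moved / bandwidth)

# e.g. a 3x3 conv, 8->8 channels, over a 30x30 output map:
flops = 2 * 8 * 8 * 3 * 3 * 30 * 30
bytes_moved = 4 * (8 * 32 * 32 + 8 * 30 * 30 + 8 * 8 * 3 * 3)
print(f"{roofline_time(flops, bytes_moved) * 1e6:.2f} us")
```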
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
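A toy sketch of pattern-based pruning, masking every 3x3 kernel with one fixed pattern; the pattern choice and the masking step are illustrative, and PatDNN's actual contribution is the compiler support built around such patterns:

```python
# Apply one fixed sparsity pattern to every 3x3 convolution kernel.
import torch

def apply_pattern(conv_weight, pattern):
    # conv_weight: (out_c, in_c, 3, 3); pattern: (3, 3) boolean mask
    return conv_weight * pattern.to(conv_weight.dtype)

pattern = torch.tensor([[0, 1, 0],
                        [1, 1, 1],
                        [0, 0, 0]], dtype=torch.bool)
w = torch.randn(8, 3, 3, 3)
w_pruned = apply_pattern(w, pattern)  # 4 of 9 weights kept per kernel
```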
- Model Fusion via Optimal Transport
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
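A rough sketch of layer-wise fusion, using Hungarian matching as a stand-in for the optimal transport coupling; this substitution is an assumption, as the paper computes a proper OT alignment:

```python
# Match neurons across two layers, then average the matched pairs.
import torch
from scipy.optimize import linear_sum_assignment

def fuse_layers(w_a, w_b):
    # Cost: distance between each pair of neurons' fan-in weights.
    cost = torch.cdist(w_a, w_b).numpy()
    rows, cols = linear_sum_assignment(cost)
    # Permute w_b's neurons to align with w_a, then average.
    return 0.5 * (w_a + w_b[torch.as_tensor(cols)])

w_a, w_b = torch.randn(64, 128), torch.randn(64, 128)
w_fused = fuse_layers(w_a, w_b)
```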
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.