Runtime Deep Model Multiplexing for Reduced Latency and Energy Consumption Inference
- URL: http://arxiv.org/abs/2001.05870v2
- Date: Thu, 17 Sep 2020 17:07:31 GMT
- Title: Runtime Deep Model Multiplexing for Reduced Latency and Energy Consumption Inference
- Authors: Amir Erfan Eshratifar and Massoud Pedram
- Abstract summary: We propose a learning algorithm to design a light-weight neural multiplexer that calls the model that will consume the minimum compute resources for a successful inference.
Mobile devices can use the proposed algorithm to offload the hard inputs to the cloud while inferring the easy ones locally.
- Score: 6.896677899938492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a learning algorithm to design a light-weight neural multiplexer
that given the input and computational resource requirements, calls the model
that will consume the minimum compute resources for a successful inference.
Mobile devices can use the proposed algorithm to offload the hard inputs to the
cloud while inferring the easy ones locally. Besides, in large-scale
cloud-based intelligent applications, instead of replicating the most accurate
model, a range of small and large models can be multiplexed depending on the
input's complexity, which saves the cloud's computational resources. The
input complexity or hardness is determined by the number of models that can
predict the correct label. For example, if no model can predict the label
correctly, then the input is considered the hardest. The proposed algorithm
allows the mobile device to detect the inputs that can be processed locally and
the ones that require a larger model and should be sent to a cloud server.
Therefore, the mobile user benefits from not only the local processing but also
from an accurate model hosted on a cloud server. Our experimental results show
that the proposed algorithm improves the mobile model's accuracy by 8.52%,
because of those inputs that are properly selected and offloaded to the cloud
server. In addition, it saves the cloud providers' compute resources by a
factor of 2.85x as small models are chosen for easier inputs.
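The abstract leaves the construction of the multiplexer's training signal implicit, so a minimal sketch may help. Assuming a pool of pretrained classifiers with known per-inference compute costs (the function name, the cost list, and the fallback policy below are illustrative assumptions, not the authors' released code), each input can be labeled with the cheapest model that classifies it correctly, and its hardness with the number of models that fail on it:

```python
# Hypothetical sketch: derive multiplexer training targets from a model pool.
import torch


@torch.no_grad()
def multiplexer_targets(models, costs, inputs, labels):
    """models: list of nn.Module classifiers; costs: per-inference compute
    cost for each model (e.g. FLOPs), same order; inputs/labels: a batch.
    Returns, per input, the index of the cheapest model that predicts the
    correct label, falling back to the most expensive model when no model
    is correct (the "hardest" inputs in the paper's terminology), plus a
    simple hardness score: the number of models that fail on the input."""
    correct = []
    for m in models:
        m.eval()
        preds = m(inputs).argmax(dim=1)
        correct.append(preds.eq(labels))
    correct = torch.stack(correct, dim=1)            # [batch, num_models] bools

    order = sorted(range(len(models)), key=lambda i: costs[i])  # cheap -> expensive
    targets = torch.full((inputs.size(0),), order[-1], dtype=torch.long)
    assigned = torch.zeros(inputs.size(0), dtype=torch.bool)
    for i in order:                                   # cheapest correct model wins
        pick = correct[:, i] & ~assigned
        targets[pick] = i
        assigned |= pick

    hardness = correct.size(1) - correct.sum(dim=1)   # no correct model => hardest
    return targets, hardness
```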
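At runtime, the mobile-side decision described in the abstract can be pictured with the following hypothetical two-way multiplexer. The tiny CNN architecture, the `cloud_infer` stub, and the restriction to a local-vs-cloud choice are assumptions for illustration; the paper's multiplexer generalizes to a pool of several models of different sizes.

```python
# Hypothetical sketch: a light-weight multiplexer routes each input either to
# the small on-device model or to a large cloud-hosted model.
import torch
import torch.nn as nn


class Multiplexer(nn.Module):
    """Tiny CNN that scores which candidate model should handle the input."""

    def __init__(self, num_models: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_models)

    def forward(self, x):
        return self.head(self.features(x))            # logits over candidate models


@torch.no_grad()
def infer(x, multiplexer, local_model, cloud_infer):
    """Route a single input: index 0 = local small model, index 1 = cloud model."""
    choice = multiplexer(x).argmax(dim=1).item()
    if choice == 0:
        return local_model(x).argmax(dim=1)            # easy input: stay on device
    return cloud_infer(x)                              # hard input: offload to the cloud
```

The multiplexer must itself be far cheaper than the smallest candidate model, otherwise the routing overhead would cancel the savings; keeping it to a couple of convolutional layers reflects that constraint.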
Related papers
- Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z)
- Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution [1.8029479474051309]
We design a hybrid edge-cloud solution that leverages the efficiency of smaller models for local processing while deferring to larger, more accurate cloud-based models when necessary.
Specifically, we propose a novel unsupervised data generation method, Dual-Model Distillation (DMD), to train a lightweight switcher model that can predict when the edge model's output is uncertain.
Experimental results on the action classification task show that our framework not only requires less computational overhead, but also improves accuracy compared to using a large model alone.
arXiv Detail & Related papers (2024-10-16T02:06:27Z)
- Combining Cloud and Mobile Computing for Machine Learning [2.595189746033637]
We consider model segmentation as a solution to improving the user experience.
We show that the division not only reduces the wait time for users but can also be fine-tuned to optimize the workloads of the cloud.
arXiv Detail & Related papers (2024-01-20T06:14:22Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
In practice, our method can reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock time speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- An Ensemble Mobile-Cloud Computing Method for Affordable and Accurate Glucometer Readout [0.0]
We present an ensemble learning algorithm, a mobile-cloud computing service architecture, and a simple compression technique to achieve higher availability and faster response time.
Our proposed method (1) achieves 92.1% and 97.7% accuracy on two different datasets, improving on previous methods by 40%, (2) reduces the required bandwidth by 45x with a 1% drop in accuracy, and (3) provides better availability compared to mobile-only, cloud-only, split computing, and early exit service models.
arXiv Detail & Related papers (2023-01-04T18:48:53Z)
- LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but they also incur a huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Complexity-aware Adaptive Training and Inference for Edge-Cloud Distributed AI Systems [9.273593723275544]
IoT and machine learning applications create large amounts of data that require real-time processing.
We propose a distributed AI system to exploit both the edge and the cloud for training and inference.
arXiv Detail & Related papers (2021-09-14T05:03:54Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
- Computation on Sparse Neural Networks: an Inspiration for Future Hardware [20.131626638342706]
We describe the current status of the research on the computation of sparse neural networks.
We discuss the model accuracy influenced by the number of weight parameters and the structure of the model.
We show that for practically complicated problems, it is more beneficial to search for large and sparse models in the weight-dominated region.
arXiv Detail & Related papers (2020-04-24T19:13:50Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)