Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference
Serving Systems
- URL: http://arxiv.org/abs/2304.10892v2
- Date: Mon, 24 Apr 2023 12:47:45 GMT
- Title: Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference
Serving Systems
- Authors: Mehran Salmani (1), Saeid Ghafouri (2 and 4), Alireza Sanaee (2),
Kamran Razavi (3), Max Mühlhäuser (3), Joseph Doyle (2), Pooyan Jamshidi
(4), Mohsen Sharifi (1) ((1) Iran University of Science and Technology, (2)
Queen Mary University of London, (3) Technical University of Darmstadt, (4)
University of South Carolina)
- Abstract summary: InfAdapter proactively selects a set of ML model variants and their resource allocations to meet the latency SLO.
It reduces SLO violations and cost by up to 65% and 33%, respectively, compared to a popular industry autoscaler.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of machine learning (ML) inference for various applications is
growing rapidly. ML inference services engage with users directly, requiring
fast and accurate responses. Moreover, these services face dynamic request
workloads, requiring changes to their computing resources. Failing to
right-size computing resources results in either latency service level
objective (SLO) violations or wasted computing resources. Adapting to dynamic
workloads while considering all three pillars of accuracy, latency, and
resource cost is challenging. In response to these challenges, we propose
InfAdapter, which proactively selects a set of ML model variants and their
resource allocations to meet the latency SLO while maximizing an objective
function composed of accuracy and cost. InfAdapter decreases SLO violations
and cost by up to 65% and 33%, respectively, compared to a popular industry
autoscaler (Kubernetes Vertical Pod Autoscaler).
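The abstract describes a selection problem: choose a set of model variants and resource allocations that together meet the latency SLO and absorb the predicted workload, while maximizing a weighted accuracy-minus-cost objective. Below is a minimal brute-force sketch of that selection step, assuming each (variant, allocation) pair was profiled offline for accuracy, tail latency, and capacity; all names, numbers, and scoring weights are illustrative assumptions, not InfAdapter's actual formulation or solver.

```python
# Minimal sketch of SLO-aware variant-set selection over offline-profiled
# candidates. Illustrative only; InfAdapter's real solver may differ.
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class VariantConfig:
    name: str               # model variant name (assumed for illustration)
    cpus: int               # CPU cores allocated to this variant
    accuracy: float         # offline-measured accuracy in [0, 1]
    p99_latency_ms: float   # profiled tail latency at this allocation
    capacity_rps: float     # sustainable throughput at this allocation

def select_variants(configs, predicted_rps, slo_ms, alpha=0.5, max_set_size=3):
    """Pick the variant set maximizing alpha * accuracy - (1 - alpha) * cost,
    subject to every variant meeting the SLO and the set covering the load."""
    best, best_score = None, float("-inf")
    for k in range(1, max_set_size + 1):
        for subset in combinations(configs, k):
            if any(c.p99_latency_ms > slo_ms for c in subset):
                continue  # every chosen variant must meet the latency SLO
            total_cap = sum(c.capacity_rps for c in subset)
            if total_cap < predicted_rps:
                continue  # the set must absorb the predicted workload
            # Capacity-weighted accuracy minus a normalized CPU cost.
            avg_acc = sum(c.accuracy * c.capacity_rps for c in subset) / total_cap
            cost = sum(c.cpus for c in subset)
            score = alpha * avg_acc - (1 - alpha) * cost / 100.0
            if score > best_score:
                best, best_score = subset, score
    return best

profiled = [  # hypothetical profiling data
    VariantConfig("resnet18", cpus=2, accuracy=0.70, p99_latency_ms=40, capacity_rps=120),
    VariantConfig("resnet50", cpus=4, accuracy=0.76, p99_latency_ms=80, capacity_rps=90),
    VariantConfig("resnet152", cpus=8, accuracy=0.78, p99_latency_ms=140, capacity_rps=60),
]
print(select_variants(profiled, predicted_rps=150, slo_ms=100))
```

A production system would re-run this selection whenever the workload forecast changes, and would likely replace the brute-force search with an integer program for larger variant catalogs.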
Related papers
- Towards Resource-Efficient Federated Learning in Industrial IoT for Multivariate Time Series Analysis [50.18156030818883]
Anomaly and missing data constitute a thorny problem in industrial applications.
Deep learning enabled anomaly detection has emerged as a critical direction.
The data collected on edge devices contains privacy-sensitive user information.
arXiv Detail & Related papers (2024-11-06T15:38:31Z)
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR reduces the computational cost of the LLM by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance.
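The early-exit idea behind this entry can be sketched generically: attach a prediction head to intermediate blocks and stop computing once the prediction is confident enough. A minimal PyTorch sketch of that pattern (the architecture, shared head, and 0.9 threshold are assumptions; DeeR's actual multimodal architecture and exit criteria differ):

```python
# Generic early-exit sketch: a shared head predicts after each block, and
# inference stops once confidence clears a threshold. Illustrative only.
import torch
import torch.nn as nn

class EarlyExitStack(nn.Module):
    def __init__(self, dim=64, num_blocks=6, num_classes=10, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_blocks)
        )
        self.head = nn.Linear(dim, num_classes)  # shared exit head
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        pred, depth = None, 0
        for depth, block in enumerate(self.blocks, start=1):
            x = block(x)
            probs = self.head(x).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            if conf.item() >= self.threshold:  # assumes batch size 1
                break  # confident enough: skip the remaining blocks
        return pred, depth  # depth = number of blocks actually executed

pred, blocks_used = EarlyExitStack()(torch.randn(1, 64))
```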
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- Adaptive Stream Processing on Edge Devices through Active Inference [5.5676731834895765]
We present a novel machine learning paradigm based on Active Inference (AIF).
AIF describes how the brain constantly predicts and evaluates sensory information to decrease long-term surprise.
Our method guarantees full transparency in decision making, making result interpretation and troubleshooting effortless.
arXiv Detail & Related papers (2024-09-26T15:12:41Z)
- Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision to accelerate inference by dynamically assigning resources for each data instance.
Our method reduces inference cost while maintaining the same accuracy.
arXiv Detail & Related papers (2024-05-07T17:44:54Z)
- SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing Fees [21.801053526411415]
Large Language Models (LLMs) have significantly boosted performance in natural language processing (NLP) tasks.
The deployment of high-performance LLMs incurs substantial costs, primarily due to the increased number of parameters aimed at enhancing model performance.
We introduce SMART, a novel framework designed to minimize the inference costs of NLP tasks while ensuring sufficient result quality.
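The cost-quality trade-off this entry describes is often realized as cheap-model-first routing: answer with an inexpensive model and escalate only when a quality proxy falls below a floor. A toy sketch of that pattern (the stand-in models, per-call costs, and confidence proxy are invented for illustration; SMART itself provides statistical accuracy guarantees rather than a per-request confidence check):

```python
# Toy cheap-model-first routing with an escalation floor. Illustrative only.
import random

COSTS = {"small-lm": 0.001, "large-lm": 0.020}  # hypothetical $ per call

def small_lm(prompt: str):
    """Stand-in for a cheap model: returns (answer, self-reported confidence)."""
    return "answer-from-small-lm", random.random()

def large_lm(prompt: str):
    """Stand-in for an expensive, more accurate reference model."""
    return "answer-from-large-lm", 1.0

def answer(prompt: str, confidence_floor: float = 0.8):
    result, conf = small_lm(prompt)
    cost = COSTS["small-lm"]
    if conf < confidence_floor:        # low confidence: pay for the big model
        result, conf = large_lm(prompt)
        cost += COSTS["large-lm"]
    return result, cost

print(answer("Classify the sentiment of: 'great service!'"))
```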
arXiv Detail & Related papers (2024-03-11T17:45:47Z)
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource-constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
- Multi-Level ML Based Burst-Aware Autoscaling for SLO Assurance and Cost Efficiency [3.5624365288866007]
This paper introduces BAScaler, a Burst-Aware Autoscaling framework for containerized cloud services or applications under complex workloads.
BAScaler incorporates a novel prediction-based burst detection mechanism that distinguishes between predictable periodic workload spikes and actual bursts.
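A prediction-based burst detector in this spirit can be sketched as a residual test: spikes the workload predictor anticipated are treated as normal, while large unexplained residuals count as bursts. A minimal sketch (window size, warm-up length, and the 3-sigma rule are assumptions, not BAScaler's actual parameters):

```python
# Burst detection via prediction residuals. Illustrative only.
from collections import deque
import statistics

class BurstDetector:
    def __init__(self, window=60, sigmas=3.0, warmup=10):
        self.residuals = deque(maxlen=window)  # recent prediction errors
        self.sigmas = sigmas
        self.warmup = warmup

    def observe(self, predicted_rps: float, actual_rps: float) -> bool:
        residual = actual_rps - predicted_rps
        is_burst = False
        if len(self.residuals) >= self.warmup:
            mu = statistics.mean(self.residuals)
            sd = statistics.pstdev(self.residuals) or 1.0
            is_burst = residual > mu + self.sigmas * sd  # unexplained spike
        self.residuals.append(residual)
        return is_burst
```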
arXiv Detail & Related papers (2024-02-20T12:28:25Z)
- Lifelong Learning for Fog Load Balancing: A Transfer Learning Approach [0.7366405857677226]
We improve the performance of privacy-aware Reinforcement Learning (RL) agents that optimize the execution delay of IoT applications by minimizing the waiting delay.
We propose a lifelong learning framework for these agents, where lightweight inference models are used during deployment to minimize action delay and only retrained in case of significant environmental changes.
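The "retrain only on significant environmental change" policy can be sketched as a drift check over recent observations. A toy version (the z-score test, window, and threshold are assumptions; the paper's actual change-detection criterion is not specified in this summary):

```python
# Drift-triggered retraining check. Illustrative only.
from collections import deque
import statistics

class DriftTriggeredRetrainer:
    def __init__(self, baseline_mean, baseline_std, window=50, threshold=3.0):
        self.mean = baseline_mean              # statistics of training data
        self.std = max(baseline_std, 1e-6)
        self.recent = deque(maxlen=window)     # latest environment readings
        self.threshold = threshold

    def should_retrain(self, observation: float) -> bool:
        self.recent.append(observation)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence yet
        z = abs(statistics.mean(self.recent) - self.mean) / self.std
        return z > self.threshold  # environment drifted significantly
```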
arXiv Detail & Related papers (2023-10-08T14:49:33Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
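Sampling rare events in proportion to their impact resembles prioritized experience replay. A tabular sketch of that general pattern, weighting replayed transitions by their last absolute TD error (this mirrors the stated idea only; ImRE's exact estimator and importance weights are not given in this summary, and all parameters are assumed):

```python
# Importance-weighted replay for tabular Q-learning: transitions are replayed
# with probability proportional to their last absolute TD error, so rare but
# high-impact events (e.g., server failures) are revisited more often.
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95
Q = defaultdict(float)   # Q[(state, action)] -> estimated value
buffer = []              # [(s, a, r, s_next, priority), ...]

def td_error(s, a, r, s_next, actions):
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    return target - Q[(s, a)]

def record(s, a, r, s_next, actions):
    """Store a transition with an initial priority from its TD error."""
    buffer.append((s, a, r, s_next, abs(td_error(s, a, r, s_next, actions)) + 1e-3))

def replay(actions, steps=100):
    """Sample transitions by priority and apply Q-learning updates."""
    for _ in range(steps):
        if not buffer:
            return
        weights = [p for *_, p in buffer]
        i = random.choices(range(len(buffer)), weights=weights, k=1)[0]
        s, a, r, s_next, _ = buffer[i]
        delta = td_error(s, a, r, s_next, actions)
        Q[(s, a)] += ALPHA * delta
        buffer[i] = (s, a, r, s_next, abs(delta) + 1e-3)  # refresh priority
```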
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- A Predictive Autoscaler for Elastic Batch Jobs [8.354712625979776]
Large batch jobs such as deep learning, HPC, and Spark workloads require far more computational resources and incur higher cost than conventional online services.
We propose a predictive autoscaler that provides an elastic interface for customers and proactively overprovisions instances.
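A predictive autoscaler of this kind can be sketched as forecast-then-provision: fit a simple model to recent demand, extrapolate one interval ahead, and provision with headroom before the load arrives. A minimal sketch (the linear trend, 20% headroom, and per-instance capacity are assumptions, not the paper's actual predictor):

```python
# Forecast-then-provision autoscaling sketch. Illustrative only.
import numpy as np

def plan_instances(history, capacity_per_instance=100.0, headroom=1.2, horizon=1):
    """Fit a linear trend to recent demand, forecast the next interval,
    and size the fleet with headroom so capacity is ready before the load."""
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, np.asarray(history, dtype=float), deg=1)
    forecast = slope * (len(history) - 1 + horizon) + intercept
    demand = max(forecast, 0.0) * headroom
    return max(1, int(np.ceil(demand / capacity_per_instance)))

# Rising load -> provisions ahead of demand (6 instances for ~495 rps forecast).
print(plan_instances([300, 340, 390, 450]))
```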
arXiv Detail & Related papers (2020-10-10T17:35:55Z)
- Optimization-driven Machine Learning for Intelligent Reflecting Surfaces Assisted Wireless Networks [82.33619654835348]
Intelligent reflecting surface (IRS) technology has been employed to reshape wireless channels by controlling the phase shifts of individual scattering elements.
Due to the large number of scattering elements, passive beamforming is typically challenged by high computational complexity.
In this article, we focus on machine learning (ML) approaches for performance optimization in IRS-assisted wireless networks.
arXiv Detail & Related papers (2020-08-29T08:39:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.