Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference
Serving Systems
- URL: http://arxiv.org/abs/2304.10892v2
- Date: Mon, 24 Apr 2023 12:47:45 GMT
- Title: Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference
Serving Systems
- Authors: Mehran Salmani (1), Saeid Ghafouri (2 and 4), Alireza Sanaee (2),
Kamran Razavi (3), Max Mühlhäuser (3), Joseph Doyle (2), Pooyan Jamshidi
(4), Mohsen Sharifi (1) ((1) Iran University of Science and Technology, (2)
Queen Mary University of London, (3) Technical University of Darmstadt, (4)
University of South Carolina)
- Abstract summary: InfAdapter proactively selects a set of ML model variants with their resource allocations to meet the latency SLO.
It reduces SLO violations and cost by up to 65% and 33%, respectively, compared to a popular industry autoscaler.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of machine learning (ML) inference for various applications is
growing rapidly. ML inference services engage with users directly, requiring
fast and accurate responses. Moreover, these services face dynamic request
workloads, requiring their computing resources to change accordingly. Failing
to right-size computing resources results in either latency service-level
objective (SLO) violations or wasted computing resources. Adapting to dynamic
workloads while accounting for all three pillars of accuracy, latency, and
resource cost is challenging. In response to these challenges, we propose
InfAdapter, which proactively selects a set of ML model variants with their
resource allocations to meet the latency SLO while maximizing an objective
function composed of accuracy and cost. InfAdapter decreases SLO violations
and cost by up to 65% and 33%, respectively, compared to a popular industry
autoscaler (Kubernetes Vertical Pod Autoscaler).
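The selection problem described in the abstract can be made concrete with a small sketch. Assuming profiled accuracy, per-replica throughput, and per-replica cost for each model variant (all numbers below are illustrative, not from the paper, and a capacity-vs-load check stands in for the latency SLO), a planner can enumerate variant sets with replica counts, keep the configurations that can serve the predicted load, and maximize a weighted accuracy-minus-cost objective:

```python
from itertools import combinations, product

# Illustrative profiling numbers (not from the paper): accuracy, per-replica
# throughput (requests/s), and per-replica CPU cost for each model variant.
VARIANTS = {
    "resnet18":  {"acc": 0.69, "rps": 120, "cost": 1},
    "resnet50":  {"acc": 0.76, "rps": 60,  "cost": 2},
    "resnet152": {"acc": 0.78, "rps": 25,  "cost": 4},
}
MAX_REPLICAS = 4
MAX_COST = 4 * MAX_REPLICAS  # rough normalization constant for the cost term

def plan(predicted_rps, alpha=0.8):
    """Choose a variant set and replica counts that can serve the predicted
    load (a stand-in for the latency SLO check) while maximizing
    alpha * expected_accuracy - (1 - alpha) * normalized_cost."""
    best, best_score = None, float("-inf")
    names = list(VARIANTS)
    for k in (1, 2):  # serve with one variant, or split traffic across two
        for subset in combinations(names, k):
            for replicas in product(range(1, MAX_REPLICAS + 1), repeat=k):
                caps = [VARIANTS[v]["rps"] * r for v, r in zip(subset, replicas)]
                if sum(caps) < predicted_rps:
                    continue  # cannot keep up: the latency SLO would be violated
                shares = [c / sum(caps) for c in caps]  # capacity-proportional split
                acc = sum(s * VARIANTS[v]["acc"] for s, v in zip(shares, subset))
                cost = sum(VARIANTS[v]["cost"] * r for v, r in zip(subset, replicas))
                score = alpha * acc - (1 - alpha) * cost / MAX_COST
                if score > best_score:
                    best, best_score = dict(zip(subset, replicas)), score
    return best

print(plan(predicted_rps=200))  # -> {'resnet18': 2} under these toy numbers
```

Raising alpha favors the more accurate, more expensive variants; lowering it pushes the plan toward cheap variants that just clear the load.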
Related papers
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that automatically learns a reward model from the environment without human annotations (a minimal sketch of the idea follows this entry).
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
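The framework above learns its reward signal from the environment rather than from human labels. A minimal sketch of that idea, where every specific (the trajectory features, the environment's success signal, and a logistic reward model trained by gradient descent) is an illustrative stand-in rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: feature vectors for sampled agent trajectories and binary
# success signals obtained from the environment (no human annotation).
traj_features = rng.normal(size=(256, 8))          # e.g. pooled step embeddings
env_success = (traj_features @ rng.normal(size=8) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train a logistic reward model r(traj) = sigmoid(w . phi(traj)) by gradient
# descent on binary cross-entropy against the environment-provided labels.
w = np.zeros(8)
for _ in range(500):
    p = sigmoid(traj_features @ w)
    w -= 0.1 * traj_features.T @ (p - env_success) / len(env_success)

# At planning time, candidate trajectories are ranked by predicted reward.
candidates = rng.normal(size=(5, 8))
print(np.argmax(sigmoid(candidates @ w)))  # index of the highest-reward candidate
```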
- Adaptive Resource Allocation Optimization Using Large Language Models in Dynamic Wireless Environments [25.866960634041092]
Current solutions rely on domain-specific architectures or techniques, and a general DL approach for constrained optimization remains undeveloped.
We propose a large language model for resource allocation (LLM-RAO) to address the complex resource allocation problem while adhering to constraints.
LLM-RAO achieves up to a 40% performance enhancement compared to conventional DL methods and up to an 80% improvement over analytical approaches.
arXiv Detail & Related papers (2025-02-04T12:56:59Z)
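LLM-RAO, per the summary, poses the constrained allocation problem to an LLM in natural language. A minimal sketch of that pattern follows; `call_llm`, the prompt wording, and the JSON response schema are all hypothetical stand-ins for whatever endpoint and format the paper actually uses, and the budget constraint is re-enforced in code since LLM outputs are not guaranteed to be feasible:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM endpoint; replace with a real client call."""
    raise NotImplementedError

def allocate_power(channel_gains, total_power):
    """Ask the LLM for a power allocation, then enforce the budget constraint
    ourselves, since the model's reply may overshoot it."""
    prompt = (
        "Allocate transmit power across users to maximize sum rate.\n"
        f"Channel gains: {channel_gains}\n"
        f"Total power budget: {total_power}\n"
        'Reply with JSON: {"powers": [...]} summing to at most the budget.'
    )
    powers = json.loads(call_llm(prompt))["powers"]
    # Post-check: rescale if the proposed allocation exceeds the budget.
    scale = min(1.0, total_power / max(sum(powers), 1e-9))
    return [p * scale for p in powers]
```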
- Towards Resource-Efficient Federated Learning in Industrial IoT for Multivariate Time Series Analysis [50.18156030818883]
Anomalies and missing data constitute a thorny problem in industrial applications.
Deep-learning-enabled anomaly detection has emerged as a critical direction.
The data collected on edge devices contain privacy-sensitive user information.
arXiv Detail & Related papers (2024-11-06T15:38:31Z)
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR reduces the computational cost of the LLM by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
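DeeR's dynamic early exit can be illustrated with the generic multi-exit pattern: run the backbone stage by stage and stop as soon as an intermediate head is confident, so easy inputs activate only part of the model. The sketch below shows that control flow with a toy PyTorch network; the architecture and the fixed confidence threshold are illustrative, not DeeR's actual design:

```python
import torch
import torch.nn as nn

class EarlyExitModel(nn.Module):
    """Toy multi-exit network: a classifier head after every backbone stage."""
    def __init__(self, dim=32, num_classes=10, num_stages=4):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_stages)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_stages)]
        )

    @torch.no_grad()
    def forward(self, x, threshold=0.9):
        for i, (stage, head) in enumerate(zip(self.stages, self.heads)):
            x = stage(x)
            probs = head(x).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            if conf.item() >= threshold:   # confident enough: exit early,
                return pred, i + 1         # having run only i + 1 stages
        return pred, len(self.stages)      # fell through: used the full model

model = EarlyExitModel()
pred, stages_used = model(torch.randn(1, 32))
print(f"prediction {pred.item()} after {stages_used} stages")
```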
- Adaptive Stream Processing on Edge Devices through Active Inference [5.5676731834895765]
We present a novel Machine Learning paradigm based on Active Inference (AIF).
AIF describes how the brain constantly predicts and evaluates sensory information to decrease long-term surprise.
Our method guarantees fully transparent decision-making, making results easy to interpret and troubleshoot.
arXiv Detail & Related papers (2024-09-26T15:12:41Z)
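One toy reading of the AIF loop above, for a single stream-processing knob: keep a belief about the latency each configuration produces, act to minimize expected deviation from the SLO target, and shrink the surprise (prediction error) after each observation. The update rule and all numbers are illustrative, not the paper's formulation:

```python
# Toy active-inference-flavored controller: each action (parallelism level)
# carries a predicted latency; we act to minimize expected "surprise" relative
# to the SLO target and update beliefs from what we then observe.
TARGET_MS = 100.0
predicted = {1: 250.0, 2: 140.0, 4: 80.0}  # prior latency beliefs per level
LEARNING_RATE = 0.3

def choose_action():
    # Prefer the level whose predicted latency meets and sits closest to target.
    feasible = {a: p for a, p in predicted.items() if p <= TARGET_MS}
    pool = feasible or predicted
    return min(pool, key=lambda a: abs(pool[a] - TARGET_MS))

def observe(action, observed_ms):
    # Surprise = prediction error; shrink it by moving the belief toward data.
    predicted[action] += LEARNING_RATE * (observed_ms - predicted[action])

a = choose_action()   # -> 4 under the priors above
observe(a, 95.0)      # belief for level 4 moves from 80 toward 95
```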
- Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision mechanism that accelerates inference by dynamically assigning resources to each data instance.
Our method lowers inference cost while maintaining the same accuracy.
arXiv Detail & Related papers (2024-05-07T17:44:54Z)
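Per-instance resource assignment of the kind Switchable Decision describes is often realized as a router that sends easy inputs down a cheap path and hard ones down an expensive path. The sketch below shows that generic pattern; the difficulty proxy and both models are invented placeholders, not the paper's learned switching policy:

```python
# Generic per-instance switching: route each input to a cheap or a costly
# path based on an estimated difficulty score.
def difficulty(text: str) -> float:
    # Hypothetical proxy: longer inputs are assumed harder.
    return min(len(text.split()) / 50.0, 1.0)

def cheap_model(text):  return f"[small-model output for {len(text)} chars]"
def costly_model(text): return f"[large-model output for {len(text)} chars]"

def generate(text: str, switch_at: float = 0.5) -> str:
    model = costly_model if difficulty(text) >= switch_at else cheap_model
    return model(text)

print(generate("short query"))                  # served by the cheap path
print(generate("a much longer request " * 20))  # served by the costly path
```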
- Multi-Level ML Based Burst-Aware Autoscaling for SLO Assurance and Cost Efficiency [3.5624365288866007]
This paper introduces BAScaler, a Burst-Aware Autoscaling framework for containerized cloud services or applications under complex workloads.
BAScaler incorporates a novel prediction-based burst detection mechanism that distinguishes between predictable periodic workload spikes and actual bursts.
arXiv Detail & Related papers (2024-02-20T12:28:25Z)
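A minimal version of the burst-detection idea summarized above: predict the next sample from the same phase of previous cycles, so recurring periodic spikes are expected, and flag a burst only when the observation exceeds that seasonal prediction by a margin. The period, threshold, and statistics below are illustrative choices, not BAScaler's mechanism:

```python
import statistics

PERIOD = 24     # samples per cycle, e.g. hourly samples with a daily pattern
K_SIGMA = 3.0   # excess over the seasonal prediction that counts as a burst

def is_burst(history: list[float], observed: float) -> bool:
    """Flag `observed` as a burst only if it exceeds what the same phase of
    previous cycles predicts; recurring periodic spikes are not bursts."""
    phase = len(history) % PERIOD
    same_phase = history[phase::PERIOD]   # all past samples at this phase
    if len(same_phase) < 2:
        return False                      # not enough history to judge
    pred = statistics.mean(same_phase)
    spread = statistics.stdev(same_phase) or 1.0
    return observed > pred + K_SIGMA * spread

# A load of 100 at a phase that spiked to ~100 in every previous cycle is
# predicted and not flagged; 100 at a normally quiet phase is a burst.
```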
- Lifelong Learning for Fog Load Balancing: A Transfer Learning Approach [0.7366405857677226]
We improve the performance of privacy-aware Reinforcement Learning (RL) agents that optimize the execution delay of IoT applications by minimizing the waiting delay.
We propose a lifelong learning framework for these agents, where lightweight inference models are used during deployment to minimize action delay and are retrained only in the case of significant environmental changes.
arXiv Detail & Related papers (2023-10-08T14:49:33Z)
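The lifelong-learning scheme above keeps a lightweight model in the serving path and retrains only on significant environmental change. One common way to make "significant change" concrete, used here purely as an illustrative stand-in for the paper's trigger, is a mean-shift test on recent prediction error:

```python
from collections import deque
import statistics

class DriftTriggeredRetrainer:
    """Serve with a frozen lightweight model; retrain only when the recent
    error distribution drifts away from the error seen at deployment time.
    The drift rule (mean shift vs. a baseline window) is an illustrative
    choice, not the paper's specific trigger."""

    def __init__(self, baseline_errors, window=50, tolerance=2.0):
        # baseline_errors: at least two error samples from deployment time.
        self.baseline_mean = statistics.mean(baseline_errors)
        self.baseline_std = statistics.stdev(baseline_errors)
        self.recent = deque(maxlen=window)
        self.tolerance = tolerance

    def record(self, error: float) -> bool:
        """Record one serving-time error; return True if retraining is due."""
        self.recent.append(error)
        if len(self.recent) < self.recent.maxlen:
            return False  # wait until the window fills
        drift = abs(statistics.mean(self.recent) - self.baseline_mean)
        return drift > self.tolerance * self.baseline_std
```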
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
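The ImRE idea, as summarized, is to oversample rare but consequential events while keeping the value estimate unbiased via importance weights. The toy sketch below applies that mechanism to a replay buffer: failure transitions are drawn more often, and each Q-update is reweighted by the ratio of the uniform to the proposal sampling probability; all specifics are illustrative rather than the paper's algorithm:

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95
FAILURE_BOOST = 10.0   # oversample rare failure transitions by this factor

Q = defaultdict(float)  # Q[(state, action)]

def sample(buffer):
    """Draw a transition with probability proportional to its weight
    (failures boosted) and return it with its importance ratio."""
    weights = [FAILURE_BOOST if t["failure"] else 1.0 for t in buffer]
    total = sum(weights)
    t = random.choices(buffer, weights=weights, k=1)[0]
    true_p = 1.0 / len(buffer)  # the uniform-replay probability
    proposal_p = (FAILURE_BOOST if t["failure"] else 1.0) / total
    return t, true_p / proposal_p

def update(buffer, actions):
    t, iw = sample(buffer)
    target = t["reward"] + GAMMA * max(Q[(t["next_state"], a)] for a in actions)
    # The importance weight keeps oversampling from biasing the estimate.
    key = (t["state"], t["action"])
    Q[key] += ALPHA * iw * (target - Q[key])
```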
- A Predictive Autoscaler for Elastic Batch Jobs [8.354712625979776]
Large batch jobs such as Deep Learning, HPC, and Spark require far more computational resources and incur higher cost than conventional online services.
We propose a predictive autoscaler that provides an elastic interface for customers and overprovisions instances ahead of predicted demand.
arXiv Detail & Related papers (2020-10-10T17:35:55Z)
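The predictive-autoscaler idea, forecast demand and provision ahead of it with headroom, reduces to a few lines. The one-step linear-trend forecast and the headroom factor below are illustrative choices, not the paper's predictor:

```python
import math

def forecast_next(demand_history: list[float]) -> float:
    """Illustrative one-step forecast: last value plus the recent trend."""
    if len(demand_history) < 2:
        return demand_history[-1]
    trend = demand_history[-1] - demand_history[-2]
    return max(demand_history[-1] + trend, 0.0)

def instances_needed(demand_history, per_instance_capacity, headroom=1.2):
    """Overprovision: size the pool to forecast * headroom so bursty batch
    arrivals do not queue while new instances boot."""
    predicted = forecast_next(demand_history)
    return math.ceil(predicted * headroom / per_instance_capacity)

print(instances_needed([100, 140, 180], per_instance_capacity=50))  # -> 6
```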
- Optimization-driven Machine Learning for Intelligent Reflecting Surfaces Assisted Wireless Networks [82.33619654835348]
Intelligent reflecting surfaces (IRS) have been employed to reshape wireless channels by controlling the phase shifts of individual scattering elements.
Due to the large number of scattering elements, passive beamforming is typically challenged by high computational complexity.
In this article, we focus on machine learning (ML) approaches to improving performance in IRS-assisted wireless networks.
arXiv Detail & Related papers (2020-08-29T08:39:43Z)
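The IRS passive-beamforming objective in the last entry can be illustrated with the textbook single-user signal model, where choosing each element's phase shift so reflected paths add coherently with the direct path has the closed-form solution theta_n = arg(h_d) - arg(g_n * h_n). The sketch below checks that this alignment beats random phases; the channel model is the standard one, not the article's specific setup:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64                                  # number of IRS scattering elements

h_d = rng.normal() + 1j * rng.normal()  # direct base-station -> user channel
h = rng.normal(size=N) + 1j * rng.normal(size=N)  # BS -> IRS element channels
g = rng.normal(size=N) + 1j * rng.normal(size=N)  # IRS element -> user channels

def received_gain(theta):
    """|direct path + sum of reflected paths under phase shifts theta|."""
    return abs(h_d + np.sum(g * np.exp(1j * theta) * h))

# Closed-form alignment: rotate every reflected path onto the direct path.
theta_opt = np.angle(h_d) - np.angle(g * h)

print(received_gain(theta_opt))                     # coherent combining
print(received_gain(rng.uniform(0, 2 * np.pi, N)))  # random phases, far weaker
```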
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.