KAIROS: Building Cost-Efficient Machine Learning Inference Systems with
Heterogeneous Cloud Resources
- URL: http://arxiv.org/abs/2210.05889v3
- Date: Tue, 2 May 2023 19:39:05 GMT
- Title: KAIROS: Building Cost-Efficient Machine Learning Inference Systems with
Heterogeneous Cloud Resources
- Authors: Baolin Li, Siddharth Samsi, Vijay Gadepally, Devesh Tiwari
- Abstract summary: KAIROS is a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget.
Our evaluation using industry-grade deep learning (DL) models shows that KAIROS yields up to 2X the throughput of an optimal homogeneous solution.
- Score: 10.462798429064277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online inference is becoming a key service product for many businesses,
deployed in cloud platforms to meet customer demands. Despite their
revenue-generation capability, these services need to operate under tight
Quality-of-Service (QoS) and cost budget constraints. This paper introduces
KAIROS, a novel runtime framework that maximizes query throughput while
meeting a QoS target and a cost budget. KAIROS designs and implements novel
techniques to build a pool of heterogeneous compute hardware without online
exploration overhead, and distribute inference queries optimally at runtime.
Our evaluation using industry-grade deep learning (DL) models shows that KAIROS
yields up to 2X the throughput of an optimal homogeneous solution, and
outperforms state-of-the-art schemes by up to 70%, even when the competing
schemes are given the advantage of having their exploration overhead ignored.
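The abstract describes building a pool of heterogeneous instances under a cost budget without online exploration, then distributing queries across it. The paper's actual algorithm is not reproduced here; the following is a minimal illustrative sketch of the pool-sizing intuition only, assuming each instance type has an offline-profiled throughput (queries/sec sustained while meeting the QoS latency target) and a known hourly price. All names and numbers are hypothetical, not taken from the KAIROS paper.

```python
def build_pool(instance_types, budget):
    """Greedily pick instance counts to maximize throughput within a budget.

    instance_types: dict mapping name -> (hourly_cost, qos_throughput),
        where qos_throughput is the profiled QPS under the QoS target.
    budget: total hourly budget in dollars.
    """
    # Rank instance types by throughput per dollar, best first.
    ranked = sorted(instance_types.items(),
                    key=lambda kv: kv[1][1] / kv[1][0],
                    reverse=True)
    pool, remaining = {}, budget
    for name, (cost, tput) in ranked:
        count = int(remaining // cost)  # buy as many as the budget allows
        if count:
            pool[name] = count
            remaining -= count * cost
    return pool

# Hypothetical prices and profiled throughputs, for illustration only.
pool = build_pool(
    {"gpu.large": (3.06, 900.0),
     "cpu.xlarge": (0.34, 120.0)},
    budget=10.0)
```

This sketch covers only offline pool construction; the runtime query-distribution component the abstract mentions is a separate problem, and the real system's decisions need not be greedy.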
Related papers
- Decentralized AI: Permissionless LLM Inference on POKT Network [8.68822221491139]
POKT Network's decentralized Remote Procedure Call infrastructure has surpassed 740 billion requests since launching on MainNet in 2020.
This litepaper illustrates how the network's open-source and permissionless design aligns incentives among model researchers, hardware operators, API providers and users.
arXiv Detail & Related papers (2024-05-30T19:50:07Z)
- Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing [53.748685766139715]
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size.
We propose a hybrid inference approach that combines the respective strengths of small and large models to save cost and maintain quality.
In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
arXiv Detail & Related papers (2024-04-22T23:06:42Z)
- A Learning-based Incentive Mechanism for Mobile AIGC Service in Decentralized Internet of Vehicles [49.86094523878003]
We propose a decentralized incentive mechanism for mobile AIGC service allocation.
We employ multi-agent deep reinforcement learning to find the balance between the supply of AIGC services on RSUs and user demand for services within the IoV context.
arXiv Detail & Related papers (2024-03-29T12:46:07Z)
- A Cost-Aware Mechanism for Optimized Resource Provisioning in Cloud Computing [6.369406986434764]
We propose a novel learning-based resource provisioning approach that provides cost-reduction guarantees for demand.
Our method adapts efficiently to most requirements, and the resulting performance meets our design goals.
arXiv Detail & Related papers (2023-09-20T13:27:30Z)
- Elastic Entangled Pair and Qubit Resource Management in Quantum Cloud Computing [73.7522199491117]
Quantum cloud computing (QCC) offers a promising approach to efficiently provide quantum computing resources.
The fluctuations in user demand and quantum circuit requirements are challenging for efficient resource provisioning.
We propose a resource allocation model to provision quantum computing and networking resources.
arXiv Detail & Related papers (2023-07-25T00:38:46Z)
- How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study [57.97785297481162]
We evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models.
We show how leveraging spot pricing enables a new cost-efficient way to train models with multiple cheap instances, outperforming both more centralized, powerful hardware and even on-demand cloud offerings at competitive prices.
arXiv Detail & Related papers (2023-06-05T18:17:37Z)
- Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
- CILP: Co-simulation based Imitation Learner for Dynamic Resource Provisioning in Cloud Computing Environments [13.864161788250856]
A key challenge for latency-critical tasks is predicting future workload demands so that resources can be provisioned proactively.
Existing AI-based solutions tend to not holistically consider all crucial aspects such as provision overheads, heterogeneous VM costs and Quality of Service (QoS) of the cloud system.
We propose a novel method, called CILP, that formulates the VM provisioning problem as two sub-problems of prediction and optimization.
arXiv Detail & Related papers (2023-02-11T09:15:34Z)
- RIBBON: Cost-Effective and QoS-Aware Deep Learning Model Inference using a Diverse Pool of Cloud Computing Instances [7.539635201319158]
RIBBON is a novel deep learning inference serving system.
It meets two competing objectives: quality-of-service (QoS) target and cost-effectiveness.
arXiv Detail & Related papers (2022-07-23T06:45:14Z)
- Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures [9.178035808110124]
JellyBean is a framework for serving and optimizing machine learning inference on heterogeneous infrastructures.
We show that JellyBean reduces the total serving cost of visual question answering by up to 58%, and vehicle tracking from the NVIDIA AI City Challenge by up to 36%.
arXiv Detail & Related papers (2022-05-10T07:32:32Z)
- Distributed Deep Learning in Open Collaborations [49.240611132653456]
We propose a novel algorithmic framework designed specifically for collaborative training.
We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost.
arXiv Detail & Related papers (2021-06-18T16:23:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.