KAIROS: Building Cost-Efficient Machine Learning Inference Systems with
Heterogeneous Cloud Resources
- URL: http://arxiv.org/abs/2210.05889v3
- Date: Tue, 2 May 2023 19:39:05 GMT
- Title: KAIROS: Building Cost-Efficient Machine Learning Inference Systems with
Heterogeneous Cloud Resources
- Authors: Baolin Li, Siddharth Samsi, Vijay Gadepally, Devesh Tiwari
- Abstract summary: KAIROS is a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget.
Our evaluation using industry-grade deep learning (DL) models shows that KAIROS yields up to 2X the throughput of an optimal homogeneous solution.
- Score: 10.462798429064277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online inference is becoming a key service product for many businesses,
deployed in cloud platforms to meet customer demands. Despite their
revenue-generation capability, these services need to operate under tight
Quality-of-Service (QoS) and cost budget constraints. This paper introduces
KAIROS, a novel runtime framework that maximizes query throughput while
meeting a QoS target and a cost budget. KAIROS designs and implements novel
techniques to build a pool of heterogeneous compute hardware without online
exploration overhead, and distribute inference queries optimally at runtime.
Our evaluation using industry-grade deep learning (DL) models shows that KAIROS
yields up to 2X the throughput of an optimal homogeneous solution, and
outperforms state-of-the-art schemes by up to 70%, even when the competing
schemes are given the advantage of having their exploration overhead ignored.
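The abstract describes building a pool of heterogeneous instances under a cost budget without online exploration, then distributing queries across it. The paper's actual algorithm is not reproduced here; the following is a minimal illustrative sketch of the pool-sizing intuition only, assuming each instance type has an offline-profiled throughput (queries/sec sustained while meeting the QoS latency target) and a known hourly price. All names and numbers are hypothetical, not taken from the KAIROS paper.

```python
def build_pool(instance_types, budget):
    """Greedily pick instance counts to maximize throughput within a budget.

    instance_types: dict mapping name -> (hourly_cost, qos_throughput),
        where qos_throughput is the profiled QPS under the QoS target.
    budget: total hourly budget in dollars.
    """
    # Rank instance types by throughput per dollar, best first.
    ranked = sorted(instance_types.items(),
                    key=lambda kv: kv[1][1] / kv[1][0],
                    reverse=True)
    pool, remaining = {}, budget
    for name, (cost, tput) in ranked:
        count = int(remaining // cost)  # buy as many as the budget allows
        if count:
            pool[name] = count
            remaining -= count * cost
    return pool

# Hypothetical prices and profiled throughputs, for illustration only.
pool = build_pool(
    {"gpu.large": (3.06, 900.0),
     "cpu.xlarge": (0.34, 120.0)},
    budget=10.0)
```

This sketch covers only offline pool construction; the runtime query-distribution component the abstract mentions is a separate problem, and the real system's decisions need not be greedy.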
Related papers
- Decentralized AI: Permissionless LLM Inference on POKT Network [8.68822221491139]
POKT Network's decentralized Remote Procedure Call infrastructure has surpassed 740 billion requests since launching on MainNet in 2020.
This litepaper illustrates how the network's open-source and permissionless design aligns incentives among model researchers, hardware operators, API providers and users.
arXiv Detail & Related papers (2024-05-30T19:50:07Z)
- Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing [53.748685766139715]
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size.
We propose a hybrid inference approach that combines the respective strengths of small and large models to save cost and maintain quality.
In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
arXiv Detail & Related papers (2024-04-22T23:06:42Z)
- A Learning-based Incentive Mechanism for Mobile AIGC Service in Decentralized Internet of Vehicles [49.86094523878003]
We propose a decentralized incentive mechanism for mobile AIGC service allocation.
We employ multi-agent deep reinforcement learning to find the balance between the supply of AIGC services on RSUs and user demand for services within the IoV context.
arXiv Detail & Related papers (2024-03-29T12:46:07Z)
- A Cost-Aware Mechanism for Optimized Resource Provisioning in Cloud Computing [6.369406986434764]
We propose a novel learning-based resource provisioning approach that provides cost-reduction guarantees for demand.
Our method adapts efficiently to most requirements, and the resulting performance meets our design goals.
arXiv Detail & Related papers (2023-09-20T13:27:30Z)
- Elastic Entangled Pair and Qubit Resource Management in Quantum Cloud Computing [73.7522199491117]
Quantum cloud computing (QCC) offers a promising approach to efficiently provide quantum computing resources.
The fluctuations in user demand and quantum circuit requirements are challenging for efficient resource provisioning.
We propose a resource allocation model to provision quantum computing and networking resources.
arXiv Detail & Related papers (2023-07-25T00:38:46Z)
- How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study [57.97785297481162]
We evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models.
We show how leveraging spot pricing enables a new cost-efficient way to train models with multiple cheap instances, outperforming both more centralized, powerful hardware and even on-demand cloud offerings at competitive prices.
arXiv Detail & Related papers (2023-06-05T18:17:37Z)
- Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
- CILP: Co-simulation based Imitation Learner for Dynamic Resource Provisioning in Cloud Computing Environments [13.864161788250856]
A key challenge for latency-critical tasks is predicting future workload demands so that resources can be provisioned proactively.
Existing AI-based solutions tend to not holistically consider all crucial aspects such as provision overheads, heterogeneous VM costs and Quality of Service (QoS) of the cloud system.
We propose a novel method, called CILP, that formulates the VM provisioning problem as two sub-problems of prediction and optimization.
arXiv Detail & Related papers (2023-02-11T09:15:34Z)
- RIBBON: Cost-Effective and QoS-Aware Deep Learning Model Inference using a Diverse Pool of Cloud Computing Instances [7.539635201319158]
RIBBON is a novel deep learning inference serving system.
It meets two competing objectives: quality-of-service (QoS) target and cost-effectiveness.
arXiv Detail & Related papers (2022-07-23T06:45:14Z)
- Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures [9.178035808110124]
JellyBean is a framework for serving and optimizing machine learning inference on heterogeneous infrastructures.
We show that JellyBean reduces the total serving cost of visual question answering by up to 58%, and vehicle tracking from the NVIDIA AI City Challenge by up to 36%.
arXiv Detail & Related papers (2022-05-10T07:32:32Z)
- Distributed Deep Learning in Open Collaborations [49.240611132653456]
We propose a novel algorithmic framework designed specifically for collaborative training.
We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost.
arXiv Detail & Related papers (2021-06-18T16:23:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.