Multi-Level ML Based Burst-Aware Autoscaling for SLO Assurance and Cost
Efficiency
- URL: http://arxiv.org/abs/2402.12962v1
- Date: Tue, 20 Feb 2024 12:28:25 GMT
- Title: Multi-Level ML Based Burst-Aware Autoscaling for SLO Assurance and Cost
Efficiency
- Authors: Chunyang Meng, Haogang Tong, Tianyang Wu, Maolin Pan, Yang Yu
- Abstract summary: This paper introduces BAScaler, a Burst-Aware Autoscaling framework for containerized cloud services or applications under complex workloads.
BAScaler incorporates a novel prediction-based burst detection mechanism that distinguishes between predictable periodic workload spikes and actual bursts.
- Score: 3.5624365288866007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoscaling is a technique that automatically scales the resources
provided to applications, without human intervention, to guarantee runtime
Quality of Service (QoS) while saving costs. However, user-facing cloud
applications serve dynamic workloads that are highly variable and often contain
bursts, which makes it challenging for autoscaling to keep QoS within
Service-Level Objectives (SLOs). Conservative strategies risk over-provisioning,
while aggressive ones may cause SLO violations, making effective autoscaling
difficult to design. This paper introduces BAScaler, a Burst-Aware Autoscaling
framework for containerized cloud services or applications under complex
workloads, combining multi-level machine learning (ML) techniques to mitigate
SLO violations while saving costs. BAScaler incorporates a novel
prediction-based burst detection mechanism that distinguishes between
predictable periodic workload spikes and actual bursts. When bursts are
detected, BAScaler appropriately overestimates them and allocates resources
accordingly to address the rapid growth in resource demand. During non-burst
periods, BAScaler instead employs reinforcement learning to correct potential
inaccuracies in resource estimation, enabling more precise resource allocation.
Experiments across ten real-world workloads demonstrate BAScaler's
effectiveness, achieving a 57% average reduction in SLO violations and cutting
resource costs by 10% compared to other prominent methods.
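To make the mechanism concrete, the control loop described above can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical upstream forecaster that supplies predicted request rates, a fixed per-replica capacity, and a single multiplicative factor standing in for the reinforcement-learning correction; none of these names or constants come from the paper.

```python
import numpy as np

POD_CAPACITY_RPS = 100.0   # assumed requests/sec one replica can serve within the SLO
BURST_HEADROOM = 1.5       # extra headroom applied when a burst is detected (assumption)

def detect_burst(observed_rps, predicted_rps, recent_residuals, k=3.0):
    """Flag a burst when the prediction error exceeds k standard deviations of
    recent residuals, i.e. the spike is not explained by the periodic forecast."""
    residual = observed_rps - predicted_rps
    return residual > k * (np.std(recent_residuals) + 1e-6)

def target_replicas(observed_rps, predicted_rps, recent_residuals, rl_adjustment=1.0):
    """Overestimate demand during detected bursts; otherwise follow the forecast
    scaled by a learned multiplicative correction (a stand-in for the RL policy)."""
    if detect_burst(observed_rps, predicted_rps, recent_residuals):
        demand = max(observed_rps, predicted_rps) * BURST_HEADROOM
    else:
        demand = predicted_rps * rl_adjustment
    return max(1, int(np.ceil(demand / POD_CAPACITY_RPS)))

# One control step with made-up numbers: a spike far above the forecast triggers a burst.
residuals = [2.0, -3.0, 1.5, 0.5, -1.0]
print(target_replicas(observed_rps=950.0, predicted_rps=400.0, recent_residuals=residuals))
```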
Related papers
- OptScaler: A Hybrid Proactive-Reactive Framework for Robust Autoscaling
in the Cloud [11.340252931723063]
Autoscaling is a vital mechanism in cloud computing that supports the autonomous adjustment of computing resources under dynamic workloads.
Existing proactive autoscaling methods anticipate the future workload and scale the resources in advance, whereas reactive methods rely on real-time system feedback.
This paper presents OptScaler, a hybrid autoscaling framework that integrates proactive and reactive methods to regulate CPU utilization.
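As a rough illustration of the hybrid idea (not OptScaler's actual controller), a proactive estimate from a workload forecast can be blended with a reactive, utilization-based rule; the helper name, 50/50 weighting, and constants below are invented for the example.

```python
def hybrid_scale(current_replicas, cpu_util, target_util, forecast_rps, per_replica_rps):
    """Blend a proactive estimate (from a workload forecast) with a reactive
    correction (the classic utilization-ratio rule) into one replica count."""
    proactive = forecast_rps / per_replica_rps               # replicas the forecast calls for
    reactive = current_replicas * (cpu_util / target_util)   # scale by observed CPU pressure
    return max(1, round(0.5 * proactive + 0.5 * reactive))   # equal weighting, illustrative only

print(hybrid_scale(current_replicas=8, cpu_util=0.85, target_util=0.60,
                   forecast_rps=1200, per_replica_rps=100))  # -> 12
```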
arXiv Detail & Related papers (2023-10-26T04:38:48Z)
- Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems [0.0]
InfAdapter proactively selects a set of ML model variants, together with their resource allocations, to meet the latency SLO.
It reduces SLO violations and cost by up to 65% and 33%, respectively, compared to a popular industry autoscaler.
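A toy sketch of this kind of variant selection follows; the variant table, metrics, and greedy rule are hypothetical, not InfAdapter's actual algorithm.

```python
# Hypothetical model variants with accuracy, tail latency, and cost per replica-hour.
VARIANTS = {
    "resnet18":  {"accuracy": 0.70, "p99_ms": 40,  "cost": 0.10},
    "resnet50":  {"accuracy": 0.76, "p99_ms": 90,  "cost": 0.25},
    "resnet152": {"accuracy": 0.78, "p99_ms": 180, "cost": 0.60},
}

def pick_variant(latency_slo_ms):
    """Among variants that meet the latency SLO, prefer higher accuracy and,
    on ties, lower cost -- a greedy stand-in for variant selection."""
    feasible = [(v["accuracy"], -v["cost"], name)
                for name, v in VARIANTS.items() if v["p99_ms"] <= latency_slo_ms]
    return max(feasible)[2] if feasible else None   # None: no variant can meet the SLO

print(pick_variant(latency_slo_ms=100))  # -> "resnet50" under these made-up numbers
```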
arXiv Detail & Related papers (2023-04-21T11:19:49Z)
- Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
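Leaving the learning machinery aside, the objective can be illustrated with a simple cost model that places a job in the data center with the lowest combined energy-price and carbon cost. The figures and the CARBON_PRICE weighting below are assumptions for illustration only; the paper learns this trade-off with a multi-agent actor-critic policy rather than a greedy rule.

```python
# Hypothetical data centers with electricity price ($/kWh), carbon intensity (gCO2/kWh), and free GPUs.
DATA_CENTERS = {
    "us-west":  {"price": 0.08, "carbon": 350, "free_gpus": 64},
    "eu-north": {"price": 0.11, "carbon": 40,  "free_gpus": 16},
    "ap-south": {"price": 0.06, "carbon": 700, "free_gpus": 128},
}
CARBON_PRICE = 0.0001  # assumed $ per gCO2, converts emissions into a comparable cost

def place_job(energy_kwh, gpus_needed):
    """Greedy placement: cheapest (energy + carbon) cost among DCs with enough free GPUs."""
    feasible = {name: dc for name, dc in DATA_CENTERS.items() if dc["free_gpus"] >= gpus_needed}
    if not feasible:
        return None  # would have to queue the job
    return min(feasible, key=lambda n: energy_kwh * (feasible[n]["price"]
                                                     + CARBON_PRICE * feasible[n]["carbon"]))

print(place_job(energy_kwh=500, gpus_needed=32))  # -> "us-west" with these made-up numbers
```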
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
- Guaranteed Dynamic Scheduling of Ultra-Reliable Low-Latency Traffic via Conformal Prediction [72.59079526765487]
The dynamic scheduling of ultra-reliable and low-latency traffic (URLLC) in the uplink can significantly enhance the efficiency of coexisting services.
The main challenge is posed by the uncertainty in the process of URLLC packet generation.
We introduce a novel scheduler for URLLC packets that provides formal guarantees on reliability and latency irrespective of the quality of the URLLC traffic predictor.
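The flavor of such a guarantee can be sketched with a split-conformal bound on the predicted number of URLLC packets; the calibration data and target rate below are invented, and the paper's scheduler is considerably more involved.

```python
import math

def conformal_reservation(predicted, calibration_errors, alpha=0.1):
    """Split-conformal style bound: add the (1 - alpha) empirical quantile of past
    (actual - predicted) errors, so the reserved capacity covers the true demand
    with probability about 1 - alpha, however inaccurate the predictor is."""
    n = len(calibration_errors)
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)   # conservative finite-sample rank
    q = sorted(calibration_errors)[rank - 1]
    return predicted + max(q, 0)

# Reserve uplink slots for the next frame from made-up calibration data.
errors = [0, 1, 0, 2, 1, 0, 3, 1, 0, 2]   # actual minus predicted URLLC packet counts
print(conformal_reservation(predicted=4, calibration_errors=errors, alpha=0.1))  # -> 7
```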
arXiv Detail & Related papers (2023-02-15T14:09:55Z)
- TransPath: Learning Heuristics For Grid-Based Pathfinding via Transformers [64.88759709443819]
We suggest learning instance-dependent proxies intended to notably increase the efficiency of the search.
The first proxy is the correction factor, i.e. the ratio between the instance-independent cost-to-go estimate and the perfect one.
The second proxy is the path probability, which indicates how likely a grid cell is to lie on the shortest path.
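As a hedged sketch of how a learned correction factor could be consumed by a planner, the admissible Manhattan heuristic can be divided by the predicted estimate-to-perfect ratio, pushing it toward the true cost-to-go. The paper predicts this factor with a transformer; a plain dictionary stands in for it here.

```python
import heapq

def astar(grid, start, goal, correction):
    """A* on a 4-connected grid whose Manhattan heuristic is divided by a predicted
    correction factor (estimate / perfect); correction = 1.0 falls back to plain A*,
    while learned values tighten the heuristic at the cost of strict optimality."""
    def h(cell):
        manhattan = abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])
        return manhattan / correction.get(cell, 1.0)
    open_set, g = [(h(start), start)], {start: 0}
    while open_set:
        _, cur = heapq.heappop(open_set)
        if cur == goal:
            return g[cur]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and g[cur] + 1 < g.get(nxt, float("inf"))):
                g[nxt] = g[cur] + 1
                heapq.heappush(open_set, (g[nxt] + h(nxt), nxt))
    return None

free = [[0] * 5 for _ in range(5)]
print(astar(free, (0, 0), (4, 4), correction={}))  # -> 8 on an empty 5x5 grid
```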
arXiv Detail & Related papers (2022-12-22T14:26:11Z)
- Learning Dynamic Mechanisms in Unknown Environments: A Reinforcement Learning Approach [130.9259586568977]
We propose novel learning algorithms to recover the dynamic Vickrey-Clarke-Groves (VCG) mechanism over multiple rounds of interaction.
A key contribution of our approach is incorporating reward-free online Reinforcement Learning (RL) to aid exploration over a rich policy space.
arXiv Detail & Related papers (2022-02-25T16:17:23Z)
- PROMPT: Learning Dynamic Resource Allocation Policies for Network Applications [16.812611987082082]
We propose PROMPT, a novel resource allocation framework using proactive prediction to guide a reinforcement learning controller.
We show that PROMPT incurs 4.2x fewer violations, reduces severity of policy violations by 12.7x, improves best-effort workload performance, and improves overall power efficiency over prior work.
arXiv Detail & Related papers (2022-01-19T23:34:34Z)
- Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to degradation or even collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
- Coordinated Online Learning for Multi-Agent Systems with Coupled Constraints and Perturbed Utility Observations [91.02019381927236]
We introduce a novel method to steer the agents toward a stable population state, fulfilling the given resource constraints.
The proposed method is a decentralized resource pricing method based on the resource loads resulting from the augmentation of the game's Lagrangian.
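The pricing idea can be sketched as a dual-ascent style update in which each resource's price rises with its excess load and decays toward zero otherwise; the step size and capacities below are illustrative assumptions, not the paper's exact update.

```python
def update_prices(prices, loads, capacities, step=0.1):
    """Dual-ascent style pricing: raise the price of an over-utilized resource in
    proportion to its excess load, and let it fall (never below zero) when slack remains."""
    return {r: max(0.0, prices[r] + step * (loads[r] - capacities[r]))
            for r in prices}

prices = {"cpu": 0.0, "bandwidth": 0.0}
loads = {"cpu": 120.0, "bandwidth": 80.0}
caps = {"cpu": 100.0, "bandwidth": 100.0}
for _ in range(3):            # agents would re-optimize against these prices each round
    prices = update_prices(prices, loads, caps)
print(prices)                 # cpu price rises; bandwidth price stays at zero
```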
arXiv Detail & Related papers (2020-10-21T10:11:17Z)
- A Predictive Autoscaler for Elastic Batch Jobs [8.354712625979776]
Large batch jobs such as Deep Learning, HPC, and Spark require far more computational resources, at higher cost, than conventional online services.
We propose a predictive autoscaler that provides an elastic interface for customers and overprovisions instances.
arXiv Detail & Related papers (2020-10-10T17:35:55Z)
- ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud Resources [2.205500582481277]
We propose a Reinforcement Learning strategy for optimizing the utilization of ephemeral resources in the cloud.
Our solution significantly reduces SLA violation penalties, by 2.7x on average and up to 3.4x.
It also considerably improves the cloud providers' (CPs') potential savings, by 27.6% on average and up to 43.6%.
arXiv Detail & Related papers (2020-09-23T15:19:28Z)