ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization
Of Ephemeral Cloud Resources
- URL: http://arxiv.org/abs/2009.11208v4
- Date: Thu, 10 Dec 2020 10:48:38 GMT
- Title: ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization
Of Ephemeral Cloud Resources
- Authors: Mohamed Handaoui and Jean-Emile Dartois and Jalil Boukhobza and
Olivier Barais and Laurent d'Orazio
- Abstract summary: We propose a Reinforcement Learning strategy for optimizing the utilization of ephemeral resources in the cloud.
Our solution significantly reduces SLA violation penalties, by 2.7x on average and up to 3.4x.
It also considerably improves cloud providers' (CPs') potential savings, by 27.6% on average and up to 43.6%.
- Score: 2.205500582481277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cloud data center capacities are over-provisioned to handle demand peaks and
hardware failures, which leads to low resource utilization. One way to improve
resource utilization and thus reduce the total cost of ownership is to offer
unused resources (referred to as ephemeral resources) at a lower price.
However, resold resources must still meet customers' expectations in
terms of Quality of Service. The goal is thus to maximize the amount of reclaimed
resources while avoiding SLA penalties. To achieve that, cloud providers have
to estimate their future utilization to provide availability guarantees. The
prediction should consider a safety margin for resources to react to
unpredictable workloads. The challenge is to find the safety margin that
provides the best trade-off between the amount of resources to reclaim and the
risk of SLA violations. Most state-of-the-art solutions consider a fixed safety
margin for all types of metrics (e.g., CPU, RAM). However, a single fixed
margin does not account for workload variations over time, which may lead
to SLA violations and/or poor utilization. To tackle these challenges,
we propose ReLeaSER, a Reinforcement Learning strategy for optimizing the
utilization of ephemeral resources in the cloud. ReLeaSER dynamically tunes the
safety margin at the host level for each resource metric. The strategy learns
from past prediction errors (that caused SLA violations). Our solution
significantly reduces SLA violation penalties, by 2.7x on average and up to 3.4x. It
also considerably improves cloud providers' (CPs') potential savings, by 27.6% on average and
up to 43.6%.
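The trade-off described in the abstract (a larger safety margin reclaims less but risks fewer SLA violations) can be illustrated with a small sketch. This is not the authors' algorithm: it is a simplified, bandit-style value estimate kept per host and per metric, and the candidate margins, the reward shape, the `penalty` weight, and all names (`MARGINS`, `reclaimable`, `reward`, `MarginTuner`) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical candidate safety margins, as a fraction of host capacity.
MARGINS = (0.05, 0.10, 0.20, 0.30)


def reclaimable(capacity, predicted, margin):
    """Amount offered as ephemeral: capacity minus predicted use minus the margin."""
    return max(0.0, capacity - predicted - margin * capacity)


def reward(capacity, predicted, actual, margin, penalty=3.0):
    """Reward reclaimed resources; penalize the part that was resold but not actually free."""
    offered = reclaimable(capacity, predicted, margin)
    free = capacity - actual
    violation = max(0.0, offered - free)
    return (offered - penalty * violation) / capacity


@dataclass
class MarginTuner:
    """One tuner per (host, metric): a running value estimate for each candidate margin."""
    alpha: float = 0.5  # learning rate (assumed)
    values: dict = field(default_factory=lambda: {m: 0.0 for m in MARGINS})

    def update(self, capacity, predicted, actual):
        # Score every candidate margin against the observed prediction error
        # and move its value estimate toward that reward.
        for m in MARGINS:
            r = reward(capacity, predicted, actual, m)
            self.values[m] += self.alpha * (r - self.values[m])

    def best_margin(self):
        return max(self.values, key=self.values.get)


# Separate tuners for each metric of a (hypothetical) host.
tuners = {("host-1", "cpu"): MarginTuner(), ("host-1", "ram"): MarginTuner()}
tuners[("host-1", "cpu")].update(capacity=100, predicted=50, actual=52)  # small error
tuners[("host-1", "ram")].update(capacity=100, predicted=50, actual=65)  # large under-prediction
```

In this toy setting the CPU tuner settles on the smallest margin because its predictions were nearly right, while the RAM tuner prefers a larger one after a large under-prediction, which is the kind of per-metric adaptation a single fixed margin cannot provide.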
Related papers
- Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability.
Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs.
It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
- Multi-Level ML Based Burst-Aware Autoscaling for SLO Assurance and Cost Efficiency [3.5624365288866007]
This paper introduces BAScaler, a Burst-Aware Autoscaling framework for containerized cloud services or applications under complex workloads.
BAScaler incorporates a novel prediction-based burst detection mechanism that distinguishes between predictable periodic workload spikes and actual bursts.
arXiv Detail & Related papers (2024-02-20T12:28:25Z)
- Secure Deep Reinforcement Learning for Dynamic Resource Allocation in Wireless MEC Networks [46.689212344009015]
This paper proposes a blockchain-secured deep reinforcement learning (BC-DRL) optimization framework for data management and resource allocation in mobile edge computing networks.
We design a low-latency reputation-based proof-of-stake (RPoS) consensus protocol to select highly reliable blockchain-enabled BSs.
We provide extensive simulation results and analysis to validate that our BC-DRL framework achieves higher security, reliability, and resource utilization efficiency than benchmark blockchain consensus protocols and MEC resource allocation algorithms.
arXiv Detail & Related papers (2023-12-13T09:39:32Z)
- Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems [0.0]
InfAdapter proactively selects a set of ML model variants with their resource allocations to meet latency SLO.
It decreases SLO violations and costs by up to 65% and 33%, respectively, compared to a popular industry autoscaler.
arXiv Detail & Related papers (2023-04-21T11:19:49Z)
- RISCLESS: A Reinforcement Learning Strategy to Exploit Unused Cloud Resources [0.44634886884474834]
One of the main objectives of Cloud Providers (CPs) is to guarantee the Service-Level Agreement (SLA) of customers.
This paper proposes RISCLESS, a Reinforcement Learning strategy to exploit unused Cloud resources.
arXiv Detail & Related papers (2022-04-28T06:49:24Z)
- PROMPT: Learning Dynamic Resource Allocation Policies for Network Applications [16.812611987082082]
We propose PROMPT, a novel resource allocation framework using proactive prediction to guide a reinforcement learning controller.
We show that PROMPT incurs 4.2x fewer violations, reduces severity of policy violations by 12.7x, improves best-effort workload performance, and improves overall power efficiency over prior work.
arXiv Detail & Related papers (2022-01-19T23:34:34Z)
- An Intelligent Resource Reservation for Crowdsourced Live Video Streaming Applications in Geo-Distributed Cloud Environment [45.61165288624505]
We introduce a machine-learning based predictive resource allocation framework for geo-distributed cloud sites.
First, we present an offline optimization that decides the required resources in distributed regions near the viewers.
Second, we use machine learning to build forecasting models that proactively predict the resources to be reserved at each cloud site ahead of time.
arXiv Detail & Related papers (2021-06-04T11:45:09Z)
- Coordinated Online Learning for Multi-Agent Systems with Coupled Constraints and Perturbed Utility Observations [91.02019381927236]
We introduce a novel method to steer the agents toward a stable population state, fulfilling the given resource constraints.
The proposed method is a decentralized resource pricing method based on the resource loads resulting from the augmentation of the game's Lagrangian.
arXiv Detail & Related papers (2020-10-21T10:11:17Z)
- A Predictive Autoscaler for Elastic Batch Jobs [8.354712625979776]
Large batch jobs such as Deep Learning, HPC and Spark require far more computational resources and incur higher costs than conventional online services.
We propose a predictive autoscaler to provide an elastic interface for the customers and overprovision instances.
arXiv Detail & Related papers (2020-10-10T17:35:55Z)
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors.
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
- Hierarchical Adaptive Contextual Bandits for Resource Constraint based Recommendation [49.69139684065241]
Contextual multi-armed bandit (MAB) achieves cutting-edge performance on a variety of problems.
In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) to conduct the policy learning of contextual bandits with a budget constraint.
arXiv Detail & Related papers (2020-04-02T17:04:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.