Related papers: ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud Resources

ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud Resources

URL: http://arxiv.org/abs/2009.11208v4
Date: Thu, 10 Dec 2020 10:48:38 GMT
Title: ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud Resources
Authors: Mohamed Handaoui and Jean-Emile Dartois and Jalil Boukhobza and Olivier Barais and Laurent d'Orazio
Abstract summary: We propose a Reinforcement Learning strategy for optimizing the ephemeral resources' utilization in the cloud. Our solution reduces significantly the SLA violation penalties on average by 2.7x and up to 3.4x. It also improves considerably the CPs' potential savings by 27.6% on average and up to 43.6%.
Score: 2.205500582481277
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cloud data center capacities are over-provisioned to handle demand peaks and hardware failures which leads to low resources' utilization. One way to improve resource utilization and thus reduce the total cost of ownership is to offer unused resources (referred to as ephemeral resources) at a lower price. However, reselling resources needs to meet the expectations of its customers in terms of Quality of Service. The goal is so to maximize the amount of reclaimed resources while avoiding SLA penalties. To achieve that, cloud providers have to estimate their future utilization to provide availability guarantees. The prediction should consider a safety margin for resources to react to unpredictable workloads. The challenge is to find the safety margin that provides the best trade-off between the amount of resources to reclaim and the risk of SLA violations. Most state-of-the-art solutions consider a fixed safety margin for all types of metrics (e.g., CPU, RAM). However, a unique fixed margin does not consider various workloads variations over time which may lead to SLA violations or/and poor utilization. In order to tackle these challenges, we propose ReLeaSER, a Reinforcement Learning strategy for optimizing the ephemeral resources' utilization in the cloud. ReLeaSER dynamically tunes the safety margin at the host-level for each resource metric. The strategy learns from past prediction errors (that caused SLA violations). Our solution reduces significantly the SLA violation penalties on average by 2.7x and up to 3.4x. It also improves considerably the CPs' potential savings by 27.6% on average and up to 43.6%.

Related papers

Intelligent Resource Allocation Optimization for Cloud Computing via Machine Learning [11.657154571216234]
This paper proposes an intelligent resource allocation algorithm that leverages deep learning (LSTM) for demand prediction and reinforcement learning (DQN) for dynamic scheduling. The proposed system enhances resource utilization by 32.5%, reduces average response time by 43.3%, and lowers operational costs by 26.6%.
arXiv Detail & Related papers (2025-03-21T23:06:43Z)
Network Resource Optimization for ML-Based UAV Condition Monitoring with Vibration Analysis [54.550658461477106]
Condition Monitoring (CM) uses Machine Learning (ML) models to identify abnormal and adverse conditions. This work explores the optimization of network resources for ML-based UAV CM frameworks. By leveraging dimensionality reduction techniques, there is a 99.9% reduction in network resource consumption.
arXiv Detail & Related papers (2025-02-21T14:36:12Z)
Secure Resource Allocation via Constrained Deep Reinforcement Learning [49.15061461220109]
We present SARMTO, a framework that balances resource allocation, task offloading, security, and performance. SARMTO consistently outperforms five baseline approaches, achieving up to a 40% reduction in system costs. These enhancements highlight SARMTO's potential to revolutionize resource management in intricate distributed computing environments.
arXiv Detail & Related papers (2025-01-20T15:52:43Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
Multi-Level ML Based Burst-Aware Autoscaling for SLO Assurance and Cost Efficiency [3.5624365288866007]
This paper introduces BAScaler, a Burst-Aware Autoscaling framework for containerized cloud services or applications under complex workloads. BAScaler incorporates a novel prediction-based burst detection mechanism that distinguishes between predictable periodic workload spikes and actual bursts.
arXiv Detail & Related papers (2024-02-20T12:28:25Z)
A Federated Framework for LLM-based Recommendation [65.12855401912948]
Large Language Models (LLMs) have empowered generative recommendation systems through fine-tuning user behavior data. utilizing the user data may pose significant privacy risks, potentially leading to ethical dilemmas and violations of data protection regulations. To address the privacy concerns, Federated Learning for Recommendation (Fed4Rec) has been identified as a promising solution.
arXiv Detail & Related papers (2024-02-15T14:09:28Z)
Secure Deep Reinforcement Learning for Dynamic Resource Allocation in Wireless MEC Networks [46.689212344009015]
This paper proposes a blockchain-secured deep reinforcement learning (BC-DRL) optimization framework for data management and resource allocation in mobile edge computing networks. We design a low-latency reputation-based proof-of-stake (RPoS) consensus protocol to select highly reliable blockchain-enabled BSs. We provide extensive simulation results and analysis to validate that our BC-DRL framework achieves higher security, reliability, and resource utilization efficiency than benchmark blockchain consensus protocols and MEC resource allocation algorithms.
arXiv Detail & Related papers (2023-12-13T09:39:32Z)
Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems [0.0]
InfAdapter proactively selects a set of ML model variants with their resource allocations to meet latency SLO. It decreases SLO violation and costs up to 65% and 33%, respectively, compared to a popular industry autoscaler.
arXiv Detail & Related papers (2023-04-21T11:19:49Z)
RISCLESS: A Reinforcement Learning Strategy to Exploit Unused Cloud Resources [0.44634886884474834]
One of the main objectives of Cloud Providers (CPs) is to guarantee the Service-Level Agreement (SLA) of customers. This paper proposes RISCLESS, a Reinforcement Learning strategy to exploit unused Cloud resources.
arXiv Detail & Related papers (2022-04-28T06:49:24Z)
PROMPT: Learning Dynamic Resource Allocation Policies for Network Applications [16.812611987082082]
We propose PROMPT, a novel resource allocation framework using proactive prediction to guide a reinforcement learning controller. We show that PROMPT incurs 4.2x fewer violations, reduces severity of policy violations by 12.7x, improves best-effort workload performance, and improves overall power efficiency over prior work.
arXiv Detail & Related papers (2022-01-19T23:34:34Z)
An Intelligent Resource Reservation for Crowdsourced Live Video Streaming Applications in Geo-Distributed Cloud Environment [45.61165288624505]
We introduce a machine-learning based predictive resource allocation framework for geo-distributed cloud sites. First, we present an offline optimization that decides the required resources in distributed regions near the viewers. Second, we use machine learning to build forecasting models that proactively predict the resources to be reserved at each cloud site ahead of time.
arXiv Detail & Related papers (2021-06-04T11:45:09Z)
Coordinated Online Learning for Multi-Agent Systems with Coupled Constraints and Perturbed Utility Observations [91.02019381927236]
We introduce a novel method to steer the agents toward a stable population state, fulfilling the given resource constraints. The proposed method is a decentralized resource pricing method based on the resource loads resulting from the augmentation of the game's Lagrangian.
arXiv Detail & Related papers (2020-10-21T10:11:17Z)
A Predictive Autoscaler for Elastic Batch Jobs [8.354712625979776]
Large batch jobs such as Deep Learning, HPC and Spark require far more computational resources and higher cost than conventional online service. We propose a predictive autoscaler to provide an elastic interface for the customers and overprovision instances.
arXiv Detail & Related papers (2020-10-10T17:35:55Z)
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors. Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
Hierarchical Adaptive Contextual Bandits for Resource Constraint based Recommendation [49.69139684065241]
Contextual multi-armed bandit (MAB) achieves cutting-edge performance on a variety of problems. In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) to conduct the policy learning of contextual bandits with a budget constraint.
arXiv Detail & Related papers (2020-04-02T17:04:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.