RAPID: Enabling Fast Online Policy Learning in Dynamic Public Cloud
Environments
- URL: http://arxiv.org/abs/2304.04797v2
- Date: Mon, 4 Sep 2023 01:24:10 GMT
- Authors: Drew Penney, Bin Li, Lizhong Chen, Jaroslaw J. Sydir, Anna
Drewek-Ossowicka, Ramesh Illikkal, Charlie Tai, Ravi Iyer, Andrew Herdrich
- Abstract summary: We propose a novel framework for fast, fully-online resource allocation policy learning in dynamic operating environments.
We show that our framework can learn stable resource allocation policies in minutes, as compared with hours in prior state-of-the-art.
- Score: 7.825552412435501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Resource sharing between multiple workloads has become a prominent practice
among cloud service providers, motivated by demand for improved resource
utilization and reduced cost of ownership. Effective resource sharing, however,
remains an open challenge due to the adverse effects that resource contention
can have on high-priority, user-facing workloads with strict Quality of Service
(QoS) requirements. Although recent approaches have demonstrated promising
results, those works remain largely impractical in public cloud environments
since workloads are not known in advance and may only run for a brief period,
thus prohibiting offline learning and significantly hindering online learning.
In this paper, we propose RAPID, a novel framework for fast, fully-online
resource allocation policy learning in highly dynamic operating environments.
RAPID leverages lightweight QoS predictions, enabled by
domain-knowledge-inspired techniques for sample efficiency and bias reduction,
to decouple control from conventional feedback sources and guide policy
learning at a rate orders of magnitude faster than prior work. Evaluation on a
real-world server platform with representative cloud workloads confirms that
RAPID can learn stable resource allocation policies in minutes, as compared
with hours in prior state-of-the-art, while improving QoS by 9.0x and
increasing best-effort workload performance by 19-43%.
Related papers
- Topology-aware Preemptive Scheduling for Co-located LLM Workloads [7.240168647854797]
We develop a fine-grained topology-aware method for scheduling of hybrid workloads.
This method significantly increases the efficiency of preemption and improves overall scheduled performance for LLM workloads by 55%.
arXiv Detail & Related papers (2024-11-18T13:26:09Z) - Reinforcement Learning-Based Adaptive Load Balancing for Dynamic Cloud Environments [0.0]
We propose a novel adaptive load balancing framework using Reinforcement Learning (RL) to address these challenges.
Our framework is designed to dynamically reallocate tasks to minimize latency and ensure balanced resource usage across servers.
Experimental results show that the proposed RL-based load balancer outperforms traditional algorithms in terms of response time, resource utilization, and adaptability to changing workloads.
arXiv Detail & Related papers (2024-09-07T19:40:48Z) - An Advanced Reinforcement Learning Framework for Online Scheduling of Deferrable Workloads in Cloud Computing [37.457951933256055]
We propose an online deferrable job scheduling method called Online Scheduling for DEferrable jobs in Cloud (OSDEC), in which a deep reinforcement learning model learns the scheduling policy.
The proposed method plans the deployment schedule well, achieving a short waiting time for users while maintaining high resource utilization for the platform.
arXiv Detail & Related papers (2024-06-03T06:55:26Z) - Small Dataset, Big Gains: Enhancing Reinforcement Learning by Offline
Pre-Training with Model Based Augmentation [59.899714450049494]
Offline pre-training can produce sub-optimal policies and lead to degraded online reinforcement learning performance.
We propose a model-based data augmentation strategy to maximize the benefits of offline reinforcement learning pre-training and reduce the scale of data needed to be effective.
arXiv Detail & Related papers (2023-12-15T14:49:41Z) - Adaptive Resource Allocation for Virtualized Base Stations in O-RAN with
Online Learning [60.17407932691429]
Open Radio Access Network systems, with their virtualized base stations (vBSs), offer operators the benefits of increased flexibility, reduced costs, vendor diversity, and interoperability.
We propose an online learning algorithm that balances the effective throughput and vBS energy consumption, even under unforeseeable and "challenging" environments.
We prove the proposed solutions achieve sub-linear regret, providing zero average optimality gap even in challenging environments.
arXiv Detail & Related papers (2023-09-04T17:30:21Z) - The Cost of Learning: Efficiency vs. Efficacy of Learning-Based RRM for
6G [10.28841351455586]
Deep Reinforcement Learning (DRL) has become a valuable solution to automatically learn efficient resource management strategies in complex networks.
In many scenarios, the learning task is performed in the Cloud, while experience samples are generated directly by edge nodes or users.
This creates friction, because speeding up convergence towards an effective strategy requires allocating resources to transmit learning samples.
We propose a dynamic balancing strategy between the learning and data planes, which allows the centralized learning agent to quickly converge to an efficient resource allocation strategy.
arXiv Detail & Related papers (2022-11-30T11:26:01Z) - Actively Learning Costly Reward Functions for Reinforcement Learning [56.34005280792013]
We show that it is possible to train agents in complex real-world environments orders of magnitude faster.
By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions.
arXiv Detail & Related papers (2022-11-23T19:17:20Z) - PROMPT: Learning Dynamic Resource Allocation Policies for Network
Applications [16.812611987082082]
We propose PROMPT, a novel resource allocation framework using proactive prediction to guide a reinforcement learning controller.
We show that PROMPT incurs 4.2x fewer violations, reduces severity of policy violations by 12.7x, improves best-effort workload performance, and improves overall power efficiency over prior work.
arXiv Detail & Related papers (2022-01-19T23:34:34Z) - MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch
Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization.
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z) - Coordinated Online Learning for Multi-Agent Systems with Coupled
Constraints and Perturbed Utility Observations [91.02019381927236]
We introduce a novel method to steer the agents toward a stable population state, fulfilling the given resource constraints.
The proposed method is a decentralized resource pricing method based on the resource loads resulting from the augmentation of the game's Lagrangian.
arXiv Detail & Related papers (2020-10-21T10:11:17Z) - Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep
Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors.
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.