Related papers: PROMPT: Learning Dynamic Resource Allocation Policies for Network Applications

PROMPT: Learning Dynamic Resource Allocation Policies for Network Applications

URL: http://arxiv.org/abs/2201.07916v2
Date: Sat, 25 Mar 2023 01:07:57 GMT
Title: PROMPT: Learning Dynamic Resource Allocation Policies for Network Applications
Authors: Drew Penney, Bin Li, Jaroslaw Sydir, Lizhong Chen, Charlie Tai, Stefan Lee, Eoin Walsh, Thomas Long
Abstract summary: We propose PROMPT, a novel resource allocation framework using proactive prediction to guide a reinforcement learning controller. We show that PROMPT incurs 4.2x fewer violations, reduces severity of policy violations by 12.7x, improves best-effort workload performance, and improves overall power efficiency over prior work.
Score: 16.812611987082082
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: A growing number of service providers are exploring methods to improve server utilization and reduce power consumption by co-scheduling high-priority latency-critical workloads with best-effort workloads. This practice requires strict resource allocation between workloads to reduce contention and maintain Quality-of-Service (QoS) guarantees. Prior work demonstrated promising opportunities to dynamically allocate resources based on workload demand, but may fail to meet QoS objectives in more stringent operating environments due to the presence of resource allocation cliffs, transient fluctuations in workload performance, and rapidly changing resource demand. We therefore propose PROMPT, a novel resource allocation framework using proactive QoS prediction to guide a reinforcement learning controller. PROMPT enables more precise resource optimization, more consistent handling of transient behaviors, and more robust generalization when co-scheduling new best-effort workloads not encountered during policy training. Evaluation shows that the proposed method incurs 4.2x fewer QoS violations, reduces severity of QoS violations by 12.7x, improves best-effort workload performance, and improves overall power efficiency over prior work.

Related papers

Network Resource Optimization for ML-Based UAV Condition Monitoring with Vibration Analysis [54.550658461477106]
Condition Monitoring (CM) uses Machine Learning (ML) models to identify abnormal and adverse conditions. This work explores the optimization of network resources for ML-based UAV CM frameworks. By leveraging dimensionality reduction techniques, there is a 99.9% reduction in network resource consumption.
arXiv Detail & Related papers (2025-02-21T14:36:12Z)
AI-Driven Resource Allocation Framework for Microservices in Hybrid Cloud Platforms [1.03590082373586]
This paper presents an AI-driven framework for resource allocation among in hybrid cloud platforms. The framework employs reinforcement learning (RL)-based resource utilization optimization to reduce costs and improve performance.
arXiv Detail & Related papers (2024-12-03T17:41:08Z)
Topology-aware Preemptive Scheduling for Co-located LLM Workloads [7.240168647854797]
We develop a fine-grained topology-aware method for scheduling of hybrid workloads. This method significantly increases the efficiency of preemption and improves overall scheduled performance for LLM workloads by $55%$.
arXiv Detail & Related papers (2024-11-18T13:26:09Z)
Meta-RTL: Reinforcement-Based Meta-Transfer Learning for Low-Resource Commonsense Reasoning [61.8360232713375]
We propose a reinforcement-based multi-source meta-transfer learning framework (Meta-RTL) for low-resource commonsense reasoning. We present a reinforcement-based approach to dynamically estimating source task weights that measure the contribution of the corresponding tasks to the target task in the meta-transfer learning. Experimental results demonstrate that Meta-RTL substantially outperforms strong baselines and previous task selection strategies.
arXiv Detail & Related papers (2024-09-27T18:22:22Z)
Reinforcement Learning-Based Adaptive Load Balancing for Dynamic Cloud Environments [0.0]
We propose a novel adaptive load balancing framework using Reinforcement Learning (RL) to address these challenges. Our framework is designed to dynamically reallocate tasks to minimize latency and ensure balanced resource usage across servers. Experimental results show that the proposed RL-based load balancer outperforms traditional algorithms in terms of response time, resource utilization, and adaptability to changing workloads.
arXiv Detail & Related papers (2024-09-07T19:40:48Z)
Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO) Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions. We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
Multi-Level ML Based Burst-Aware Autoscaling for SLO Assurance and Cost Efficiency [3.5624365288866007]
This paper introduces BAScaler, a Burst-Aware Autoscaling framework for containerized cloud services or applications under complex workloads. BAScaler incorporates a novel prediction-based burst detection mechanism that distinguishes between predictable periodic workload spikes and actual bursts.
arXiv Detail & Related papers (2024-02-20T12:28:25Z)
RAPID: Enabling Fast Online Policy Learning in Dynamic Public Cloud Environments [7.825552412435501]
We propose a novel framework for fast, fully-online resource allocation policy learning in dynamic operating environments. We show that our framework can learn stable resource allocation policies in minutes, as compared with hours in prior state-of-the-art.
arXiv Detail & Related papers (2023-04-10T18:04:39Z)
Guaranteed Dynamic Scheduling of Ultra-Reliable Low-Latency Traffic via Conformal Prediction [72.59079526765487]
The dynamic scheduling of ultra-reliable and low-latency traffic (URLLC) in the uplink can significantly enhance the efficiency of coexisting services. The main challenge is posed by the uncertainty in the process of URLLC packet generation. We introduce a novel scheduler for URLLC packets that provides formal guarantees on reliability and latency irrespective of the quality of the URLLC traffic predictor.
arXiv Detail & Related papers (2023-02-15T14:09:55Z)
Actively Learning Costly Reward Functions for Reinforcement Learning [56.34005280792013]
We show that it is possible to train agents in complex real-world environments orders of magnitudes faster. By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions.
arXiv Detail & Related papers (2022-11-23T19:17:20Z)
Coordinated Online Learning for Multi-Agent Systems with Coupled Constraints and Perturbed Utility Observations [91.02019381927236]
We introduce a novel method to steer the agents toward a stable population state, fulfilling the given resource constraints. The proposed method is a decentralized resource pricing method based on the resource loads resulting from the augmentation of the game's Lagrangian.
arXiv Detail & Related papers (2020-10-21T10:11:17Z)
ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud Resources [2.205500582481277]
We propose a Reinforcement Learning strategy for optimizing the ephemeral resources' utilization in the cloud. Our solution reduces significantly the SLA violation penalties on average by 2.7x and up to 3.4x. It also improves considerably the CPs' potential savings by 27.6% on average and up to 43.6%.
arXiv Detail & Related papers (2020-09-23T15:19:28Z)
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors. Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.