DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling
- URL: http://arxiv.org/abs/2105.07526v1
- Date: Sun, 16 May 2021 21:56:31 GMT
- Title: DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling
- Authors: Yuping Fan and Zhiling Lan
- Abstract summary: We present a reinforcement learning based HPC scheduling framework named DRAS-CQSim to automatically learn an optimal scheduling policy.
DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, which allows system administrators to quickly obtain customized scheduling policies.
- Score: 0.9529163786034884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For decades, system administrators have been striving to design and tune
cluster scheduling policies to improve the performance of high performance
computing (HPC) systems. However, the increasingly complex HPC systems combined
with highly diverse workloads make such a manual process challenging,
time-consuming, and error-prone. We present a reinforcement learning based HPC
scheduling framework named DRAS-CQSim to automatically learn an optimal scheduling
policy. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter
tuning options, and different reinforcement learning algorithms, which allows
the system administrators to quickly obtain customized scheduling policies.
Related papers
- Reinforcement Learning for Adaptive Resource Scheduling in Complex System Environments [8.315191578007857]
This study presents a novel computer system performance optimization and adaptive workload management scheduling algorithm based on Q-learning.
Q-learning, a reinforcement learning algorithm, continuously learns from system state changes, enabling dynamic scheduling and resource optimization.
This research provides a foundation for the integration of AI-driven adaptive scheduling in future large-scale systems, offering a scalable, intelligent solution to enhance system performance, reduce operating costs, and support sustainable energy consumption.
arXiv Detail & Related papers (2024-11-08T05:58:09Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe to convert static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion Control in Real Networks [63.24965775030673]
We propose a novel Reinforcement Learning (RL) approach to design generic Congestion Control (CC) algorithms.
Our solution, MARLIN, uses the Soft Actor-Critic algorithm to maximize both entropy and return.
We trained MARLIN on a real network with varying background traffic patterns to overcome the sim-to-real mismatch.
arXiv Detail & Related papers (2023-02-02T18:27:20Z)
- Distributed-Training-and-Execution Multi-Agent Reinforcement Learning for Power Control in HetNet [48.96004919910818]
We propose a multi-agent deep reinforcement learning (MADRL) based power control scheme for the HetNet.
To promote cooperation among agents, we develop a penalty-based Q learning (PQL) algorithm for MADRL systems.
In this way, an agent's policy can be learned by other agents more easily, resulting in a more efficient collaboration process.
arXiv Detail & Related papers (2022-12-15T17:01:56Z)
- Multi-level Explanation of Deep Reinforcement Learning-based Scheduling [3.043569093713764]
Dependency-aware job scheduling in the cluster is NP-hard.
Recent work shows that Deep Reinforcement Learning (DRL) is capable of solving it.
In this paper, we give a multi-level explanation framework to interpret the policy of DRL-based scheduling.
arXiv Detail & Related papers (2022-09-18T13:22:53Z)
- MCDS: AI Augmented Workflow Scheduling in Mobile Edge Cloud Computing Systems [12.215537834860699]
Recently proposed scheduling methods leverage the low response times of edge computing platforms to optimize application Quality of Service (QoS).
We propose MCDS: Monte Carlo Learning using Deep Surrogate Models to efficiently schedule workflow applications in mobile edge-cloud computing systems.
arXiv Detail & Related papers (2021-12-14T10:00:01Z)
- Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z)
- Deep Reinforcement Agent for Scheduling in HPC [1.6569798882223303]
The cluster scheduler determines when and which user jobs should be allocated to available system resources.
In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning.
arXiv Detail & Related papers (2021-02-11T20:08:38Z)
- Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z)
- Online Reinforcement Learning Control by Direct Heuristic Dynamic Programming: from Time-Driven to Event-Driven [80.94390916562179]
Time-driven learning refers to the machine learning method that updates parameters in a prediction model continuously as new data arrives.
It is desirable to prevent the time-driven dHDP from updating due to insignificant system events such as noise.
We show how the event-driven dHDP algorithm works in comparison to the original time-driven dHDP.
arXiv Detail & Related papers (2020-06-16T05:51:25Z)
- DeepSoCS: A Neural Scheduler for Heterogeneous System-on-Chip (SoC) Resource Scheduling [0.0]
We present a novel scheduling solution for a class of System-on-Chip (SoC) systems.
Our Deep Reinforcement Learning (DRL)-based Scheduler (DeepSoCS) overcomes the brittleness of rule-based schedulers.
arXiv Detail & Related papers (2020-05-15T17:31:27Z)