Deep Reinforcement Agent for Scheduling in HPC
- URL: http://arxiv.org/abs/2102.06243v1
- Date: Thu, 11 Feb 2021 20:08:38 GMT
- Title: Deep Reinforcement Agent for Scheduling in HPC
- Authors: Yuping Fan, Zhiling Lan, Taylor Childers, Paul Rich, William Allcock
and Michael E. Papka
- Abstract summary: The cluster scheduler determines when and which user jobs should be allocated to available system resources.
In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning.
- Score: 1.6569798882223303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The cluster scheduler is crucial in high-performance computing (HPC). It
determines when and which user jobs should be allocated to available system
resources. Existing cluster scheduling heuristics are developed by human
experts based on their experience with specific HPC systems and workloads.
However, the increasing complexity of computing systems and the highly dynamic
nature of application workloads have placed a tremendous burden on manually
designed and tuned scheduling heuristics. More aggressive optimization and
automation are needed for cluster scheduling in HPC. In this work, we present
an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for
Scheduling) by leveraging deep reinforcement learning. DRAS is built on a
novel, hierarchical neural network incorporating special HPC scheduling
features such as resource reservation and backfilling. A unique training
strategy is presented to enable DRAS to rapidly learn the target environment.
Once provided with a specific scheduling objective by the system manager,
DRAS automatically learns to improve its policy through interaction with the
scheduling environment and dynamically adjusts its policy as workload changes.
Experiments with different production workloads demonstrate that DRAS
outperforms existing heuristic and optimization approaches by up to 45%.
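The abstract includes no code; as a rough, hypothetical illustration of learning a scheduling policy by interaction, the sketch below scores a fixed window of queued jobs with a small policy network and updates it with plain REINFORCE. The feature set, window size, reward, and single-level network are assumptions made for this sketch; the actual DRAS uses a hierarchical network with resource reservation and backfilling.

```python
# Minimal sketch of a DRL job-selection loop in the spirit of DRAS.
# Hypothetical simplification: a single policy network scores a fixed-size
# window of queued jobs, and plain REINFORCE updates its weights.
import torch
import torch.nn as nn

WINDOW_SIZE = 16   # jobs visible to the agent per decision (assumed)
JOB_FEATURES = 4   # e.g. requested nodes, walltime, queue wait, user share

policy = nn.Sequential(
    nn.Linear(JOB_FEATURES, 32), nn.ReLU(),
    nn.Linear(32, 1),            # one score per queued job
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def select_job(job_window):
    """Score each queued job and sample one to schedule next."""
    scores = policy(job_window).squeeze(-1)            # (WINDOW_SIZE,)
    dist = torch.distributions.Categorical(logits=scores)
    action = dist.sample()
    return action, dist.log_prob(action)

def reinforce_update(log_probs, rewards):
    """Plain REINFORCE: raise log-probs of actions followed by high return."""
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# One toy episode against a random stand-in for a scheduling simulator.
log_probs, rewards = [], []
for _ in range(8):
    window = torch.randn(WINDOW_SIZE, JOB_FEATURES)    # stand-in for real state
    action, logp = select_job(window)
    log_probs.append(logp)
    rewards.append(-float(window[action, 2]))          # e.g. penalize queue wait
reinforce_update(log_probs, rewards)
```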
Related papers
- Reinforcement Learning for Adaptive Resource Scheduling in Complex System Environments [8.315191578007857]
This study presents a Q-learning-based algorithm for computer system performance optimization and adaptive workload scheduling.
Unlike static scheduling heuristics, Q-learning, a reinforcement learning algorithm, continuously learns from system state changes, enabling dynamic scheduling and resource optimization; a minimal sketch of this update rule follows the list below.
This research provides a foundation for the integration of AI-driven adaptive scheduling in future large-scale systems, offering a scalable, intelligent solution to enhance system performance, reduce operating costs, and support sustainable energy consumption.
arXiv Detail & Related papers (2024-11-08T05:58:09Z) - Learning Logic Specifications for Policy Guidance in POMDPs: an
Inductive Logic Programming Approach [57.788675205519986]
We learn high-quality traces from POMDP executions generated by any solver.
We exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications.
We show that the learned specifications, expressed in Answer Set Programming (ASP), yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics, within lower computational time.
arXiv Detail & Related papers (2024-02-29T15:36:01Z) - Dynamic Scheduling for Federated Edge Learning with Streaming Data [56.91063444859008]
We consider a Federated Edge Learning (FEEL) system where training data are randomly generated over time at a set of distributed edge devices with long-term energy constraints.
Due to limited communication resources and latency requirements, only a subset of devices is scheduled for participating in the local training process in every iteration.
arXiv Detail & Related papers (2023-05-02T07:41:16Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - A Memetic Algorithm with Reinforcement Learning for Sociotechnical
Production Scheduling [0.0]
This article presents a memetic algorithm applying deep reinforcement learning (DRL) to the dual resource constrained flexible job shop scheduling problem (DRC-FJSSP).
From research projects in industry, we recognize the need to consider flexible machines, flexible human workers, worker capabilities, setup and processing operations, material arrival times, complex job paths with parallel tasks for bill-of-material manufacturing, sequence-dependent setup times, and (partially) automated tasks in human-machine collaboration.
arXiv Detail & Related papers (2022-12-21T11:24:32Z) - HARL: Hierarchical Adaptive Reinforcement Learning Based Auto Scheduler
for Neural Networks [51.71682428015139]
We propose HARL, a reinforcement learning-based auto-scheduler for efficient tensor program exploration.
HARL improves the tensor operator performance by 22% and the search speed by 4.3x compared to the state-of-the-art auto-scheduler.
Inference performance and search speed are also significantly improved on end-to-end neural networks.
arXiv Detail & Related papers (2022-11-21T04:15:27Z) - DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster
Scheduling [0.9529163786034884]
We present a reinforcement learning based HPC scheduling framework named DRAS-CQSim to automatically learn optimal scheduling policies.
DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, which allows system administrators to quickly obtain customized scheduling policies.
arXiv Detail & Related papers (2021-05-16T21:56:31Z) - Smart Scheduling based on Deep Reinforcement Learning for Cellular
Networks [18.04856086228028]
We propose a smart scheduling scheme based on deep reinforcement learning (DRL).
We provide implementation-friendly designs, i.e., a scalable neural network design for the agent and a virtual environment training framework.
We show that the DRL-based smart scheduling outperforms the conventional scheduling method and can be adopted in practical systems.
arXiv Detail & Related papers (2021-03-22T02:09:16Z) - Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud
System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z) - Online Reinforcement Learning Control by Direct Heuristic Dynamic
Programming: from Time-Driven to Event-Driven [80.94390916562179]
Time-driven learning refers to the machine learning method that updates parameters in a prediction model continuously as new data arrives.
It is desirable to prevent the time-driven dHDP from updating due to insignificant system events such as noise.
We show how the event-driven dHDP algorithm works in comparison to the original time-driven dHDP; a toy sketch of such event-gated updates appears after this list.
arXiv Detail & Related papers (2020-06-16T05:51:25Z) - DeepSoCS: A Neural Scheduler for Heterogeneous System-on-Chip (SoC)
Resource Scheduling [0.0]
We present a novel scheduling solution for a class of System-on-Chip (SoC) systems.
Our Deep Reinforcement Learning (DRL)-based Scheduler (DeepSoCS) overcomes the brittleness of rule-based schedulers.
arXiv Detail & Related papers (2020-05-15T17:31:27Z)
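As a companion to the Q-learning entry at the top of this list, here is a minimal tabular sketch of the update rule it describes. The state encoding, action set, and reward below are hypothetical placeholders, not the paper's actual formulation.

```python
# Minimal tabular Q-learning sketch for adaptive resource scheduling.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration
ACTIONS = ["run_shortest", "run_oldest", "backfill", "wait"]  # assumed actions

Q = defaultdict(float)                   # Q[(state, action)] -> value

def choose_action(state):
    """Epsilon-greedy over the discrete scheduling actions."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Toy interaction: state = (queue-length bucket, free-node bucket).
state = (2, 1)
for _ in range(100):
    action = choose_action(state)
    reward = random.uniform(-1, 1)       # stand-in for e.g. -avg_wait_time
    next_state = (random.randint(0, 3), random.randint(0, 3))
    update(state, action, reward, next_state)
    state = next_state
```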
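For the event-driven dHDP entry above, the time-driven vs. event-driven distinction can be illustrated with a toy learner that skips updates whenever a new sample looks like noise. The linear model and threshold here are invented for illustration and are not the paper's algorithm.

```python
# Toy event-gated learner: update only on "significant" prediction errors.
import numpy as np

THRESHOLD = 0.05          # minimum prediction error that counts as an event
w = np.zeros(3)           # linear prediction model: w . x ~ y

def maybe_update(x, y, lr=0.01):
    """Event-driven update: skip the step when the error is insignificant."""
    error = y - w @ x
    if abs(error) < THRESHOLD:
        return False                  # insignificant event (noise): no update
    w[:] += lr * error * x            # otherwise take a gradient-style step
    return True

rng = np.random.default_rng(0)
updates = sum(
    maybe_update(x, 0.5 * x[0] + rng.normal(scale=0.01))
    for x in rng.normal(size=(200, 3))
)
print(f"{updates}/200 samples triggered an update")
```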