Deep Reinforcement Agent for Scheduling in HPC
- URL: http://arxiv.org/abs/2102.06243v1
- Date: Thu, 11 Feb 2021 20:08:38 GMT
- Title: Deep Reinforcement Agent for Scheduling in HPC
- Authors: Yuping Fan, Zhiling Lan, Taylor Childers, Paul Rich, William Allcock
and Michael E. Papka
- Abstract summary: The cluster scheduler determines when and which user jobs should be allocated to available system resources.
In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning.
- Score: 1.6569798882223303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The cluster scheduler is crucial in high-performance computing (HPC). It
determines when and which user jobs should be allocated to available system
resources. Existing cluster scheduling heuristics are developed by human
experts based on their experience with specific HPC systems and workloads.
However, the increasing complexity of computing systems and the highly dynamic
nature of application workloads have placed a tremendous burden on manually
designed and tuned scheduling heuristics. More aggressive optimization and
automation are needed for cluster scheduling in HPC. In this work, we present
an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for
Scheduling) by leveraging deep reinforcement learning. DRAS is built on a
novel, hierarchical neural network incorporating special HPC scheduling
features such as resource reservation and backfilling. A unique training
strategy is presented to enable DRAS to rapidly learn the target environment.
Once provided with a specific scheduling objective by the system manager,
DRAS automatically learns to improve its policy through interaction with the
scheduling environment and dynamically adjusts its policy as workload changes.
Experiments with different production workloads demonstrate that DRAS
outperforms existing heuristic and optimization approaches by up to 45%.
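To make the setup concrete, below is a minimal sketch of the agent-in-the-loop scheduler the abstract describes: a policy scores queued jobs, the top choices are started, and a head-of-queue job that does not fit gets a reservation while smaller jobs backfill. The `score` function stands in for the learned policy network; all names and the toy heuristic are illustrative assumptions, not the paper's implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Job:
    id: int
    nodes: int       # nodes requested
    walltime: float  # estimated runtime in hours

def score(job, free_nodes):
    # Stand-in for the learned policy: higher score = start sooner.
    # DRAS would compute this from job and system-state features.
    return -job.walltime / max(job.nodes, 1)

def schedule_step(queue, free_nodes):
    # Start jobs in policy order; the first job that does not fit gets a
    # reservation while smaller jobs continue to start (simplified backfill).
    started, reserved = [], None
    for job in sorted(queue, key=lambda j: score(j, free_nodes), reverse=True):
        if job.nodes <= free_nodes:
            started.append(job)
            free_nodes -= job.nodes
        elif reserved is None:
            reserved = job
    return started, reserved

queue = [Job(i, random.randint(1, 64), random.uniform(0.5, 12)) for i in range(8)]
started, reserved = schedule_step(queue, free_nodes=128)
print([j.id for j in started], getattr(reserved, "id", None))
```

In DRAS the scoring would come from the hierarchical neural network trained against the system manager's objective; the sketch only fixes the surrounding control flow.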
Related papers
- Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters [24.845122459974466]
This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm.
By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models, A-SRPT strategically assigns jobs to the available GPUs.
A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive "shortest-remaining-processing-time-first" strategy.
arXiv Detail & Related papers (2025-01-09T20:19:01Z)
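For reference, the preemptive shortest-remaining-processing-time-first rule the entry above reduces to can be simulated in a few lines; the job data are toy values, and the graph-based reduction from GPU jobs to this single-machine instance (the paper's actual contribution) is not shown.

```python
import heapq

def srpt(jobs):
    """Preemptive SRPT on one machine.
    jobs: list of (arrival_time, processing_time). Returns completion times."""
    jobs = sorted(jobs)                # by arrival time
    heap, t, i, done = [], 0.0, 0, {}
    while heap or i < len(jobs):
        if not heap:                   # machine idle: jump to next arrival
            t = max(t, jobs[i][0])
        while i < len(jobs) and jobs[i][0] <= t:
            heapq.heappush(heap, (jobs[i][1], i))  # (remaining work, job id)
            i += 1
        rem, jid = heapq.heappop(heap)
        # run the shortest job until it finishes or the next arrival preempts it
        horizon = jobs[i][0] if i < len(jobs) else float("inf")
        run = min(rem, horizon - t)
        t += run
        if run < rem:
            heapq.heappush(heap, (rem - run, jid))
        else:
            done[jid] = t
    return done

print(srpt([(0, 5), (1, 2), (2, 1)]))  # short late arrivals preempt the long job
```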
- Cluster-Based Multi-Agent Task Scheduling for Space-Air-Ground Integrated Networks [60.085771314013044]
Low-altitude economy holds significant potential for development in areas such as communication and sensing.
We propose a Clustering-based Multi-agent Deep Deterministic Policy Gradient (CMADDPG) algorithm to address the multi-UAV cooperative task scheduling challenges in SAGIN.
arXiv Detail & Related papers (2024-12-14T06:17:33Z)
- Reinforcement Learning for Adaptive Resource Scheduling in Complex System Environments [8.315191578007857]
This study presents a novel scheduling algorithm, based on Q-learning, for computer-system performance optimization and adaptive workload management.
By contrast, Q-learning, a reinforcement learning algorithm, continuously learns from system state changes, enabling dynamic scheduling and resource optimization.
This research provides a foundation for the integration of AI-driven adaptive scheduling in future large-scale systems, offering a scalable, intelligent solution to enhance system performance, reduce operating costs, and support sustainable energy consumption.
arXiv Detail & Related papers (2024-11-08T05:58:09Z)
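The mechanism named in the entry above is the standard tabular Q-learning update; a generic sketch follows, with the encoding of real scheduler states and actions left abstract (the string states below are placeholders, not from the paper).

```python
import random
from collections import defaultdict

# Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
Q = defaultdict(float)
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def choose(state, actions):
    if random.random() < EPS:                         # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

def update(state, action, reward, next_state, next_actions):
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Placeholder usage: states/actions for a real scheduler would be encoded features.
a = choose("queue_short", ["run_small", "run_large"])
update("queue_short", a, reward=1.0, next_state="queue_empty", next_actions=["idle"])
```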
- Learning Logic Specifications for Policy Guidance in POMDPs: an Inductive Logic Programming Approach [57.788675205519986]
We learn high-quality traces from POMDP executions generated by any solver.
We exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications.
We show that learned specifications expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics, within lower computational time.
arXiv Detail & Related papers (2024-02-29T15:36:01Z)
- Dynamic Scheduling for Federated Edge Learning with Streaming Data [56.91063444859008]
We consider a Federated Edge Learning (FEEL) system where training data are randomly generated over time at a set of distributed edge devices with long-term energy constraints.
Due to limited communication resources and latency requirements, only a subset of devices is scheduled for participating in the local training process in every iteration.
arXiv Detail & Related papers (2023-05-02T07:41:16Z)
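A minimal sketch of the per-iteration decision the entry above describes: select a bounded subset of edge devices for local training, skipping energy-depleted ones. The scoring rule is an illustrative stand-in, not the paper's policy.

```python
def schedule_round(devices, k):
    """Pick at most k devices for this training round.
    devices: list of dicts with 'id', 'new_samples', 'energy_left'.
    Favors fresh streaming data, skips depleted devices; the trade-off
    is a toy stand-in for the paper's long-term energy-aware policy."""
    eligible = [d for d in devices if d["energy_left"] > 0]
    ranked = sorted(eligible,
                    key=lambda d: d["new_samples"] * min(d["energy_left"], 1.0),
                    reverse=True)
    return [d["id"] for d in ranked[:k]]

devices = [
    {"id": 0, "new_samples": 120, "energy_left": 0.8},
    {"id": 1, "new_samples": 40,  "energy_left": 0.1},
    {"id": 2, "new_samples": 90,  "energy_left": 0.0},  # depleted: skipped
]
print(schedule_round(devices, k=2))  # -> [0, 1]
```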
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- HARL: Hierarchical Adaptive Reinforcement Learning Based Auto Scheduler for Neural Networks [51.71682428015139]
We propose HARL, a reinforcement learning-based auto-scheduler for efficient tensor program exploration.
HARL improves tensor operator performance by 22% and search speed by 4.3x compared to the state-of-the-art auto-scheduler.
Inference performance and search speed are also significantly improved on end-to-end neural networks.
arXiv Detail & Related papers (2022-11-21T04:15:27Z)
- DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling [0.9529163786034884]
We present a reinforcement learning based HPC scheduling framework named DRAS-CQSim to automatically learn optimal scheduling policy.
DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, which allows system administrators to quickly obtain customized scheduling policies.
arXiv Detail & Related papers (2021-05-16T21:56:31Z)
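DRAS-CQSim is described as bundling environments, agents, and RL algorithms behind one interface; the snippet below shows the gym-style driving loop such frameworks typically expose. Every class and method name here is hypothetical, not DRAS-CQSim's actual API.

```python
# Hypothetical gym-style loop; none of these names are DRAS-CQSim's real API.
class ToySchedulerEnv:
    def reset(self):
        self.t = 0
        return {"queue_len": 5, "free_nodes": 128}      # toy observation
    def step(self, action):
        self.t += 1
        reward = -abs(action - 2)                       # toy objective
        return {"queue_len": 5 - self.t, "free_nodes": 128}, reward, self.t >= 5

env = ToySchedulerEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    action = 2                       # a trained agent's policy would act here
    obs, reward, done = env.step(action)
    total += reward
print("episode return:", total)
```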
- Smart Scheduling based on Deep Reinforcement Learning for Cellular Networks [18.04856086228028]
We propose a smart scheduling scheme based on deep reinforcement learning (DRL).
We provide implementation-friendly designs, i.e., a scalable neural network design for the agent and a virtual environment training framework.
We show that the DRL-based smart scheduling outperforms the conventional scheduling method and can be adopted in practical systems.
arXiv Detail & Related papers (2021-03-22T02:09:16Z)
- Online Reinforcement Learning Control by Direct Heuristic Dynamic Programming: from Time-Driven to Event-Driven [80.94390916562179]
Time-driven learning refers to the machine learning method that updates parameters in a prediction model continuously as new data arrives.
It is desirable to prevent the time-driven dHDP from updating due to insignificant system events such as noise.
We show how the event-driven dHDP algorithm works in comparison to the original time-driven dHDP.
arXiv Detail & Related papers (2020-06-16T05:51:25Z)
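The time-driven versus event-driven distinction in the entry above comes down to when the learning update fires; a schematic sketch of an event trigger follows, with the threshold rule being an illustrative assumption.

```python
def event_trigger(x_now, x_ref, threshold):
    # Event-driven gate: fire only when the state deviates enough from the
    # last reference state, so noise-level changes are ignored.
    return sum((a - b) ** 2 for a, b in zip(x_now, x_ref)) > threshold ** 2

x_ref = [0.0, 0.0]
for x_now in ([0.01, 0.0], [0.02, -0.01], [0.5, 0.3]):
    if event_trigger(x_now, x_ref, threshold=0.1):
        x_ref = x_now              # time-driven dHDP would update every step;
        print("update at", x_now)  # event-driven updates only on this trigger
```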
- DeepSoCS: A Neural Scheduler for Heterogeneous System-on-Chip (SoC) Resource Scheduling [0.0]
We present a novel scheduling solution for a class of System-on-Chip (SoC) systems.
Our Deep Reinforcement Learning (DRL)-based Scheduler (DeepSoCS) overcomes the brittleness of rule-based schedulers.
arXiv Detail & Related papers (2020-05-15T17:31:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.