Decentralized Distributed Proximal Policy Optimization (DD-PPO) for High Performance Computing Scheduling on Multi-User Systems
- URL: http://arxiv.org/abs/2505.03946v1
- Date: Tue, 06 May 2025 19:50:37 GMT
- Title: Decentralized Distributed Proximal Policy Optimization (DD-PPO) for High Performance Computing Scheduling on Multi-User Systems
- Authors: Matthew Sgambati, Aleksandar Vakanski, Matthew Anderson
- Abstract summary: This study introduces a novel RL-based scheduler utilizing the Decentralized Distributed Proximal Policy Optimization (DD-PPO) algorithm. The DD-PPO algorithm supports large-scale distributed training across multiple workers without requiring parameter synchronization at every step. Validation used over 11.5 million real HPC job traces to compare DD-PPO against traditional and advanced scheduling approaches.
- Score: 45.62643537023675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Resource allocation in High Performance Computing (HPC) environments presents a complex and multifaceted challenge for job scheduling algorithms. Beyond the efficient allocation of system resources, schedulers must account for and optimize multiple performance metrics, including job wait time and system utilization. While traditional rule-based scheduling algorithms dominate the current deployments of HPC systems, the increasing heterogeneity and scale of those systems are expected to challenge the efficiency and flexibility of those algorithms in minimizing job wait time and maximizing utilization. Recent research efforts have focused on leveraging advancements in Reinforcement Learning (RL) to develop more adaptable and intelligent scheduling strategies. These RL-based scheduling approaches have explored a range of algorithms, from Deep Q-Networks (DQN) to Proximal Policy Optimization (PPO), and more recently, hybrid methods that integrate Graph Neural Networks with RL techniques. However, a common limitation across these methods is their reliance on relatively small datasets, and they face scalability issues when trained on large datasets. This study introduces a novel RL-based scheduler utilizing the Decentralized Distributed Proximal Policy Optimization (DD-PPO) algorithm, which supports large-scale distributed training across multiple workers without requiring parameter synchronization at every step. By eliminating reliance on centralized updates to a shared policy, the DD-PPO scheduler enhances scalability, training efficiency, and sample utilization. Validation used a dataset of over 11.5 million real HPC job traces to compare DD-PPO against traditional and advanced scheduling approaches, and the experimental results demonstrate improved scheduling performance compared to both rule-based schedulers and existing RL-based scheduling algorithms.
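As a rough sketch of the mechanism the abstract describes, the Python snippet below shows a generic DD-PPO-style update: each worker computes the clipped PPO loss on its own rollouts and coordinates with peers only through a gradient all-reduce once per update, rather than routing parameters through a central server at every environment step. All names and batch fields here are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a DD-PPO-style update (generic, not the paper's code).
import torch
import torch.distributed as dist
import torch.nn as nn

def ppo_clip_loss(new_logp, old_logp, adv, clip_eps=0.2):
    # Standard PPO clipped surrogate objective (negated for minimization).
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(ratio * adv, clipped).mean()

def ddppo_update(policy: nn.Module, optimizer, batch):
    # Local forward/backward pass on this worker's own experience.
    logits = policy(batch["obs"])                            # (B, A)
    logp = torch.log_softmax(logits, dim=-1)
    new_logp = logp.gather(1, batch["actions"]).squeeze(1)   # actions: (B, 1)

    loss = ppo_clip_loss(new_logp, batch["old_logp"], batch["adv"])
    optimizer.zero_grad()
    loss.backward()

    # Decentralized synchronization: average gradients across workers.
    # Every replica applies the same averaged gradient, so all copies
    # of the policy stay identical without a parameter server.
    world = dist.get_world_size()
    for p in policy.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
    optimizer.step()
```

Since the all-reduce is the only coordination point, adding workers does not create a central bottleneck; the original DD-PPO algorithm also includes a straggler-preemption mechanism that this sketch omits.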
Related papers
- Research on Edge Computing and Cloud Collaborative Resource Scheduling Optimization Based on Deep Reinforcement Learning [11.657154571216234]
This study addresses the challenge of resource scheduling optimization in edge-cloud collaborative computing using deep reinforcement learning (DRL).
The proposed DRL-based approach improves task processing efficiency, reduces overall processing time, enhances resource utilization, and effectively controls task migrations.
arXiv Detail & Related papers (2025-02-26T03:05:11Z)
- Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters [24.845122459974466]
This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm.
By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models, A-SRPT strategically assigns jobs to the available GPUs.
A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive "shortest-remaining-processing-time-first" strategy.
arXiv Detail & Related papers (2025-01-09T20:19:01Z)
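The preemptive rule at the heart of the A-SRPT entry above is classic shortest-remaining-processing-time-first. Below is a minimal, self-contained single-machine simulation of that rule; this is a generic textbook sketch, not the paper's algorithm.

```python
# Toy preemptive SRPT simulation on one machine: always run the job
# with the least remaining work, preempting when a shorter job arrives.
import heapq

def srpt_completion_times(jobs):
    """jobs: list of (arrival, size); returns {job_id: completion_time}."""
    events = sorted((a, i, s) for i, (a, s) in enumerate(jobs))
    ready = []                 # min-heap of (remaining_work, job_id)
    done, t, k = {}, 0.0, 0
    while k < len(events) or ready:
        if not ready:          # machine idle: jump to the next arrival
            t = max(t, events[k][0])
        while k < len(events) and events[k][0] <= t:
            a, i, s = events[k]; k += 1
            heapq.heappush(ready, (s, i))
        rem, i = heapq.heappop(ready)
        # Run the shortest job until it finishes or the next arrival,
        # which triggers a preemption check.
        next_arr = events[k][0] if k < len(events) else float("inf")
        run = min(rem, next_arr - t)
        t += run
        if run < rem:
            heapq.heappush(ready, (rem - run, i))   # preempted
        else:
            done[i] = t
    return done

# Example: the long job is preempted when a short one arrives at t=1.
print(srpt_completion_times([(0.0, 10.0), (1.0, 2.0)]))  # {1: 3.0, 0: 12.0}
```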
- Resource-Adaptive Successive Doubling for Hyperparameter Optimization with Large Datasets on High-Performance Computing Systems [0.4334105740533729]
This article proposes a novel Resource-Adaptive Successive Doubling Algorithm (RASDA).
It combines a resource-adaptive successive doubling scheme with the plain Asynchronous Successive Halving Algorithm (ASHA).
It is applied to different types of Neural Networks (NNs) and trained on large datasets from the Computer Vision (CV), Computational Fluid Dynamics (CFD), and Additive Manufacturing (AM) domains.
arXiv Detail & Related papers (2024-12-03T11:25:48Z)
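The successive halving idea underlying ASHA, which the RASDA entry above builds on, fits in a few lines: train all surviving configurations at the current budget, keep the best fraction, and double the budget. The snippet below is a synchronous toy version (ASHA itself promotes configurations asynchronously, and RASDA additionally scales compute resources); `toy_score` is a made-up stand-in for a budgeted training run.

```python
# Toy synchronous successive halving: keep the top 1/eta configs per
# round while the training budget doubles.
import random

def successive_halving(configs, min_budget=1, eta=2, rounds=3, score_fn=None):
    budget = min_budget
    survivors = list(configs)
    for _ in range(rounds):
        # Lower score is better; evaluate everyone at the current budget.
        scored = sorted(survivors, key=lambda c: score_fn(c, budget))
        survivors = scored[: max(1, len(scored) // eta)]  # keep the best
        budget *= eta                                     # double budget
    return survivors[0]

# Toy usage: "configs" are learning rates, score is a stand-in loss
# that improves with budget and is minimized near lr = 0.01.
def toy_score(lr, budget):
    return abs(lr - 0.01) / budget + random.uniform(0, 0.001)

best = successive_halving([10**-i for i in range(1, 6)],
                          min_budget=1, eta=2, rounds=3,
                          score_fn=toy_score)
print("selected config:", best)
```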
- Deep reinforcement learning for machine scheduling: Methodology, the state-of-the-art, and future directions [2.4541568670428915]
Machine scheduling aims to optimize job assignments to machines while adhering to manufacturing rules and job specifications.
Deep Reinforcement Learning (DRL), a key component of artificial general intelligence, has shown promise in various domains like gaming and robotics.
This paper offers a comprehensive review and comparison of DRL-based approaches, highlighting their methodology, applications, advantages, and limitations.
arXiv Detail & Related papers (2023-10-04T22:45:09Z)
- Dynamic Scheduling for Federated Edge Learning with Streaming Data [56.91063444859008]
We consider a Federated Edge Learning (FEEL) system where training data are randomly generated over time at a set of distributed edge devices with long-term energy constraints.
Due to limited communication resources and latency requirements, only a subset of devices is scheduled for participating in the local training process in every iteration.
arXiv Detail & Related papers (2023-05-02T07:41:16Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
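A generic variance-penalized off-policy objective conveys the idea in the entry above. This is a hedged toy form, not OVAR's estimator, which uses stationary distribution corrections and Fenchel duality:

```python
# Toy variance-regularized off-policy objective (to be maximized).
import torch

def variance_regularized_objective(returns, weights, lam=0.1):
    # Off-policy value estimate from importance-weighted returns.
    weighted = weights * returns
    value = weighted.mean()
    # Variance penalty; differentiating this term naively requires two
    # independent sample sets (the "double sampling" issue that the
    # paper's Fenchel-duality construction is designed to avoid).
    penalty = weighted.var(unbiased=False)
    return value - lam * penalty
```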
- A Memetic Algorithm with Reinforcement Learning for Sociotechnical Production Scheduling [0.0]
This article presents a memetic algorithm that applies deep reinforcement learning (DRL) to flexible job shop scheduling problems with dual resource constraints (DRC-FJSSP).
From research projects in industry, we recognize the need to consider flexible machines, flexible human workers, worker capabilities, setup and processing operations, material arrival times, complex job paths with parallel tasks for bill of material manufacturing, sequence-dependent setup times, and (partially) automated tasks in human-machine collaboration.
arXiv Detail & Related papers (2022-12-21T11:24:32Z)
- Semantic-Aware Collaborative Deep Reinforcement Learning Over Wireless Cellular Networks [82.02891936174221]
Collaborative deep reinforcement learning (CDRL), in which multiple agents coordinate over a wireless network, is a promising approach.
In this paper, a novel semantic-aware CDRL method is proposed to enable a group of untrained agents with semantically-linked DRL tasks to collaborate efficiently across a resource-constrained wireless cellular network.
arXiv Detail & Related papers (2021-11-23T18:24:47Z)
- A Two-stage Framework and Reinforcement Learning-based Optimization Algorithms for Complex Scheduling Problems [54.61091936472494]
We develop a two-stage framework in which reinforcement learning (RL) and traditional operations research (OR) algorithms are combined.
The scheduling problem is solved in two stages: a finite Markov decision process (MDP) followed by a mixed-integer programming process.
Results show that the proposed algorithms could stably and efficiently obtain satisfactory scheduling schemes for agile Earth observation satellite scheduling problems.
arXiv Detail & Related papers (2021-03-10T03:16:12Z)
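The two-stage pattern in the entry above, where a learned policy proposes a schedule and an exact solver refines it, can be mocked up as below. The greedy scorer and brute-force window refinement are illustrative stand-ins for the paper's MDP and mixed-integer programming stages, not its actual formulation.

```python
# Toy two-stage scheduler: learned ordering, then exact local refinement.
from itertools import permutations

def stage1_policy_order(tasks, score):
    """MDP-style stage: greedily order tasks by a learned score."""
    return sorted(tasks, key=score, reverse=True)

def stage2_exact_refine(ordered, window, cost):
    """OR-style stage: exactly re-optimize a small window of the plan."""
    head, tail = ordered[:window], ordered[window:]
    best = min(permutations(head), key=lambda p: cost(list(p) + tail))
    return list(best) + tail

# Toy usage: tasks are (id, duration); cost is total completion time.
tasks = [(1, 3.0), (2, 1.0), (3, 2.0), (4, 5.0)]

def cost(seq):
    t, total = 0.0, 0.0
    for _, duration in seq:
        t += duration
        total += t
    return total

plan = stage1_policy_order(tasks, score=lambda job: -job[1])
plan = stage2_exact_refine(plan, window=3, cost=cost)
print(plan)
```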
- Reinforcement Learning for Datacenter Congestion Control [50.225885814524304]
Successful congestion control algorithms can dramatically improve latency and overall network throughput.
To date, no such learning-based algorithms have shown practical potential in this domain.
We devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks.
We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training.
arXiv Detail & Related papers (2021-02-18T13:49:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.