A Scalable Deep Reinforcement Learning Model for Online Scheduling
Coflows of Multi-Stage Jobs for High Performance Computing
- URL: http://arxiv.org/abs/2112.11055v1
- Date: Tue, 21 Dec 2021 09:36:55 GMT
- Title: A Scalable Deep Reinforcement Learning Model for Online Scheduling
Coflows of Multi-Stage Jobs for High Performance Computing
- Authors: Xin Wang and Hong Shen
- Abstract summary: In multi-stage jobs, each job consists of multiple coflows and is represented by a Directed Acyclic Graph (DAG).
In this paper, we propose a novel Pipelined-DAGNN to process the input and, based on it, a coflow scheduling algorithm for online multi-stage jobs.
- Score: 9.866286878494979
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coflow is a recently proposed networking abstraction to help improve the
communication performance of data-parallel computing jobs. In multi-stage jobs,
each job consists of multiple coflows and is represented by a Directed Acyclic
Graph (DAG). Efficiently scheduling coflows is critical to improve the
data-parallel computing performance in data centers. Compared with hand-tuned
scheduling heuristics, the existing work DeepWeave [1] utilizes a Reinforcement
Learning (RL) framework to generate highly efficient coflow scheduling policies
automatically. It employs a graph neural network (GNN) to encode the job
information in a set of embedding vectors, and feeds a flat embedding vector
containing the whole job information to the policy network. However, this
method scales poorly: it cannot handle jobs represented by DAGs of arbitrary
sizes and shapes, because doing so requires a large policy network that
processes a high-dimensional embedding vector and is therefore difficult to
train. In
this paper, we first utilize a directed acyclic graph neural network (DAGNN) to
process the input and propose a novel Pipelined-DAGNN, which can effectively
speed up the feature extraction process of the DAGNN. Next, we feed the
embedding sequence composed of schedulable coflows instead of a flat embedding
of all coflows to the policy network, and output a priority sequence, which
makes the size of the policy network depend only on the feature dimension
rather than on the product of the feature dimension and the number of nodes in
the job's DAG. Furthermore, to improve the accuracy of the priority scheduling
policy, we incorporate the self-attention mechanism into the deep RL model to
capture the interactions between different parts of the embedding sequence, so
that the output priority scores reflect these interactions. Based on this
model, we then develop a coflow scheduling algorithm for online multi-stage
jobs.
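As a rough, hedged illustration of the first step described in the abstract (this is not the authors' Pipelined-DAGNN; all names, shapes, and the aggregation rule are assumptions), the sketch below groups the coflow nodes of a job DAG into topological levels and embeds them level by level, each node aggregating its parents' embeddings. Processing whole levels at a time is the kind of structure that lends itself to a pipelined feature-extraction schedule.

```python
# Hedged sketch (assumed names/shapes, not the paper's Pipelined-DAGNN):
# embed a job DAG of coflows level by level, each node mixing its own
# features with the mean embedding of its parents.
import numpy as np

rng = np.random.default_rng(0)
D = 8  # assumed feature/embedding dimension


def topological_levels(parents, n):
    """Group node ids by topological level (nodes within a level have no
    dependencies on each other, so a level can be processed as one batch)."""
    indeg = [len(parents[v]) for v in range(n)]
    children = [[] for _ in range(n)]
    for v in range(n):
        for p in parents[v]:
            children[p].append(v)
    level, ready, order = [0] * n, [v for v in range(n) if indeg[v] == 0], []
    while ready:
        v = ready.pop()
        order.append(v)
        for c in children[v]:
            level[c] = max(level[c], level[v] + 1)
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    grouped = {}
    for v in order:
        grouped.setdefault(level[v], []).append(v)
    return [grouped[k] for k in sorted(grouped)]


def embed_dag(features, parents, W_self, W_parent):
    """One DAGNN-style pass: process levels in order; each node combines its
    own features with the mean embedding of its (already embedded) parents."""
    n = features.shape[0]
    h = np.zeros((n, D))
    for lvl in topological_levels(parents, n):
        for v in lvl:
            agg = h[parents[v]].mean(axis=0) if parents[v] else np.zeros(D)
            h[v] = np.tanh(features[v] @ W_self + agg @ W_parent)
    return h


# Toy job DAG of 5 coflows: 2 depends on 0; 3 depends on 0 and 1; 4 on 2 and 3.
parents = [[], [], [0], [0, 1], [2, 3]]
features = rng.normal(size=(5, D))
W_self, W_parent = rng.normal(size=(D, D)), rng.normal(size=(D, D))
embeddings = embed_dag(features, parents, W_self, W_parent)
print(topological_levels(parents, 5))   # [[1, 0], [3, 2], [4]]
print(embeddings.shape)                 # (5, 8)
```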
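The second step, feeding only the schedulable coflows' embeddings to a self-attention policy head that outputs one priority score per coflow, can be sketched in the same spirit. The single attention head, the shared linear score head, and all dimensions are assumptions for illustration; the point is that the parameter count depends only on the embedding dimension, not on the number of nodes in the job's DAG.

```python
# Hedged sketch (assumed shapes/names, not the paper's policy network):
# single-head self-attention over the embedding sequence of schedulable
# coflows, followed by a shared linear head that maps each attended
# embedding to a priority score.
import numpy as np

rng = np.random.default_rng(1)
D = 8  # assumed embedding dimension


def priority_scores(h_sched, Wq, Wk, Wv, w_out):
    """h_sched: (num_schedulable_coflows, D) embedding sequence."""
    Q, K, V = h_sched @ Wq, h_sched @ Wk, h_sched @ Wv
    att = Q @ K.T / np.sqrt(D)                    # pairwise interactions
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)         # row-wise softmax
    return (att @ V) @ w_out                      # one score per coflow


Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
w_out = rng.normal(size=D)

# Stand-in for the embeddings of the currently schedulable coflows
# (e.g. produced by a DAG embedding pass like the previous sketch).
h_sched = rng.normal(size=(3, D))
scores = priority_scores(h_sched, Wq, Wk, Wv, w_out)
print(scores.round(3))  # one priority score per schedulable coflow
```

One natural way to turn these scores into the priority sequence mentioned in the abstract is to sort the schedulable coflows by decreasing score; the exact decoding rule used by the authors is not specified in the abstract.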
Related papers
- Online Parallel Multi-Task Relationship Learning via Alternating Direction Method of Multipliers [37.859185005986056]
Online multi-task learning (OMTL) enhances streaming data processing by leveraging the inherent relations among multiple tasks.
This study proposes a novel OMTL framework based on the alternating direction method of multipliers (ADMM), an optimization method well suited to distributed computing environments.
arXiv Detail & Related papers (2024-11-09T10:20:13Z) - Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.
We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z) - DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z) - Edge Generation Scheduling for DAG Tasks Using Deep Reinforcement
Learning [2.365237699556817]
Directed acyclic graph (DAG) tasks are currently adopted in the real-time domain to model complex applications.
We propose a new DAG scheduling framework that attempts to minimize the DAG width by iteratively generating edges.
We evaluate the effectiveness of the proposed algorithm by comparing it with state-of-the-art DAG scheduling approaches and an optimal mixed-integer linear programming baseline.
arXiv Detail & Related papers (2023-08-28T15:19:18Z) - Scheduling Inference Workloads on Distributed Edge Clusters with
Reinforcement Learning [11.007816552466952]
This paper focuses on the problem of scheduling inference queries on Deep Neural Networks in edge networks at short timescales.
By means of simulations, we analyze several policies in the realistic network settings and workloads of a large ISP.
We design ASET, a Reinforcement Learning based scheduling algorithm able to adapt its decisions according to the system conditions.
arXiv Detail & Related papers (2023-01-31T13:23:34Z) - Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z) - Multi-objective Optimization of Clustering-based Scheduling for
Multi-workflow On Clouds Considering Fairness [4.021507306414546]
This paper defines a new multi-objective optimization model based on makespan, cost, and fairness, and then proposes a global clustering-based multi-workflow scheduling strategy for resource allocation.
Experimental results show that the proposed approach performs better than the compared algorithms, without significantly compromising the overall makespan and cost or individual fairness.
arXiv Detail & Related papers (2022-05-23T10:25:16Z) - JUMBO: Scalable Multi-task Bayesian Optimization using Offline Data [86.8949732640035]
We propose JUMBO, an MBO algorithm that sidesteps these limitations by querying additional data.
We show that it achieves a no-regret guarantee under conditions analogous to those of GP-UCB.
Empirically, we demonstrate significant performance improvements over existing approaches on two real-world optimization problems.
arXiv Detail & Related papers (2021-06-02T05:03:38Z) - Better than the Best: Gradient-based Improper Reinforcement Learning for
Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy-gradient-based reinforcement learning algorithm that produces a scheduler performing better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z) - Deep Reinforcement Learning for Resource Constrained Multiclass
Scheduling in Wireless Networks [0.0]
In our setup, the available limited bandwidth resources are allocated in order to serve randomly arriving service demands.
We propose a distributional Deep Deterministic Policy Gradient (DDPG) algorithm combined with Deep Sets to tackle the problem.
Our proposed algorithm is tested on both synthetic and real data, showing consistent gains against state-of-the-art conventional methods.
arXiv Detail & Related papers (2020-11-27T09:49:38Z) - Policy-GNN: Aggregation Optimization for Graph Neural Networks [60.50932472042379]
Graph neural networks (GNNs) aim to model the local graph structures and capture the hierarchical patterns by aggregating the information from neighbors.
It is a challenging task to develop an effective aggregation strategy for each node, given complex graphs and sparse features.
We propose Policy-GNN, a meta-policy framework that models the sampling procedure and message passing of GNNs as a combined learning process; a toy sketch of this per-node aggregation-depth idea appears after this list.
arXiv Detail & Related papers (2020-06-26T17:03:06Z)