Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters
- URL: http://arxiv.org/abs/2010.15206v3
- Date: Wed, 27 Oct 2021 00:12:54 GMT
- Title: Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters
- Authors: Qiong Wu, Zhenming Liu
- Abstract summary: We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters.
Rosella automatically learns the compute environment and adjusts its scheduling policy in real-time.
We evaluate Rosella with a variety of workloads on a 32-node AWS cluster.
- Score: 7.206919625027208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale interactive web services and advanced AI applications make
sophisticated decisions in real-time, based on executing a massive amount of
computation tasks on thousands of servers. Task schedulers, which often operate
in heterogeneous and volatile environments, require high throughput, i.e.,
scheduling millions of tasks per second, and low latency, i.e., incurring
minimal scheduling delays for millisecond-level tasks. Scheduling is further
complicated by other users' workloads in a shared system, other background
activities, and the diverse hardware configurations inside datacenters.
We present Rosella, a new self-driving, distributed approach for task
scheduling in heterogeneous clusters. Rosella automatically learns the compute
environment and adjusts its scheduling policy in real-time. The solution
provides high throughput and low latency simultaneously because it runs in
parallel on multiple machines with minimal coordination and performs only
simple operations for each scheduling decision. Our learning module monitors
total system load and uses this information to dynamically determine the optimal
estimation strategy for the backends' compute power. Rosella generalizes
power-of-two-choice algorithms to handle heterogeneous workers, reducing the
max queue length of O(log n) obtained by prior algorithms to O(log log n). We
evaluate Rosella with a variety of workloads on a 32-node AWS cluster.
Experimental results show that Rosella significantly reduces task response
time, and adapts to environment changes quickly.
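The core idea behind power-of-two-choices scheduling can be illustrated with a short sketch. This is not Rosella's actual algorithm: Rosella estimates worker compute power online from load monitoring, whereas the sketch below assumes speeds are known up front and only shows the heterogeneity-aware tiebreak, i.e., comparing estimated drain time (queue length divided by speed) rather than raw queue length.

```python
import random

def two_choice_assign(queues, speeds, rng=random):
    """Sample two workers uniformly at random and send the task to the
    one with the shorter estimated drain time (queue length / speed).
    With homogeneous speeds this reduces to classic power-of-two-choices."""
    i, j = rng.sample(range(len(queues)), 2)
    drain_i = len(queues[i]) / speeds[i]
    drain_j = len(queues[j]) / speeds[j]
    return i if drain_i <= drain_j else j

# Toy run: 8 workers with unequal (assumed known) speeds, 10_000 tasks.
rng = random.Random(0)
speeds = [1.0, 1.0, 2.0, 2.0, 4.0, 4.0, 8.0, 8.0]
queues = [[] for _ in speeds]
for task in range(10_000):
    w = two_choice_assign(queues, speeds, rng)
    queues[w].append(task)

# Faster workers accumulate proportionally more tasks, since the
# drain-time comparison equalizes len(queue) / speed across workers.
print([len(q) for q in queues])
```

Comparing drain time instead of queue length is what makes the two-choice rule meaningful under heterogeneity: a long queue on a fast worker may still drain sooner than a short queue on a slow one.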
Related papers
- Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs [3.7758841366694353]
We survey scheduling techniques from the literature and from practical serving systems.
We find that schedulers from the literature often achieve good performance but introduce significant complexity.
In contrast, schedulers in practical deployments often leave easy performance gains on the table but are easy to implement, deploy and configure.
arXiv Detail & Related papers (2024-10-23T13:05:46Z)
- Toward Smart Scheduling in Tapis [1.0377683220196874]
We present our efforts to develop an intelligent job scheduling capability in Tapis.
We focus on one such specific challenge: predicting queue times for a job on different HPC systems and queues.
Our first set of results cast the problem as a regression, which can be used to select the best system from a list of existing options.
arXiv Detail & Related papers (2024-08-05T20:01:31Z)
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload on edge devices.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Learning Coordination Policies over Heterogeneous Graphs for Human-Robot Teams via Recurrent Neural Schedule Propagation [0.0]
HybridNet is a deep learning-based framework for scheduling human-robot teams.
We develop a virtual scheduling environment for mixed human-robot teams in a multiround setting.
arXiv Detail & Related papers (2023-01-30T20:42:06Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS)
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- GCNScheduler: Scheduling Distributed Computing Applications using Graph Convolutional Networks [12.284934135116515]
We propose a graph convolutional network-based scheduler (GCNScheduler)
By carefully integrating an inter-task data dependency structure with network settings into an input graph, the GCNScheduler can efficiently schedule tasks for a given objective.
We show that it achieves a better makespan than the classic HEFT algorithm, and almost the same throughput as the throughput-oriented HEFT.
arXiv Detail & Related papers (2021-10-22T01:54:10Z)
- Scheduling in Parallel Finite Buffer Systems: Optimal Decisions under Delayed Feedback [29.177402567437206]
We present a partially observable (PO) model that captures the scheduling decisions in parallel queuing systems under limited information of delayed acknowledgements.
We numerically show that the resulting policy outperforms other limited information scheduling strategies.
We show how our approach can optimize real-time parallel processing using network data provided by Kaggle.
arXiv Detail & Related papers (2021-09-17T13:45:02Z)
- Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z)
- Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z)
- Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.