Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters
- URL: http://arxiv.org/abs/2010.15206v3
- Date: Wed, 27 Oct 2021 00:12:54 GMT
- Title: Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters
- Authors: Qiong Wu, Zhenming Liu
- Abstract summary: We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters.
Rosella automatically learns the compute environment and adjusts its scheduling policy in real-time.
We evaluate Rosella with a variety of workloads on a 32-node AWS cluster.
- Score: 7.206919625027208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale interactive web services and advanced AI applications make
sophisticated decisions in real-time, based on executing a massive amount of
computation tasks on thousands of servers. Task schedulers, which often operate
in heterogeneous and volatile environments, require high throughput, i.e.,
scheduling millions of tasks per second, and low latency, i.e., incurring
minimal scheduling delays for millisecond-level tasks. Scheduling is further
complicated by other users' workloads in a shared system, other background
activities, and the diverse hardware configurations inside datacenters.
We present Rosella, a new self-driving, distributed approach for task
scheduling in heterogeneous clusters. Rosella automatically learns the compute
environment and adjusts its scheduling policy in real-time. The solution
provides high throughput and low latency simultaneously because it runs in
parallel on multiple machines with minimum coordination and only performs
simple operations for each scheduling decision. Our learning module monitors
total system load and uses the information to dynamically determine an optimal
estimation strategy for the backends' compute power. Rosella generalizes
power-of-two-choice algorithms to handle heterogeneous workers, reducing the
max queue length of O(log n) obtained by prior algorithms to O(log log n). We
evaluate Rosella with a variety of workloads on a 32-node AWS cluster.
Experimental results show that Rosella significantly reduces task response
time, and adapts to environment changes quickly.
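
The power-of-two-choices idea mentioned in the abstract can be illustrated with a small static simulation. The sketch below is not taken from the paper: the function names, the speed-proportional sampling, and the tie-breaking rule (join the candidate with the smallest estimated wait, computed as queue length plus one divided by speed) are assumptions made for illustration. It places tasks on workers with different speeds and compares one random choice against two choices per task.

```python
import itertools
import random


def assign_tasks(speeds, num_tasks, choices=2, seed=0):
    """Place num_tasks tasks on workers (no departures) and return queue lengths.

    Hypothetical rule: each task samples `choices` candidate workers with
    probability proportional to speed, then joins the candidate with the
    smallest estimated wait, (queue length + 1) / speed.
    """
    rng = random.Random(seed)
    workers = range(len(speeds))
    cum_weights = list(itertools.accumulate(speeds))  # reused for every sample
    queues = [0] * len(speeds)
    for _ in range(num_tasks):
        candidates = rng.choices(workers, cum_weights=cum_weights, k=choices)
        target = min(candidates, key=lambda w: (queues[w] + 1) / speeds[w])
        queues[target] += 1
    return queues


def max_normalized_load(queues, speeds):
    """Largest queue length divided by worker speed."""
    return max(q / s for q, s in zip(queues, speeds))


if __name__ == "__main__":
    # Assumed toy cluster: 500 slow workers and 500 workers that are 4x faster.
    speeds = [1.0] * 500 + [4.0] * 500
    for k in (1, 2):
        queues = assign_tasks(speeds, num_tasks=5000, choices=k)
        print(f"choices={k}: max normalized load = "
              f"{max_normalized_load(queues, speeds):.2f}")
```

With `choices=1` the scheduler ignores queue state entirely, so the comparison isolates the effect of probing a second worker. The simulation has no task departures or speed estimation, so it only hints at the queue-length behaviour the abstract describes.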
Related papers
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse
Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload on edge devices.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Scheduling Inference Workloads on Distributed Edge Clusters with
Reinforcement Learning [11.007816552466952]
This paper focuses on the problem of scheduling inference queries on Deep Neural Networks in edge networks at short timescales.
By means of simulations, we analyze several policies in the realistic network settings and workloads of a large ISP.
We design ASET, a Reinforcement Learning based scheduling algorithm able to adapt its decisions according to the system conditions.
arXiv Detail & Related papers (2023-01-31T13:23:34Z)
- Learning Coordination Policies over Heterogeneous Graphs for Human-Robot
Teams via Recurrent Neural Schedule Propagation [0.0]
HybridNet is a deep learning-based framework for scheduling human-robot teams.
We develop a virtual scheduling environment for mixed human-robot teams in a multiround setting.
arXiv Detail & Related papers (2023-01-30T20:42:06Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- Scheduling Servers with Stochastic Bilinear Rewards [3.5408022972081685]
We study a multi-class, multi-server queueing system with rewards of job-server assignments following a bilinear model in feature vectors representing jobs and servers.
We propose a scheduling algorithm that uses a linear bandit algorithm along with dynamic allocation of jobs to servers.
arXiv Detail & Related papers (2021-12-13T00:37:20Z)
- GCNScheduler: Scheduling Distributed Computing Applications using Graph
Convolutional Networks [12.284934135116515]
We propose a graph convolutional network-based scheduler (GCNScheduler).
By carefully integrating an inter-task data dependency structure with network settings into an input graph, the GCNScheduler can efficiently schedule tasks for a given objective.
We show that it achieves better makespan than the classic HEFT algorithm, and almost the same throughput as throughput-oriented HEFT.
arXiv Detail & Related papers (2021-10-22T01:54:10Z)
- Scheduling in Parallel Finite Buffer Systems: Optimal Decisions under
Delayed Feedback [29.177402567437206]
We present a partially observable (PO) model that captures the scheduling decisions in parallel queuing systems under limited information of delayed acknowledgements.
We numerically show that the resulting policy outperforms other limited information scheduling strategies.
We show how our approach can optimise the real-time parallel processing by using network data provided by Kaggle.
arXiv Detail & Related papers (2021-09-17T13:45:02Z)
- Better than the Best: Gradient-based Improper Reinforcement Learning for
Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z)
- Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud
System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z)
- Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal
Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.