Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters
- URL: http://arxiv.org/abs/2010.15206v3
- Date: Wed, 27 Oct 2021 00:12:54 GMT
- Title: Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters
- Authors: Qiong Wu, Zhenming Liu
- Abstract summary: We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters.
Rosella automatically learns the compute environment and adjusts its scheduling policy in real-time.
We evaluate Rosella with a variety of workloads on a 32-node AWS cluster.
- Score: 7.206919625027208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale interactive web services and advanced AI applications make
sophisticated decisions in real-time, based on executing a massive amount of
computation tasks on thousands of servers. Task schedulers, which often operate
in heterogeneous and volatile environments, require high throughput, i.e.,
scheduling millions of tasks per second, and low latency, i.e., incurring
minimal scheduling delays for millisecond-level tasks. Scheduling is further
complicated by other users' workloads in a shared system, other background
activities, and the diverse hardware configurations inside datacenters.
We present Rosella, a new self-driving, distributed approach for task
scheduling in heterogeneous clusters. Rosella automatically learns the compute
environment and adjusts its scheduling policy in real-time. The solution
provides high throughput and low latency simultaneously because it runs in
parallel on multiple machines with minimal coordination and performs only
simple operations for each scheduling decision. Our learning module monitors
total system load and uses this information to dynamically determine the optimal
estimation strategy for the backends' compute power. Rosella generalizes
power-of-two-choice algorithms to handle heterogeneous workers, reducing the
max queue length of O(log n) obtained by prior algorithms to O(log log n). We
evaluate Rosella with a variety of workloads on a 32-node AWS cluster.
Experimental results show that Rosella significantly reduces task response
time, and adapts to environment changes quickly.
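The core idea behind power-of-two-choices scheduling can be illustrated with a short sketch. This is not Rosella's actual algorithm: Rosella estimates worker compute power online from load monitoring, whereas the sketch below assumes speeds are known up front and only shows the heterogeneity-aware tiebreak, i.e., comparing estimated drain time (queue length divided by speed) rather than raw queue length.

```python
import random

def two_choice_assign(queues, speeds, rng=random):
    """Sample two workers uniformly at random and send the task to the
    one with the shorter estimated drain time (queue length / speed).
    With homogeneous speeds this reduces to classic power-of-two-choices."""
    i, j = rng.sample(range(len(queues)), 2)
    drain_i = len(queues[i]) / speeds[i]
    drain_j = len(queues[j]) / speeds[j]
    return i if drain_i <= drain_j else j

# Toy run: 8 workers with unequal (assumed known) speeds, 10_000 tasks.
rng = random.Random(0)
speeds = [1.0, 1.0, 2.0, 2.0, 4.0, 4.0, 8.0, 8.0]
queues = [[] for _ in speeds]
for task in range(10_000):
    w = two_choice_assign(queues, speeds, rng)
    queues[w].append(task)

# Faster workers accumulate proportionally more tasks, since the
# drain-time comparison equalizes len(queue) / speed across workers.
print([len(q) for q in queues])
```

Comparing drain time instead of queue length is what makes the two-choice rule meaningful under heterogeneity: a long queue on a fast worker may still drain sooner than a short queue on a slow one.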
Related papers
- Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs [3.7758841366694353]
We survey scheduling techniques from the literature and from practical serving systems.
We find that schedulers from the literature often achieve good performance but introduce significant complexity.
In contrast, schedulers in practical deployments often leave easy performance gains on the table but are easy to implement, deploy and configure.
arXiv Detail & Related papers (2024-10-23T13:05:46Z)
- Toward Smart Scheduling in Tapis [1.0377683220196874]
We present our efforts to develop an intelligent job scheduling capability in Tapis.
We focus on one such specific challenge: predicting queue times for a job on different HPC systems and queues.
Our first set of results cast the problem as a regression, which can be used to select the best system from a list of existing options.
arXiv Detail & Related papers (2024-08-05T20:01:31Z)
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload on edge devices.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Learning Coordination Policies over Heterogeneous Graphs for Human-Robot Teams via Recurrent Neural Schedule Propagation [0.0]
HybridNet is a deep learning-based framework for scheduling human-robot teams.
We develop a virtual scheduling environment for mixed human-robot teams in a multiround setting.
arXiv Detail & Related papers (2023-01-30T20:42:06Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS)
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- GCNScheduler: Scheduling Distributed Computing Applications using Graph Convolutional Networks [12.284934135116515]
We propose a graph convolutional network-based scheduler (GCNScheduler)
By carefully integrating an inter-task data dependency structure with network settings into an input graph, the GCNScheduler can efficiently schedule tasks for a given objective.
We show that it achieves a better makespan than the classic HEFT algorithm, and almost the same throughput as the throughput-oriented HEFT.
arXiv Detail & Related papers (2021-10-22T01:54:10Z)
- Scheduling in Parallel Finite Buffer Systems: Optimal Decisions under Delayed Feedback [29.177402567437206]
We present a partially observable (PO) model that captures the scheduling decisions in parallel queuing systems under limited information of delayed acknowledgements.
We numerically show that the resulting policy outperforms other limited information scheduling strategies.
We show how our approach can optimize real-time parallel processing using network data provided by Kaggle.
arXiv Detail & Related papers (2021-09-17T13:45:02Z)
- Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z)
- Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z)
- Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.