Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System
- URL: http://arxiv.org/abs/2101.06582v1
- Date: Sun, 17 Jan 2021 03:45:25 GMT
- Title: Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System
- Authors: Yiwen Han and Shihao Shen and Xiaofei Wang and Shiqiang Wang and Victor C.M. Leung
- Abstract summary: We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
- Score: 54.588242387136376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Kubernetes (k8s) has the potential to merge the distributed edge and the
cloud but lacks a scheduling framework specifically for edge-cloud systems.
Moreover, the hierarchical distribution of heterogeneous resources and the
complex dependencies among requests and resources make the modeling and
scheduling of k8s-oriented edge-cloud systems particularly sophisticated. In
this paper, we introduce KaiS, a learning-based scheduling framework for such
edge-cloud systems to improve the long-term throughput rate of request
processing. First, we design a coordinated multi-agent actor-critic algorithm
to cater to decentralized request dispatch and dynamic dispatch spaces within
the edge cluster. Second, for diverse system scales and structures, we use
graph neural networks to embed system state information, and combine the
embedding results with multiple policy networks to reduce the orchestration
dimensionality by stepwise scheduling. Finally, we adopt a two-time-scale
scheduling mechanism to harmonize request dispatch and service orchestration,
and present the implementation design of deploying the above algorithms
compatible with native k8s components. Experiments using real workload traces
show that KaiS can successfully learn appropriate scheduling policies,
irrespective of request arrival patterns and system scales. Moreover, KaiS can
enhance the average system throughput rate by 14.3% while reducing scheduling
cost by 34.7% compared to baselines.
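The two-time-scale mechanism described in the abstract interleaves frequent, per-request dispatch decisions with less frequent service orchestration. A minimal sketch of this control structure is below; the greedy dispatch rule, the crude load-rebalancing orchestration, and all names are illustrative assumptions standing in for KaiS's learned multi-agent and GNN-based policies, not the paper's implementation:

```python
def dispatch(request, nodes):
    """Fast time scale: pick the node with the most free capacity.
    (Stand-in for KaiS's learned multi-agent dispatch policy.)"""
    return max(nodes, key=lambda n: n["capacity"] - n["load"])

def orchestrate(nodes):
    """Slow time scale: rebalance placement across the cluster.
    (Stand-in for the GNN-guided service orchestration policy.)"""
    mean_load = sum(n["load"] for n in nodes) / len(nodes)
    for n in nodes:
        n["load"] = mean_load  # crude rebalancing, for illustration only

def run(requests, nodes, orchestration_period=5):
    """Interleave per-request dispatch with periodic orchestration."""
    for t, req in enumerate(requests):
        if t % orchestration_period == 0:
            orchestrate(nodes)       # slow time scale: every K steps
        node = dispatch(req, nodes)  # fast time scale: every request
        node["load"] += req
    return nodes
```

The point of the separation is that dispatch must react at request-arrival granularity, while orchestration (migrating or rescaling services) is expensive and is only worth doing at a coarser period.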
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- CoRaiS: Lightweight Real-Time Scheduler for Multi-Edge Cooperative Computing [32.99310493126955]
Multi-edge cooperative computing that combines constrained resources of multiple edges into a powerful resource pool has the potential to deliver great benefits.
However, the composition of massive heterogeneous resources and the lack of scheduling strategies make the modeling and coordination of multi-edge computing systems particularly complicated.
This paper first proposes a system-level state evaluation model that abstracts away complex hardware configurations and redefines the service capabilities of heterogeneous edges.
arXiv Detail & Related papers (2024-02-04T07:21:45Z)
- GPU Cluster Scheduling for Network-Sensitive Deep Learning [19.344426053952464]
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads.
Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling.
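The first component this entry names, classical delay scheduling, can be sketched in a few lines: a job briefly holds out for a slot on one of its preferred (e.g. data-local) nodes before accepting any offered slot. The job/slot model and the timer value below are illustrative assumptions, not this paper's implementation:

```python
def accept_slot(job, offered_node, now, max_delay=3.0):
    """Classical delay scheduling: accept a preferred slot immediately;
    accept a non-preferred slot only after waiting max_delay time units."""
    if offered_node in job["preferred_nodes"]:
        return True  # locality satisfied: take the slot right away
    waited = now - job["submitted_at"]
    return waited >= max_delay  # otherwise accept only once the timer expires
```

The paper's "auto-tuner" component corresponds to choosing `max_delay` adaptively rather than fixing it, since the best timer depends on workload and network sensitivity.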
arXiv Detail & Related papers (2024-01-29T19:06:08Z)
- Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning [55.49099125128281]
We propose a non-orthogonal multiple access (NOMA) enabled HFL system under semi-synchronous cloud model aggregation.
We show that the proposed scheme outperforms the considered benchmarks regarding HFL performance improvement and total cost reduction.
arXiv Detail & Related papers (2023-11-03T13:34:44Z)
- EGRC-Net: Embedding-induced Graph Refinement Clustering Network [66.44293190793294]
We propose a novel graph clustering network called Embedding-Induced Graph Refinement Clustering Network (EGRC-Net)
EGRC-Net effectively utilizes the learned embedding to adaptively refine the initial graph and enhance the clustering performance.
Our proposed methods consistently outperform several state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-19T09:08:43Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model uses only 37% of the baseline's network parameters, and its average solution gap to the expert solutions decreases from 6.8% to 1.3%.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- Scheduling in Parallel Finite Buffer Systems: Optimal Decisions under Delayed Feedback [29.177402567437206]
We present a partially observable (PO) model that captures the scheduling decisions in parallel queuing systems under limited information of delayed acknowledgements.
We numerically show that the resulting policy outperforms other limited information scheduling strategies.
We show how our approach can optimise real-time parallel processing using network data provided by Kaggle.
arXiv Detail & Related papers (2021-09-17T13:45:02Z)
- BAGUA: Scaling up Distributed Learning with System Relaxations [31.500494636704598]
BAGUA is a new communication framework for distributed data-parallel training.
Powered by the new system design, BAGUA has a great ability to implement and extend various state-of-the-art distributed learning algorithms.
In a production cluster with up to 16 machines, BAGUA can outperform PyTorch-DDP, Horovod and BytePS in the end-to-end training time.
arXiv Detail & Related papers (2021-07-03T21:27:45Z)
- Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.