Learning Mean-Field Control for Delayed Information Load Balancing in
Large Queuing Systems
- URL: http://arxiv.org/abs/2208.04777v1
- Date: Tue, 9 Aug 2022 13:47:19 GMT
- Title: Learning Mean-Field Control for Delayed Information Load Balancing in
Large Queuing Systems
- Authors: Anam Tahir, Kai Cui, Heinz Koeppl
- Abstract summary: In this work, we consider a multi-agent load balancing system, with delayed information, consisting of many clients (load balancers) and many parallel queues.
We apply policy gradient reinforcement learning algorithms to find an optimal load balancing solution.
Our approach is not only scalable but also shows good performance when compared to the state-of-the-art power-of-d variant of the Join-the-Shortest-Queue (JSQ) policy.
- Score: 26.405495663998828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have seen a great increase in the capacity and parallel
processing power of data centers and cloud services. To fully utilize these
distributed systems, optimal load balancing for parallel queuing architectures
must be realized. Existing state-of-the-art solutions fail to consider the
effect of communication delays on the behaviour of very large systems with many
clients. In this work, we consider a multi-agent load balancing system, with
delayed information, consisting of many clients (load balancers) and many
parallel queues. In order to obtain a tractable solution, we model this system
as a mean-field control problem with enlarged state-action space in discrete
time through exact discretization. Subsequently, we apply policy gradient
reinforcement learning algorithms to find an optimal load balancing solution.
Here, the discrete-time system model incorporates a synchronization delay under
which the queue state information is synchronously broadcasted and updated at
all clients. We then provide theoretical performance guarantees for our
methodology in large systems. Finally, our experiments demonstrate that our
approach is not only scalable but also performs well compared to the
state-of-the-art power-of-d variant of the Join-the-Shortest-Queue (JSQ) policy
and other policies in the presence of synchronization delays.
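To make the modelled setting concrete, the sketch below simulates many parallel queues served by many clients under the synchronization delay described above: true queue lengths are only broadcast to the clients every `sync_delay` steps, and each client routes with the power-of-d variant of JSQ on that stale observation. This is a minimal illustration, not the authors' implementation; all parameters (`sync_delay`, `arrival_prob`, `service_prob`, `buffer_size`, `d`) and the discrete-time dynamics are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(num_queues=100, num_clients=100, horizon=2000,
             arrival_prob=0.85, service_prob=0.95, buffer_size=20,
             sync_delay=5, d=2):
    """Power-of-d JSQ routing under synchronously delayed queue information.

    Each step, every client receives a job with probability `arrival_prob`
    and forwards it to the shorter of `d` randomly sampled queues -- judged
    by the last *broadcast* queue lengths, not the true current ones.
    Returns the time-averaged mean queue length (a proxy for waiting time).
    """
    queues = np.zeros(num_queues, dtype=int)   # true queue lengths
    observed = queues.copy()                   # stale, synchronously broadcast copy
    total = 0.0
    for t in range(horizon):
        if t % sync_delay == 0:                # synchronized broadcast to all clients
            observed = queues.copy()
        for _ in range(num_clients):           # each client routes on stale info
            if rng.random() < arrival_prob:
                candidates = rng.choice(num_queues, size=d, replace=False)
                target = candidates[np.argmin(observed[candidates])]
                if queues[target] < buffer_size:   # finite buffer: drop if full
                    queues[target] += 1
        # each busy queue finishes one job with probability `service_prob`
        departures = rng.random(num_queues) < service_prob
        queues = np.maximum(queues - departures, 0)
        total += queues.mean()
    return total / horizon

if __name__ == "__main__":
    for delay in (1, 5, 20):
        print(f"sync_delay={delay:3d}: avg queue length {simulate(sync_delay=delay):.2f}")
```

Replacing the fixed JSQ(d) rule with a routing distribution learned by policy gradient over the (stale) observed queue-length distribution is, roughly, the decision problem that the mean-field control formulation above makes tractable for large numbers of clients and queues.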
Related papers
- Digital Twin-Assisted Federated Learning with Blockchain in Multi-tier Computing Systems [67.14406100332671]
In Industry 4.0 systems, resource-constrained edge devices engage in frequent data interactions.
This paper proposes a digital twin (DT) and federated learning (FL) scheme.
The efficacy of our proposed cooperative interference-based FL process has been verified through numerical analysis.
arXiv Detail & Related papers (2024-11-04T17:48:02Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Queuing dynamics of asynchronous Federated Learning [15.26212962081762]
We study asynchronous federated learning mechanisms with nodes having potentially different computational speeds.
We propose a non-uniform sampling scheme for the central server that allows for lower delays with better complexity.
Our experiments clearly show a significant improvement of our method over current state-of-the-art asynchronous algorithms on an image classification problem.
arXiv Detail & Related papers (2024-02-12T18:32:35Z)
- Learning Distributed and Fair Policies for Network Load Balancing as Markov Potential Game [4.892398873024191]
This paper investigates the network load balancing problem in data centers (DCs) where multiple load balancers (LBs) are deployed.
The challenges of this problem consist of the heterogeneous processing architecture and dynamic environments.
We formulate the multi-agent load balancing problem as a Markov potential game, with a carefully designed workload distribution fairness measure as the potential function (see the illustrative fairness sketch after this list).
A fully distributed MARL algorithm is proposed to approximate the Nash equilibrium of the game.
arXiv Detail & Related papers (2022-06-03T08:29:02Z)
- Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Blockchain-enabled Server-less Federated Learning [5.065631761462706]
We focus on an asynchronous server-less Federated Learning solution empowered by blockchain (BC) technology.
In contrast to mostly adopted FL approaches, we advocate an asynchronous method whereby model aggregation is done as clients submit their local updates.
arXiv Detail & Related papers (2021-12-15T07:41:23Z)
- Scheduling in Parallel Finite Buffer Systems: Optimal Decisions under Delayed Feedback [29.177402567437206]
We present a partially observable (PO) model that captures the scheduling decisions in parallel queuing systems under limited information of delayed acknowledgements.
We numerically show that the resulting policy outperforms other limited information scheduling strategies.
We show how our approach can optimise real-time parallel processing using network data provided by Kaggle.
arXiv Detail & Related papers (2021-09-17T13:45:02Z)
- BAGUA: Scaling up Distributed Learning with System Relaxations [31.500494636704598]
BAGUA is a new communication framework for distributed data-parallel training.
Powered by its new system design, BAGUA can implement and extend various state-of-the-art distributed learning algorithms.
In a production cluster with up to 16 machines, BAGUA can outperform PyTorch-DDP, Horovod and BytePS in the end-to-end training time.
arXiv Detail & Related papers (2021-07-03T21:27:45Z)
- Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z)
- Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z)
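As a pointer for the Markov potential game entry above ("Learning Distributed and Fair Policies for Network Load Balancing"), one concrete way to picture "workload distribution fairness as the potential function" is to score the servers' workloads with a fairness index and reward every agent by the change in that shared score. The sketch below is an illustrative assumption, not code from the cited paper; the choice of Jain's fairness index and the helper names `jain_fairness` / `potential_reward` are hypothetical.

```python
import numpy as np

def jain_fairness(workloads) -> float:
    """Jain's fairness index: 1.0 when all servers carry equal load,
    close to 1/n when a single server carries everything."""
    w = np.asarray(workloads, dtype=float)
    if not w.any():
        return 1.0                      # no load at all is trivially balanced
    return w.sum() ** 2 / (len(w) * (w ** 2).sum())

def potential_reward(workloads_before, workloads_after) -> float:
    """In a potential game, the reward change an agent sees for a unilateral
    action equals the change in the shared potential function."""
    return jain_fairness(workloads_after) - jain_fairness(workloads_before)

if __name__ == "__main__":
    before = np.array([10, 0, 0, 0])    # one server holds all the work
    after = np.array([4, 3, 2, 1])      # agents spread the load
    print(jain_fairness(before))        # 0.25
    print(jain_fairness(after))         # ~0.83
    print(potential_reward(before, after))  # positive: fairness improved
```

Because all agents share this single potential, independent policy updates that improve each agent's own reward also (approximately) ascend the common fairness objective, which is part of what makes the game's Nash equilibria amenable to fully distributed MARL.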
This list is automatically generated from the titles and abstracts of the papers on this site.