Learning Mean-Field Control for Delayed Information Load Balancing in
Large Queuing Systems
- URL: http://arxiv.org/abs/2208.04777v1
- Date: Tue, 9 Aug 2022 13:47:19 GMT
- Title: Learning Mean-Field Control for Delayed Information Load Balancing in
Large Queuing Systems
- Authors: Anam Tahir, Kai Cui, Heinz Koeppl
- Abstract summary: In this work, we consider a multi-agent load balancing system, with delayed information, consisting of many clients (load balancers) and many parallel queues.
We apply policy gradient reinforcement learning algorithms to find an optimal load balancing solution.
Our approach is not only scalable but also shows good performance when compared to the state-of-the-art power-of-d variant of the Join-the-Shortest-Queue (JSQ) policy.
- Score: 26.405495663998828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have seen a great increase in the capacity and parallel
processing power of data centers and cloud services. To fully utilize these
distributed systems, optimal load balancing for parallel queuing architectures
must be realized. Existing state-of-the-art solutions fail to consider the
effect of communication delays on the behaviour of very large systems with many
clients. In this work, we consider a multi-agent load balancing system, with
delayed information, consisting of many clients (load balancers) and many
parallel queues. In order to obtain a tractable solution, we model this system
as a mean-field control problem with enlarged state-action space in discrete
time through exact discretization. Subsequently, we apply policy gradient
reinforcement learning algorithms to find an optimal load balancing solution.
Here, the discrete-time system model incorporates a synchronization delay under
which the queue state information is synchronously broadcasted and updated at
all clients. We then provide theoretical performance guarantees for our
methodology in large systems. Finally, our experiments demonstrate that our
approach is not only scalable but also performs well compared to the
state-of-the-art power-of-d variant of the Join-the-Shortest-Queue (JSQ) policy
and other policies in the presence of synchronization delays.
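To make the modelled setting concrete, the sketch below simulates many parallel queues served by many clients under the synchronization delay described above: true queue lengths are only broadcast to the clients every `sync_delay` steps, and each client routes with the power-of-d variant of JSQ on that stale observation. This is a minimal illustration, not the authors' implementation; all parameters (`sync_delay`, `arrival_prob`, `service_prob`, `buffer_size`, `d`) and the discrete-time dynamics are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(num_queues=100, num_clients=100, horizon=2000,
             arrival_prob=0.85, service_prob=0.95, buffer_size=20,
             sync_delay=5, d=2):
    """Power-of-d JSQ routing under synchronously delayed queue information.

    Each step, every client receives a job with probability `arrival_prob`
    and forwards it to the shorter of `d` randomly sampled queues -- judged
    by the last *broadcast* queue lengths, not the true current ones.
    Returns the time-averaged mean queue length (a proxy for waiting time).
    """
    queues = np.zeros(num_queues, dtype=int)   # true queue lengths
    observed = queues.copy()                   # stale, synchronously broadcast copy
    total = 0.0
    for t in range(horizon):
        if t % sync_delay == 0:                # synchronized broadcast to all clients
            observed = queues.copy()
        for _ in range(num_clients):           # each client routes on stale info
            if rng.random() < arrival_prob:
                candidates = rng.choice(num_queues, size=d, replace=False)
                target = candidates[np.argmin(observed[candidates])]
                if queues[target] < buffer_size:   # finite buffer: drop if full
                    queues[target] += 1
        # each busy queue finishes one job with probability `service_prob`
        departures = rng.random(num_queues) < service_prob
        queues = np.maximum(queues - departures, 0)
        total += queues.mean()
    return total / horizon

if __name__ == "__main__":
    for delay in (1, 5, 20):
        print(f"sync_delay={delay:3d}: avg queue length {simulate(sync_delay=delay):.2f}")
```

Replacing the fixed JSQ(d) rule with a routing distribution learned by policy gradient over the (stale) observed queue-length distribution is, roughly, the decision problem that the mean-field control formulation above makes tractable for large numbers of clients and queues.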
Related papers
- Digital Twin-Assisted Federated Learning with Blockchain in Multi-tier Computing Systems [67.14406100332671]
In Industry 4.0 systems, resource-constrained edge devices engage in frequent data interactions.
This paper proposes a digital twin (DT) and federated learning (FL) scheme.
The efficacy of our proposed cooperative interference-based FL process has been verified through numerical analysis.
arXiv Detail & Related papers (2024-11-04T17:48:02Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Queuing dynamics of asynchronous Federated Learning [15.26212962081762]
We study asynchronous federated learning mechanisms with nodes having potentially different computational speeds.
We propose a non-uniform sampling scheme for the central server that allows for lower delays with better complexity.
Our experiments clearly show a significant improvement of our method over current state-of-the-art asynchronous algorithms on an image classification problem.
arXiv Detail & Related papers (2024-02-12T18:32:35Z)
- Learning Distributed and Fair Policies for Network Load Balancing as Markov Potential Game [4.892398873024191]
This paper investigates the network load balancing problem in data centers (DCs) where multiple load balancers (LBs) are deployed.
The challenges of this problem consist of the heterogeneous processing architecture and dynamic environments.
We formulate the multi-agent load balancing problem as a Markov potential game, with a carefully designed workload distribution fairness measure as the potential function (see the illustrative fairness sketch after this list).
A fully distributed MARL algorithm is proposed to approximate the Nash equilibrium of the game.
arXiv Detail & Related papers (2022-06-03T08:29:02Z)
- Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Blockchain-enabled Server-less Federated Learning [5.065631761462706]
We focus on an asynchronous server-less Federated Learning solution empowered by blockchain (BC) technology.
In contrast to mostly adopted FL approaches, we advocate an asynchronous method whereby model aggregation is done as clients submit their local updates.
arXiv Detail & Related papers (2021-12-15T07:41:23Z)
- Scheduling in Parallel Finite Buffer Systems: Optimal Decisions under Delayed Feedback [29.177402567437206]
We present a partially observable (PO) model that captures the scheduling decisions in parallel queuing systems under limited information of delayed acknowledgements.
We numerically show that the resulting policy outperforms other limited information scheduling strategies.
We show how our approach can optimise real-time parallel processing using network data provided by Kaggle.
arXiv Detail & Related papers (2021-09-17T13:45:02Z)
- BAGUA: Scaling up Distributed Learning with System Relaxations [31.500494636704598]
BAGUA is a new communication framework for distributed data-parallel training.
Powered by its new system design, BAGUA can implement and extend various state-of-the-art distributed learning algorithms.
In a production cluster with up to 16 machines, BAGUA can outperform PyTorch-DDP, Horovod and BytePS in the end-to-end training time.
arXiv Detail & Related papers (2021-07-03T21:27:45Z)
- Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z)
- Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z)
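As a pointer for the Markov potential game entry above ("Learning Distributed and Fair Policies for Network Load Balancing"), one concrete way to picture "workload distribution fairness as the potential function" is to score the servers' workloads with a fairness index and reward every agent by the change in that shared score. The sketch below is an illustrative assumption, not code from the cited paper; the choice of Jain's fairness index and the helper names `jain_fairness` / `potential_reward` are hypothetical.

```python
import numpy as np

def jain_fairness(workloads) -> float:
    """Jain's fairness index: 1.0 when all servers carry equal load,
    close to 1/n when a single server carries everything."""
    w = np.asarray(workloads, dtype=float)
    if not w.any():
        return 1.0                      # no load at all is trivially balanced
    return w.sum() ** 2 / (len(w) * (w ** 2).sum())

def potential_reward(workloads_before, workloads_after) -> float:
    """In a potential game, the reward change an agent sees for a unilateral
    action equals the change in the shared potential function."""
    return jain_fairness(workloads_after) - jain_fairness(workloads_before)

if __name__ == "__main__":
    before = np.array([10, 0, 0, 0])    # one server holds all the work
    after = np.array([4, 3, 2, 1])      # agents spread the load
    print(jain_fairness(before))        # 0.25
    print(jain_fairness(after))         # ~0.83
    print(potential_reward(before, after))  # positive: fairness improved
```

Because all agents share this single potential, independent policy updates that improve each agent's own reward also (approximately) ascend the common fairness objective, which is part of what makes the game's Nash equilibria amenable to fully distributed MARL.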
This list is automatically generated from the titles and abstracts of the papers on this site.