Nara: Learning Network-Aware Resource Allocation Algorithms for Cloud Data Centres
- URL: http://arxiv.org/abs/2106.02412v1
- Date: Fri, 4 Jun 2021 10:56:49 GMT
- Title: Nara: Learning Network-Aware Resource Allocation Algorithms for Cloud Data Centres
- Authors: Zacharaya Shabka, Georgios Zervas
- Abstract summary: Nara is a framework based on reinforcement learning and graph neural networks (GNN) that learns network-aware allocation policies.
It can accept up to 33% more requests than the best baseline when deployed on DCNs with up to the order of $10\times$ more compute nodes than the DCN seen during training.
It maintains its policy's performance on DCNs with the order of $100\times$ more servers than seen during training.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data centres (DCs) underpin many prominent future technological trends such as distributed training of large-scale machine learning models and internet-of-things based platforms. DCs will soon account for over 3% of global energy demand, so efficient use of DC resources is essential. Robust DC networks (DCNs) are essential to form the large-scale systems needed to handle this demand, but they can bottleneck how efficiently DC-server resources are used when servers with insufficient connectivity between them cannot be jointly allocated to a job. However, allocating servers' resources whilst accounting for their inter-connectivity maps to an NP-hard combinatorial optimisation problem, and so is often ignored in DC resource management schemes. We present Nara, a framework based on reinforcement learning (RL) and graph neural networks (GNN) that learns network-aware allocation policies which increase the number of requests allocated over time compared to previous methods. Unique to our solution is the use of a GNN to generate representations of server-nodes in the DCN, which are then interpreted as actions by an RL policy network that selects the servers from which resources are allocated to incoming requests. Nara is agnostic to topology size and shape and is trained end-to-end. The method can accept up to 33% more requests than the best baseline when deployed on DCNs with up to the order of $10\times$ more compute nodes than the DCN seen during training, and it maintains its policy's performance on DCNs with the order of $100\times$ more servers than seen during training. It also generalises to unseen DCN topologies with varied network structure and unseen request distributions without re-training.
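To make the architecture concrete, the following is a minimal, self-contained sketch of the idea described above; it is not the authors' code, and all layer sizes, node features, and the scoring head are illustrative assumptions:

```python
# Illustrative Nara-style allocator: a message-passing GNN embeds each server
# node, and a policy head maps the embeddings to a probability distribution
# over servers for the next allocation. Shapes and features are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def gnn_embed(features, adj, w_in, w_self, w_neigh, rounds=2):
    """Mean-aggregation message passing; the weights are shared across nodes,
    which is what lets one policy apply to topologies of any size."""
    h = relu(features @ w_in)                 # project raw features to hidden dim
    deg = adj.sum(axis=1, keepdims=True) + 1e-9
    for _ in range(rounds):
        neigh = (adj @ h) / deg               # average neighbour embeddings
        h = relu(h @ w_self + neigh @ w_neigh)
    return h                                  # one embedding per server node

def policy_probs(h, w_score, free_cpu, request_cpu):
    """Score every server, mask servers that cannot fit the request, softmax."""
    scores = (h @ w_score).ravel()
    scores[free_cpu < request_cpu] = -np.inf  # infeasible servers get zero mass
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

# Toy DCN: 6 servers on a ring; 4 features per server (e.g. free CPU/RAM/links).
n, d, hid = 6, 4, 8
adj = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
feats = rng.random((n, d))
w_in = rng.normal(size=(d, hid))
w_self, w_neigh = rng.normal(size=(hid, hid)), rng.normal(size=(hid, hid))
w_score = rng.normal(size=(hid, 1))

free_cpu = np.array([4, 1, 8, 2, 6, 3])
h = gnn_embed(feats, adj, w_in, w_self, w_neigh)
p = policy_probs(h, w_score, free_cpu, request_cpu=3)
server = rng.choice(n, p=p)                   # the RL action: pick a server
print("allocation probabilities:", np.round(p, 3), "-> chose server", server)
```

Because the message-passing weights are shared across nodes, the same parameters apply unchanged to much larger DCNs, which is the property behind the $10\times$ and $100\times$ scale-up results quoted above.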
Related papers
- Joint Admission Control and Resource Allocation of Virtual Network Embedding via Hierarchical Deep Reinforcement Learning [69.00997996453842]
We propose HRL-ACRA, a hierarchical deep reinforcement learning approach that learns a joint admission control and resource allocation policy for virtual network embedding; a toy sketch of the two-level idea follows this entry.
We show that HRL-ACRA outperforms state-of-the-art baselines in terms of both acceptance ratio and long-term average revenue.
arXiv Detail & Related papers (2024-06-25T07:42:30Z)
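A hedged sketch of the two-level split named above, not the paper's implementation: an upper-level policy gates admission, and only admitted requests reach a lower-level allocation policy. All weights and features are illustrative.

```python
# Two-level decision: an admission policy gates requests; an allocation policy
# places the admitted ones. Both levels are stubbed with hand-set parameters.
import math

def admit(request_revenue, load):
    """Upper level: admit when expected revenue outweighs current load."""
    score = 1.5 * request_revenue - 2.0 * load           # illustrative weights
    return 1.0 / (1.0 + math.exp(-score)) > 0.5

def allocate(free_capacity, demand):
    """Lower level: pick the feasible substrate node with the most headroom."""
    feasible = [i for i, c in enumerate(free_capacity) if c >= demand]
    return max(feasible, key=lambda i: free_capacity[i]) if feasible else None

free = [4, 7, 2]
for revenue, demand in [(1.0, 3), (0.2, 1), (0.9, 6)]:
    load = 1.0 - sum(free) / 13.0                        # fraction of capacity used
    node = allocate(free, demand) if admit(revenue, load) else None
    if node is None:
        print(f"rejected: revenue {revenue}, demand {demand}")
    else:
        free[node] -= demand
        print(f"admitted: demand {demand} -> node {node}, free now {free}")
```

In a trained version of this idea, both `admit` and `allocate` would be learned policies updated from the long-term revenue signal rather than fixed rules.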
- Task-Oriented Edge Networks: Decentralized Learning Over Wireless Fronthaul [13.150679121986792]
This paper studies task-oriented edge networks where multiple edge internet-of-things nodes execute machine learning tasks with the help of powerful deep neural networks (DNNs) at a network cloud.
arXiv Detail & Related papers (2023-12-03T05:24:28Z)
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload on both edge devices and data centres.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for sparse multi-DNN scheduling; a toy priority sketch follows this entry.
Our proposed approach outperforms the state-of-the-art methods with up to a 10% decrease in latency constraint violation rate and nearly a 4X reduction in average normalized turnaround time.
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
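An illustrative sparsity-aware priority rule in the spirit of the entry above (not Dysta's actual algorithm; the latency model and job fields are assumptions): blend statically profiled sparsity with run-time sparsity to predict remaining latency, then serve the job with the least slack to its deadline.

```python
# Toy sparsity-aware scheduler: predicted remaining latency shrinks with the
# fraction of zero weights/activations; jobs are served least-slack-first.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    base_latency_ms: float   # dense-case latency from static profiling
    static_sparsity: float   # profiled weight sparsity (0..1)
    dynamic_sparsity: float  # activation sparsity observed at run time (0..1)
    deadline_ms: float
    elapsed_ms: float

def predicted_remaining(job: Job) -> float:
    # Blend static and dynamic sparsity; sparser jobs finish sooner.
    sparsity = 0.5 * job.static_sparsity + 0.5 * job.dynamic_sparsity
    return max(job.base_latency_ms * (1.0 - sparsity) - job.elapsed_ms, 0.0)

def pick_next(jobs: list[Job]) -> Job:
    # Least slack first: run the job most at risk of violating its deadline.
    return min(jobs, key=lambda j: j.deadline_ms - j.elapsed_ms - predicted_remaining(j))

jobs = [
    Job("detector", 30.0, 0.6, 0.7, deadline_ms=25.0, elapsed_ms=5.0),
    Job("asr",      20.0, 0.3, 0.2, deadline_ms=40.0, elapsed_ms=2.0),
]
print("run next:", pick_next(jobs).name)
```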
- Network Aware Compute and Memory Allocation in Optically Composable Data Centres with Deep Reinforcement Learning and Graph Neural Networks [0.0]
Resource-disaggregated data centre architectures promise a means of pooling resources remotely within data centres.
We show how this can be done using an optically switched circuit backbone in the data centre network (DCN).
We show how deep reinforcement learning can be used to learn effective network-aware and topologically-scalable allocation policies end-to-end.
arXiv Detail & Related papers (2022-10-26T09:46:50Z)
- Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs [64.26714148634228]
Congestion control (CC) algorithms are becoming extremely difficult to design.
It is currently not possible to deploy AI models on network devices due to their limited computational capabilities.
We build a computationally-light solution based on a recent reinforcement learning CC algorithm; a toy sketch of such a lightweight policy follows this entry.
arXiv Detail & Related papers (2022-07-05T20:42:24Z)
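Purely for illustration (not the paper's deployed solution): one way a learned CC policy can become cheap enough for NIC hardware is to approximate it with a rule that maps a few congestion signals to a rate adjustment in a handful of multiply-adds. All coefficients and signals below are assumptions.

```python
# Toy distilled congestion-control policy: a trained RL agent's behaviour is
# approximated by a tiny linear rule over two signals, so each decision costs
# only a few arithmetic operations.
def rate_update(rate_gbps: float, rtt_ratio: float, ecn_fraction: float) -> float:
    """rtt_ratio = current RTT / base RTT; ecn_fraction = share of marked packets."""
    # Linear "policy": back off when RTT inflates or ECN marks appear.
    action = 0.05 - 0.04 * (rtt_ratio - 1.0) - 0.5 * ecn_fraction
    action = max(-0.5, min(0.5, action))       # clip the multiplicative change
    return max(0.1, rate_gbps * (1.0 + action))

rate = 10.0
for rtt_ratio, ecn in [(1.0, 0.0), (1.4, 0.1), (2.0, 0.3)]:
    rate = rate_update(rate, rtt_ratio, ecn)
    print(f"rtt_ratio={rtt_ratio:.1f} ecn={ecn:.2f} -> rate={rate:.2f} Gbps")
```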
- HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments [37.55572042288321]
The training process of deep neural networks (DNNs) generally handles large-scale input data with many sparse features.
Paddle-HeterPS is composed of a distributed architecture and a reinforcement learning (RL)-based scheduling method; a toy sketch of such a scheduler follows this entry.
We show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller).
arXiv Detail & Related papers (2021-11-20T17:09:15Z)
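A sketch of RL-based scheduling in a heterogeneous cluster under stated assumptions (the profile numbers, reward weights, and bandit-style update are illustrative, not Paddle-HeterPS internals):

```python
# Toy RL scheduler: for each DNN stage, epsilon-greedy learning picks a device
# type (CPU vs GPU); the reward trades off run time against monetary cost.
import random

random.seed(0)
stages = ["embedding", "mlp"]                  # sparse part vs dense part
devices = ["cpu", "gpu"]
# (time per batch, cost per batch) for each (stage, device) -- made-up numbers:
profile = {("embedding", "cpu"): (1.0, 0.2), ("embedding", "gpu"): (0.9, 1.0),
           ("mlp", "cpu"): (4.0, 0.8),        ("mlp", "gpu"): (0.5, 1.0)}
q = {(s, d): 0.0 for s in stages for d in devices}

for episode in range(500):
    for s in stages:
        explore = random.random() < 0.1
        d = random.choice(devices) if explore else max(devices, key=lambda x: q[(s, x)])
        t, c = profile[(s, d)]
        reward = -(t + 0.5 * c)                  # weight latency against cost
        q[(s, d)] += 0.1 * (reward - q[(s, d)])  # bandit-style value update

plan = {s: max(devices, key=lambda x: q[(s, x)]) for s in stages}
print("learned placement:", plan)              # expect embedding->cpu, mlp->gpu
```

The learned placement keeps the sparse embedding stage on cheap CPUs and the dense stage on GPUs, which is the kind of throughput/cost trade-off the paper optimises.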
- Learn Locally, Correct Globally: A Distributed Algorithm for Training Graph Neural Networks [22.728439336309858]
We propose a communication-efficient distributed GNN training technique named Learn Locally, Correct Globally (LLCG); a minimal sketch of the periodic-averaging step follows this entry.
LLCG trains a GNN on its local data by ignoring the dependency between nodes on different machines, then sends the locally trained model to the server for periodic model averaging.
We rigorously analyze the convergence of distributed methods with periodic model averaging for training GNNs and show that naively applying periodic model averaging while ignoring the dependency between nodes suffers from an irreducible residual error.
arXiv Detail & Related papers (2021-11-16T03:07:01Z)
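A minimal sketch of the local-training-plus-periodic-averaging loop described above (the server-side global correction that gives LLCG its name is omitted; the data, loss, and schedule are assumptions):

```python
# Each worker runs local-only gradient steps on its own data slice, ignoring
# cross-worker dependencies; the server periodically averages the models.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, local_steps, rounds, lr = 4, 5, 10, 20, 0.1
w_true = rng.normal(size=dim)

data = []
for _ in range(n_workers):
    X = rng.normal(size=(50, dim))             # this worker's local data slice
    data.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

workers = [np.zeros(dim) for _ in range(n_workers)]
for r in range(rounds):
    for k, (X, y) in enumerate(data):
        w = workers[k]
        for _ in range(local_steps):           # local-only gradient steps
            grad = X.T @ (X @ w - y) / len(y)
            w = w - lr * grad
        workers[k] = w
    avg = np.mean(workers, axis=0)             # periodic model averaging
    workers = [avg.copy() for _ in range(n_workers)]

print("error after averaging rounds:", np.linalg.norm(avg - w_true))
```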
- BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes [0.8201100713224002]
FCFS-based supercomputer scheduling policies result in many transient idle nodes.
We show how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training.
arXiv Detail & Related papers (2021-06-22T22:53:19Z)
- Resource Allocation via Graph Neural Networks in Free Space Optical Fronthaul Networks [119.81868223344173]
This paper investigates optimal resource allocation in free space optical (FSO) fronthaul networks.
We consider a graph neural network (GNN) for the policy parameterization to exploit the FSO network structure.
A primal-dual learning algorithm is developed to train the GNN in a model-free manner, where knowledge of the system models is not required; a worked sketch of the primal-dual update follows this entry.
arXiv Detail & Related papers (2020-06-26T14:20:48Z)
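A worked toy of model-free primal-dual training under stated assumptions (a scalar power-control policy, a log-utility objective, and an average-power constraint; none of this is the paper's exact setup):

```python
# Primal-dual loop: ascend the Lagrangian in the policy parameter (primal) and
# in the constraint multiplier (dual), using only sampled values -- no model.
import numpy as np

rng = np.random.default_rng(0)
theta, lam = 0.5, 0.0                   # primal (policy) and dual variables
p_max, lr_p, lr_d = 1.0, 0.05, 0.05

for step in range(2000):
    h = rng.exponential(1.0)            # sampled channel state (model-free)
    power = np.clip(theta * h, 0.0, 2.0)

    def lagrangian(th):
        # L = utility - lam * (power - p_max), evaluated on this sample.
        p = np.clip(th * h, 0.0, 2.0)
        return np.log1p(h * p) - lam * (p - p_max)

    # Two-point finite difference: a zeroth-order gradient estimate, so no
    # analytic system model or gradient is ever required.
    g = (lagrangian(theta + 1e-3) - lagrangian(theta - 1e-3)) / 2e-3
    theta += lr_p * g                                # primal ascent
    lam = max(0.0, lam + lr_d * (power - p_max))     # dual ascent

print(f"theta={theta:.3f}, lambda={lam:.3f}")
```

The dual variable rises while the power constraint is violated and relaxes otherwise, steering the primal updates toward feasible policies without an explicit channel model.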
- Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC.
To address the open problems that remain, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backpropagation that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)