Nara: Learning Network-Aware Resource Allocation Algorithms for Cloud Data Centres
- URL: http://arxiv.org/abs/2106.02412v1
- Date: Fri, 4 Jun 2021 10:56:49 GMT
- Title: Nara: Learning Network-Aware Resource Allocation Algorithms for Cloud Data Centres
- Authors: Zacharaya Shabka, Georgios Zervas
- Abstract summary: Nara is a framework based on reinforcement learning and graph neural networks (GNN) that learns network-aware allocation policies.
It can accept up to 33% more requests than the best baseline when deployed on DCNs with up to the order of $10\times$ more compute nodes than the DCN seen during training.
It maintains its policy's performance on DCNs with the order of $100\times$ more servers than seen during training.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data centres (DCs) underpin many prominent future technological trends such as distributed training of large-scale machine learning models and internet-of-things based platforms. DCs will soon account for over 3% of global energy demand, so efficient use of DC resources is essential. Robust DC networks (DCNs) are essential to form the large-scale systems needed to handle this demand, but they can bottleneck how efficiently DC-server resources are used when servers with insufficient connectivity between them cannot be jointly allocated to a job. However, allocating servers' resources whilst accounting for their inter-connectivity maps to an NP-hard combinatorial optimisation problem, and so is often ignored in DC resource management schemes. We present Nara, a framework based on reinforcement learning (RL) and graph neural networks (GNN) that learns network-aware allocation policies which increase the number of requests allocated over time compared to previous methods. Unique to our solution is the use of a GNN to generate representations of server-nodes in the DCN, which are then interpreted as actions by an RL policy network that selects the servers from which resources are allocated to incoming requests. Nara is agnostic to topology size and shape and is trained end-to-end. The method can accept up to 33% more requests than the best baseline when deployed on DCNs with up to the order of $10\times$ more compute nodes than the DCN seen during training, and it maintains its policy's performance on DCNs with the order of $100\times$ more servers than seen during training. It also generalises to unseen DCN topologies with varied network structure and unseen request distributions without re-training.
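To make the architecture concrete, the following is a minimal, self-contained sketch of the idea described above; it is not the authors' code, and all layer sizes, node features, and the scoring head are illustrative assumptions:

```python
# Illustrative Nara-style allocator: a message-passing GNN embeds each server
# node, and a policy head maps the embeddings to a probability distribution
# over servers for the next allocation. Shapes and features are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def gnn_embed(features, adj, w_in, w_self, w_neigh, rounds=2):
    """Mean-aggregation message passing; the weights are shared across nodes,
    which is what lets one policy apply to topologies of any size."""
    h = relu(features @ w_in)                 # project raw features to hidden dim
    deg = adj.sum(axis=1, keepdims=True) + 1e-9
    for _ in range(rounds):
        neigh = (adj @ h) / deg               # average neighbour embeddings
        h = relu(h @ w_self + neigh @ w_neigh)
    return h                                  # one embedding per server node

def policy_probs(h, w_score, free_cpu, request_cpu):
    """Score every server, mask servers that cannot fit the request, softmax."""
    scores = (h @ w_score).ravel()
    scores[free_cpu < request_cpu] = -np.inf  # infeasible servers get zero mass
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

# Toy DCN: 6 servers on a ring; 4 features per server (e.g. free CPU/RAM/links).
n, d, hid = 6, 4, 8
adj = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
feats = rng.random((n, d))
w_in = rng.normal(size=(d, hid))
w_self, w_neigh = rng.normal(size=(hid, hid)), rng.normal(size=(hid, hid))
w_score = rng.normal(size=(hid, 1))

free_cpu = np.array([4, 1, 8, 2, 6, 3])
h = gnn_embed(feats, adj, w_in, w_self, w_neigh)
p = policy_probs(h, w_score, free_cpu, request_cpu=3)
server = rng.choice(n, p=p)                   # the RL action: pick a server
print("allocation probabilities:", np.round(p, 3), "-> chose server", server)
```

Because the message-passing weights are shared across nodes, the same parameters apply unchanged to much larger DCNs, which is the property behind the $10\times$ and $100\times$ scale-up results quoted above.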
Related papers
- Joint Admission Control and Resource Allocation of Virtual Network Embedding via Hierarchical Deep Reinforcement Learning [69.00997996453842]
We propose HRL-ACRA, a hierarchical deep reinforcement learning approach that learns a joint admission control and resource allocation policy for virtual network embedding; a toy sketch of the two-level idea follows this entry.
We show that HRL-ACRA outperforms state-of-the-art baselines in terms of both acceptance ratio and long-term average revenue.
arXiv Detail & Related papers (2024-06-25T07:42:30Z)
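A hedged sketch of the two-level split named above, not the paper's implementation: an upper-level policy gates admission, and only admitted requests reach a lower-level allocation policy. All weights and features are illustrative.

```python
# Two-level decision: an admission policy gates requests; an allocation policy
# places the admitted ones. Both levels are stubbed with hand-set parameters.
import math

def admit(request_revenue, load):
    """Upper level: admit when expected revenue outweighs current load."""
    score = 1.5 * request_revenue - 2.0 * load           # illustrative weights
    return 1.0 / (1.0 + math.exp(-score)) > 0.5

def allocate(free_capacity, demand):
    """Lower level: pick the feasible substrate node with the most headroom."""
    feasible = [i for i, c in enumerate(free_capacity) if c >= demand]
    return max(feasible, key=lambda i: free_capacity[i]) if feasible else None

free = [4, 7, 2]
for revenue, demand in [(1.0, 3), (0.2, 1), (0.9, 6)]:
    load = 1.0 - sum(free) / 13.0                        # fraction of capacity used
    node = allocate(free, demand) if admit(revenue, load) else None
    if node is None:
        print(f"rejected: revenue {revenue}, demand {demand}")
    else:
        free[node] -= demand
        print(f"admitted: demand {demand} -> node {node}, free now {free}")
```

In a trained version of this idea, both `admit` and `allocate` would be learned policies updated from the long-term revenue signal rather than fixed rules.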
- Task-Oriented Edge Networks: Decentralized Learning Over Wireless Fronthaul [13.150679121986792]
This paper studies task-oriented edge networks where multiple edge internet-of-things nodes execute machine learning tasks with the help of powerful deep neural networks (DNNs) at a network cloud.
arXiv Detail & Related papers (2023-12-03T05:24:28Z)
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload on both edge devices and data centres.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for sparse multi-DNN scheduling; a toy priority sketch follows this entry.
Our proposed approach outperforms the state-of-the-art methods with up to a 10% decrease in latency constraint violation rate and nearly a 4X reduction in average normalized turnaround time.
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
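An illustrative sparsity-aware priority rule in the spirit of the entry above (not Dysta's actual algorithm; the latency model and job fields are assumptions): blend statically profiled sparsity with run-time sparsity to predict remaining latency, then serve the job with the least slack to its deadline.

```python
# Toy sparsity-aware scheduler: predicted remaining latency shrinks with the
# fraction of zero weights/activations; jobs are served least-slack-first.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    base_latency_ms: float   # dense-case latency from static profiling
    static_sparsity: float   # profiled weight sparsity (0..1)
    dynamic_sparsity: float  # activation sparsity observed at run time (0..1)
    deadline_ms: float
    elapsed_ms: float

def predicted_remaining(job: Job) -> float:
    # Blend static and dynamic sparsity; sparser jobs finish sooner.
    sparsity = 0.5 * job.static_sparsity + 0.5 * job.dynamic_sparsity
    return max(job.base_latency_ms * (1.0 - sparsity) - job.elapsed_ms, 0.0)

def pick_next(jobs: list[Job]) -> Job:
    # Least slack first: run the job most at risk of violating its deadline.
    return min(jobs, key=lambda j: j.deadline_ms - j.elapsed_ms - predicted_remaining(j))

jobs = [
    Job("detector", 30.0, 0.6, 0.7, deadline_ms=25.0, elapsed_ms=5.0),
    Job("asr",      20.0, 0.3, 0.2, deadline_ms=40.0, elapsed_ms=2.0),
]
print("run next:", pick_next(jobs).name)
```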
- Network Aware Compute and Memory Allocation in Optically Composable Data Centres with Deep Reinforcement Learning and Graph Neural Networks [0.0]
Resource-disaggregated data centre architectures promise a means of pooling resources remotely within data centres.
We show how this can be done using an optically switched circuit backbone in the data centre network (DCN).
We show how deep reinforcement learning can be used to learn effective network-aware and topologically-scalable allocation policies end-to-end.
arXiv Detail & Related papers (2022-10-26T09:46:50Z)
- Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs [64.26714148634228]
Congestion control (CC) algorithms are becoming extremely difficult to design.
It is currently not possible to deploy AI models on network devices due to their limited computational capabilities.
We build a computationally-light solution based on a recent reinforcement learning CC algorithm; a toy sketch of such a lightweight policy follows this entry.
arXiv Detail & Related papers (2022-07-05T20:42:24Z)
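Purely for illustration (not the paper's deployed solution): one way a learned CC policy can become cheap enough for NIC hardware is to approximate it with a rule that maps a few congestion signals to a rate adjustment in a handful of multiply-adds. All coefficients and signals below are assumptions.

```python
# Toy distilled congestion-control policy: a trained RL agent's behaviour is
# approximated by a tiny linear rule over two signals, so each decision costs
# only a few arithmetic operations.
def rate_update(rate_gbps: float, rtt_ratio: float, ecn_fraction: float) -> float:
    """rtt_ratio = current RTT / base RTT; ecn_fraction = share of marked packets."""
    # Linear "policy": back off when RTT inflates or ECN marks appear.
    action = 0.05 - 0.04 * (rtt_ratio - 1.0) - 0.5 * ecn_fraction
    action = max(-0.5, min(0.5, action))       # clip the multiplicative change
    return max(0.1, rate_gbps * (1.0 + action))

rate = 10.0
for rtt_ratio, ecn in [(1.0, 0.0), (1.4, 0.1), (2.0, 0.3)]:
    rate = rate_update(rate, rtt_ratio, ecn)
    print(f"rtt_ratio={rtt_ratio:.1f} ecn={ecn:.2f} -> rate={rate:.2f} Gbps")
```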
- HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments [37.55572042288321]
The training process of deep neural networks (DNNs) generally handles large-scale input data with many sparse features.
Paddle-HeterPS is composed of a distributed architecture and a reinforcement learning (RL)-based scheduling method; a toy sketch of such a scheduler follows this entry.
We show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller).
arXiv Detail & Related papers (2021-11-20T17:09:15Z)
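A sketch of RL-based scheduling in a heterogeneous cluster under stated assumptions (the profile numbers, reward weights, and bandit-style update are illustrative, not Paddle-HeterPS internals):

```python
# Toy RL scheduler: for each DNN stage, epsilon-greedy learning picks a device
# type (CPU vs GPU); the reward trades off run time against monetary cost.
import random

random.seed(0)
stages = ["embedding", "mlp"]                  # sparse part vs dense part
devices = ["cpu", "gpu"]
# (time per batch, cost per batch) for each (stage, device) -- made-up numbers:
profile = {("embedding", "cpu"): (1.0, 0.2), ("embedding", "gpu"): (0.9, 1.0),
           ("mlp", "cpu"): (4.0, 0.8),        ("mlp", "gpu"): (0.5, 1.0)}
q = {(s, d): 0.0 for s in stages for d in devices}

for episode in range(500):
    for s in stages:
        explore = random.random() < 0.1
        d = random.choice(devices) if explore else max(devices, key=lambda x: q[(s, x)])
        t, c = profile[(s, d)]
        reward = -(t + 0.5 * c)                  # weight latency against cost
        q[(s, d)] += 0.1 * (reward - q[(s, d)])  # bandit-style value update

plan = {s: max(devices, key=lambda x: q[(s, x)]) for s in stages}
print("learned placement:", plan)              # expect embedding->cpu, mlp->gpu
```

The learned placement keeps the sparse embedding stage on cheap CPUs and the dense stage on GPUs, which is the kind of throughput/cost trade-off the paper optimises.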
- Learn Locally, Correct Globally: A Distributed Algorithm for Training Graph Neural Networks [22.728439336309858]
We propose a communication-efficient distributed GNN training technique named Learn Locally, Correct Globally (LLCG); a minimal sketch of the periodic-averaging step follows this entry.
LLCG trains a GNN on its local data by ignoring the dependency between nodes on different machines, then sends the locally trained model to the server for periodic model averaging.
We rigorously analyze the convergence of distributed methods with periodic model averaging for training GNNs and show that naively applying periodic model averaging while ignoring the dependency between nodes suffers from an irreducible residual error.
arXiv Detail & Related papers (2021-11-16T03:07:01Z)
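A minimal sketch of the local-training-plus-periodic-averaging loop described above (the server-side global correction that gives LLCG its name is omitted; the data, loss, and schedule are assumptions):

```python
# Each worker runs local-only gradient steps on its own data slice, ignoring
# cross-worker dependencies; the server periodically averages the models.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, local_steps, rounds, lr = 4, 5, 10, 20, 0.1
w_true = rng.normal(size=dim)

data = []
for _ in range(n_workers):
    X = rng.normal(size=(50, dim))             # this worker's local data slice
    data.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

workers = [np.zeros(dim) for _ in range(n_workers)]
for r in range(rounds):
    for k, (X, y) in enumerate(data):
        w = workers[k]
        for _ in range(local_steps):           # local-only gradient steps
            grad = X.T @ (X @ w - y) / len(y)
            w = w - lr * grad
        workers[k] = w
    avg = np.mean(workers, axis=0)             # periodic model averaging
    workers = [avg.copy() for _ in range(n_workers)]

print("error after averaging rounds:", np.linalg.norm(avg - w_true))
```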
- BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes [0.8201100713224002]
FCFS-based supercomputer scheduling policies result in many transient idle nodes.
We show how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training.
arXiv Detail & Related papers (2021-06-22T22:53:19Z)
- Resource Allocation via Graph Neural Networks in Free Space Optical Fronthaul Networks [119.81868223344173]
This paper investigates optimal resource allocation in free space optical (FSO) fronthaul networks.
We consider a graph neural network (GNN) for the policy parameterization to exploit the FSO network structure.
A primal-dual learning algorithm is developed to train the GNN in a model-free manner, where knowledge of the system models is not required; a worked sketch of the primal-dual update follows this entry.
arXiv Detail & Related papers (2020-06-26T14:20:48Z)
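A worked toy of model-free primal-dual training under stated assumptions (a scalar power-control policy, a log-utility objective, and an average-power constraint; none of this is the paper's exact setup):

```python
# Primal-dual loop: ascend the Lagrangian in the policy parameter (primal) and
# in the constraint multiplier (dual), using only sampled values -- no model.
import numpy as np

rng = np.random.default_rng(0)
theta, lam = 0.5, 0.0                   # primal (policy) and dual variables
p_max, lr_p, lr_d = 1.0, 0.05, 0.05

for step in range(2000):
    h = rng.exponential(1.0)            # sampled channel state (model-free)
    power = np.clip(theta * h, 0.0, 2.0)

    def lagrangian(th):
        # L = utility - lam * (power - p_max), evaluated on this sample.
        p = np.clip(th * h, 0.0, 2.0)
        return np.log1p(h * p) - lam * (p - p_max)

    # Two-point finite difference: a zeroth-order gradient estimate, so no
    # analytic system model or gradient is ever required.
    g = (lagrangian(theta + 1e-3) - lagrangian(theta - 1e-3)) / 2e-3
    theta += lr_p * g                                # primal ascent
    lam = max(0.0, lam + lr_d * (power - p_max))     # dual ascent

print(f"theta={theta:.3f}, lambda={lam:.3f}")
```

The dual variable rises while the power constraint is violated and relaxes otherwise, steering the primal updates toward feasible policies without an explicit channel model.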
- Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC.
To address the open problems that remain, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backpropagation that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)