Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks
- URL: http://arxiv.org/abs/2301.13799v1
- Date: Tue, 31 Jan 2023 17:41:07 GMT
- Title: Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks
- Authors: Christopher W. F. Parsonson, Zacharaya Shabka, Alessandro Ottino, and
Georgios Zervas
- Abstract summary: Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
- Score: 58.720142291102135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: From natural language processing to genome sequencing, large-scale machine
learning models are bringing advances to a broad range of fields. Many of these
models are too large to be trained on a single machine, and instead must be
distributed across multiple devices. This has motivated the research of new
compute and network systems capable of handling such tasks. In particular,
recent work has focused on developing management schemes which decide how to
allocate distributed resources such that some overall objective, such as
minimising the job completion time (JCT), is optimised. However, such studies
omit explicit consideration of how much a job should be distributed, usually
assuming that maximum distribution is desirable. In this work, we show that
maximum parallelisation is sub-optimal in relation to user-critical metrics
such as throughput and blocking rate. To address this, we propose PAC-ML
(partitioning for asynchronous computing with machine learning). PAC-ML
leverages a graph neural network and reinforcement learning to learn how much
to partition computation graphs such that the number of jobs which meet
arbitrary user-defined JCT requirements is maximised. In experiments with five
real deep learning computation graphs on a recently proposed optical
architecture across four user-defined JCT requirement distributions, we
demonstrate PAC-ML achieving up to 56.2% lower blocking rates in dynamic job
arrival settings than the canonical maximum parallelisation strategy used by
most prior works.
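To make the approach concrete, the sketch below pairs a toy message-passing encoder over a job's computation graph with a softmax policy over candidate partitioning degrees. It is a minimal illustration under invented graph features, weights, and partition choices, not the authors' PAC-ML implementation.

```python
# Minimal sketch (not the authors' code): a message-passing encoder over a job's
# computation graph feeds a policy that scores each candidate partitioning degree.
# Graph features, weights, and the partition choices below are all invented.
import numpy as np

rng = np.random.default_rng(0)

def gnn_encode(adj, feats, w1, w2):
    """Two rounds of mean-neighbour message passing, then mean pooling."""
    deg = adj.sum(axis=1, keepdims=True) + 1e-9
    h = np.tanh((adj @ feats) / deg @ w1)
    h = np.tanh((adj @ h) / deg @ w2)
    return h.mean(axis=0)                         # graph-level embedding

# Toy computation graph: 4 ops, node features = [FLOPs, memory] in arbitrary units.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = np.array([[4., 1.], [8., 2.], [2., 1.], [6., 3.]])

choices = [1, 2, 4, 8]                            # candidate numbers of workers
w1, w2 = rng.normal(size=(2, 16)), rng.normal(size=(16, 16))
w_pi = rng.normal(size=(16, len(choices)))        # policy head

logits = gnn_encode(adj, feats, w1, w2) @ w_pi
probs = np.exp(logits - logits.max()); probs /= probs.sum()
degree = choices[rng.choice(len(choices), p=probs)]
print("sampled partitioning degree:", degree)
# In training, the sampled degree would be rewarded according to whether the
# resulting job completion time meets the user-defined requirement, and the
# encoder/policy weights updated with a standard RL algorithm.
```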
Related papers
- Multi-Task Learning as enabler for General-Purpose AI-native RAN [1.4295558450631414]
This study explores the effectiveness of multi-task learning (MTL) approaches in facilitating a general-purpose AI-native Radio Access Network (RAN).
The investigation focuses on four RAN tasks: (i) secondary carrier prediction, (ii) user location prediction, (iii) indoor link classification, and (iv) line-of-sight link classification.
We validate the performance using realistic simulations considering multi-faceted design aspects of MTL including model architecture, loss and gradient balancing strategies, distributed learning topology, data sparsity and task groupings.
arXiv Detail & Related papers (2024-04-05T21:12:25Z)
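As a hedged illustration of the multi-task setup summarised above, the toy sketch below shares one encoder across the four RAN tasks and attaches a small head per task; the layer sizes, weights, and input are invented.

```python
# Toy multi-task model: one shared trunk, one small head per RAN task.
# Dimensions, weights, and the input vector are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
tasks = ["carrier_pred", "location_pred", "indoor_cls", "los_cls"]

shared_w = rng.normal(size=(10, 32))                    # shared representation
heads = {t: rng.normal(size=(32, 2)) for t in tasks}    # task-specific heads

def forward(x, task):
    h = np.maximum(x @ shared_w, 0.0)                   # shared features (ReLU)
    return h @ heads[task]                              # task output

x = rng.normal(size=(1, 10))                            # one radio measurement vector
outputs = {t: forward(x, t) for t in tasks}
print({t: o.shape for t, o in outputs.items()})
# Training would combine (and balance) the per-task losses so the shared trunk
# learns features that serve all four tasks at once.
```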
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
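The allocation idea in the FPGA entry above can be illustrated in a few lines: given a hand-chosen pipeline assignment of layers to boards, each layer receives a share of its board's resources proportional to its compute cost. The costs, pipeline mapping, and DSP budget are invented.

```python
# Toy proportional resource allocation for a three-FPGA NN pipeline.
# Layer costs, the pipeline mapping, and the DSP budget are made up.
import numpy as np

layer_cost = np.array([5., 20., 60., 10., 5.])     # relative compute per NN layer
pipeline = {0: [0, 1], 1: [2], 2: [3, 4]}          # FPGA id -> layers it hosts
dsp_budget = 1000                                  # DSP slices per FPGA

for fpga, layers in pipeline.items():
    costs = layer_cost[layers]
    shares = costs / costs.sum()                   # heavier layers get more resources
    for layer, share in zip(layers, shares):
        print(f"FPGA {fpga}: layer {layer} -> {int(share * dsp_budget)} DSPs")
```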
- Towards Optimal VPU Compiler Cost Modeling by using Neural Networks to Infer Hardware Performances [58.720142291102135]
'VPUNN' is a neural network-based cost model trained on low-level task profiling.
It consistently outperforms the state-of-the-art cost modeling in Intel's line of VPU processors.
arXiv Detail & Related papers (2022-05-09T22:48:39Z)
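To show the flavour of a learned cost model, the sketch below fits a tiny two-layer network on synthetic (operator descriptor, cycle count) pairs with plain gradient descent. VPUNN itself is trained on real low-level profiles; none of the numbers here are from it.

```python
# Toy learned cost model: a small MLP regressing cycle counts from op descriptors.
# All data is synthetic; this is not VPUNN or its training setup.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(1, 8, size=(256, 3))               # descriptor: kernel, channels, size
y = (X[:, 0] * X[:, 1] * X[:, 2]).reshape(-1, 1)   # "profiled" cycles (synthetic)

Xn = (X - X.mean(0)) / X.std(0)                    # standardise for stable training
yn = (y - y.mean()) / y.std()

W1, b1 = rng.normal(0, 0.3, (3, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 0.3, (16, 1)), np.zeros(1)
lr = 0.05

for _ in range(3000):                              # full-batch gradient descent
    h = np.maximum(Xn @ W1 + b1, 0.0)              # ReLU hidden layer
    err = (h @ W2 + b2) - yn                       # prediction error
    gW2, gb2 = h.T @ err / len(Xn), err.mean(0)
    dh = (err @ W2.T) * (h > 0)
    gW1, gb1 = Xn.T @ dh / len(Xn), dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

q = (np.array([[3., 4., 5.]]) - X.mean(0)) / X.std(0)
pred = (np.maximum(q @ W1 + b1, 0.0) @ W2 + b2) * y.std() + y.mean()
print("predicted cycles for descriptor (3, 4, 5):", round(pred.item(), 1))
# A compiler would query such a model instead of hand-written cost tables.
```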
- A Transferable Approach for Partitioning Machine Learning Models on Multi-Chip-Modules [8.224904698490626]
Multi-Chip-Modules (MCMs) reduce the design and fabrication cost of machine learning accelerators.
We present a strategy using a deep reinforcement learning framework to emit a possibly invalid candidate partition that is then corrected by a constraint solver.
Our evaluation of a production-scale model, BERT, on real hardware reveals that the partitioning generated using the RL policy achieves 6.11% and 5.85% higher throughput.
arXiv Detail & Related papers (2021-12-07T23:40:28Z)
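The propose-then-repair pattern described above can be sketched as follows: a candidate op-to-chiplet assignment (random here, standing in for the RL policy's output) may violate per-chiplet memory limits, and a greedy routine repairs it. A real constraint solver would guarantee feasibility; all sizes are invented.

```python
# Propose-then-repair sketch: random "policy" proposal + greedy feasibility repair.
# Memory sizes, chiplet count, and capacity are invented for the example.
import numpy as np

rng = np.random.default_rng(3)
op_mem = np.array([4., 3., 6., 2., 5., 1.])                 # memory needed per op
n_chiplets, capacity = 3, 8.0

candidate = rng.integers(0, n_chiplets, size=len(op_mem))   # stands in for RL output

def repair(assign):
    """Greedy heuristic: move heavy ops off over-capacity chiplets."""
    assign = assign.copy()
    used = np.array([op_mem[assign == c].sum() for c in range(n_chiplets)])
    for op in np.argsort(-op_mem):                          # heaviest ops first
        src = assign[op]
        if used[src] > capacity:
            dst = int(np.argmin(used))                      # least-loaded chiplet
            used[src] -= op_mem[op]; used[dst] += op_mem[op]
            assign[op] = dst
    return assign

print("proposed:", candidate, "repaired:", repair(candidate))
```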
- DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution [15.086401550425125]
DistIR is a representation for distributed computation that is tailored for efficient analyses.
We show how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations.
arXiv Detail & Related papers (2021-11-09T21:32:51Z)
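A minimal analogue of the simulator-driven search mentioned above: enumerate (data-parallel, pipeline-parallel) configurations and rank them with a crude analytical cost function. The cost model and its constants are invented, not DistIR's.

```python
# Toy grid search over distribution configurations ranked by a made-up simulator.
import itertools

N_GPUS = 8
COMPUTE, COMM_PER_DP, BUBBLE_PER_STAGE = 100.0, 4.0, 3.0   # arbitrary units

def simulate(dp, pp):
    """Rough per-step time: compute shrinks with parallelism, overheads grow."""
    return COMPUTE / (dp * pp) + COMM_PER_DP * (dp - 1) + BUBBLE_PER_STAGE * (pp - 1)

configs = [(dp, pp) for dp, pp in itertools.product([1, 2, 4, 8], repeat=2)
           if dp * pp <= N_GPUS]
best = min(configs, key=lambda c: simulate(*c))
print("best (data-parallel, pipeline) config:", best,
      "estimated step time:", round(simulate(*best), 2))
```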
- Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and Personalized Federated Learning [56.17603785248675]
Model-agnostic meta-learning (MAML) has become a popular research area.
Existing MAML algorithms rely on the 'episode' idea by sampling a few tasks and data points to update the meta-model at each iteration.
This paper proposes memory-based algorithms for MAML that converge with vanishing error.
arXiv Detail & Related papers (2021-06-09T08:47:58Z)
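To make the 'episode' idea referenced above concrete, here is a toy first-order MAML step on scalar quadratic tasks: sample a few tasks, adapt with one inner gradient step, and update the meta-parameter from the adapted gradients. The paper's memory-based variants change how this sampling is done and are not reproduced here.

```python
# Toy episodic (first-order) MAML on scalar quadratic tasks.
import numpy as np

rng = np.random.default_rng(4)
theta = 0.0                           # meta-parameter
inner_lr, outer_lr = 0.1, 0.05

def loss_grad(w, target):             # gradient of task loss (w - target)^2
    return 2.0 * (w - target)

for _ in range(200):
    targets = rng.normal(0.0, 1.0, size=5)                  # an "episode" of 5 tasks
    meta_grad = 0.0
    for t in targets:
        adapted = theta - inner_lr * loss_grad(theta, t)    # inner adaptation step
        meta_grad += loss_grad(adapted, t)                  # first-order outer gradient
    theta -= outer_lr * meta_grad / len(targets)

print("meta-parameter after training:", round(theta, 3))    # drifts toward 0, the task mean
```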
- Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as Diffusion-based MAML (Dif-MAML).
We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML objective.
Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)
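The diffusion (adapt-then-combine) mechanism behind Dif-MAML can be isolated in a toy form: each agent takes a local gradient step on its own quadratic objective, then averages its parameter with its neighbours' over a ring. The meta-learning inner loop is dropped to keep the sketch short; the topology, weights, and step size are invented.

```python
# Toy adapt-then-combine (diffusion) iteration over a ring of four agents.
import numpy as np

rng = np.random.default_rng(5)
# Doubly stochastic combination matrix for a ring topology.
A = np.array([[.50, .25, .00, .25],
              [.25, .50, .25, .00],
              [.00, .25, .50, .25],
              [.25, .00, .25, .50]])
targets = rng.normal(size=4)          # each agent's local objective: (theta - target)^2
theta = rng.normal(size=4)            # one parameter per agent

for _ in range(300):
    psi = theta - 0.05 * 2.0 * (theta - targets)   # adapt: local gradient step
    theta = A @ psi                                # combine: average with neighbours

print("agent parameters:", np.round(theta, 3))
print("aggregate minimiser (mean target):", round(targets.mean(), 3))
# Agents end up close to one another and to the aggregate minimiser,
# up to a small step-size-dependent bias.
```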
- Deep Generative Models that Solve PDEs: Distributed Computing for Training Large Data-Free Models [25.33147292369218]
Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs).
Here we report on a software framework for data parallel distributed deep learning that resolves the twin challenges of training these large SciML models.
Our framework provides several out-of-the-box functionalities, including (a) loss integrity independent of the number of processes, (b) synchronized batch normalization, and (c) distributed higher-order optimization methods.
arXiv Detail & Related papers (2020-07-24T22:42:35Z)
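The "loss integrity independent of the number of processes" property can be checked numerically in a toy setting: averaging per-worker gradients over equal-sized shards reproduces the single-process full-batch gradient. The model here is plain linear regression, not the paper's SciML framework.

```python
# Numerical check: averaged per-shard gradients == full-batch gradient.
import numpy as np

rng = np.random.default_rng(6)
X, y = rng.normal(size=(64, 3)), rng.normal(size=64)
w = rng.normal(size=3)

def grad(Xb, yb, w):                  # gradient of mean squared error
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                                   # one process, full batch
shards = np.split(np.arange(64), 4)                    # four equal "workers"
averaged = np.mean([grad(X[s], y[s], w) for s in shards], axis=0)

print("max difference:", np.abs(full - averaged).max())   # ~0 (round-off only)
```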
- Efficient Algorithms for Device Placement of DNN Graph Operators [12.871398348743591]
Modern machine learning workloads use large models with complex structures that are very expensive to execute.
The devices that execute these models are becoming increasingly heterogeneous, with a flourishing of domain-specific accelerators offered alongside CPUs.
Recent work has shown that significant gains can be obtained with model parallelism, i.e., partitioning a neural network's computational graph onto multiple devices.
In this paper, we identify and isolate the structured optimization problem at the core of device placement of DNN operators, for both inference and training, especially in modern pipelined settings.
arXiv Detail & Related papers (2020-06-29T22:45:01Z)
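A scaled-down instance of the structured optimization referred to above: place a chain of operators onto devices as contiguous pipeline stages so that the bottleneck stage time is minimised, solved here by dynamic programming. Operator costs and the device count are invented.

```python
# Contiguous pipeline partitioning of an operator chain, minimising the bottleneck
# stage time with dynamic programming. Costs and device count are toy values.
import numpy as np

op_time = [3., 7., 2., 8., 4., 6.]
n_devices = 3
prefix = np.concatenate([[0.0], np.cumsum(op_time)])

def stage_cost(i, j):                 # total time of ops i..j-1 on one device
    return prefix[j] - prefix[i]

n, INF = len(op_time), float("inf")
# dp[d][j] = best bottleneck when the first j ops use exactly d devices
dp = [[INF] * (n + 1) for _ in range(n_devices + 1)]
dp[0][0] = 0.0
for d in range(1, n_devices + 1):
    for j in range(1, n + 1):
        for i in range(j):            # last device runs ops i..j-1
            dp[d][j] = min(dp[d][j], max(dp[d - 1][i], stage_cost(i, j)))

print("optimal pipeline bottleneck time:", dp[n_devices][n])   # 10.0 for this chain
```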
- Fitting the Search Space of Weight-sharing NAS with Graph Convolutional Networks [100.14670789581811]
We train a graph convolutional network to fit the performance of sampled sub-networks.
With this strategy, we achieve a higher rank correlation coefficient in the selected set of candidates.
arXiv Detail & Related papers (2020-04-17T19:12:39Z)
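A structural sketch of the predictor idea above: a small graph convolution turns each sampled sub-network's encoding into a scalar score, and candidates are compared to measured accuracy by rank correlation. The weights are untrained and the accuracies random, so the printed correlation is meaningless; only the plumbing is shown.

```python
# Untrained GCN scorer + Spearman rank correlation against stand-in accuracies.
import numpy as np

rng = np.random.default_rng(7)

def gcn_score(adj, feats, w):
    deg = adj.sum(axis=1, keepdims=True) + 1.0          # degree incl. self-loop
    h = np.tanh((adj + np.eye(len(adj))) @ feats / deg @ w)
    return h.mean()                                     # scalar predicted quality

def spearman(a, b):
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

w = rng.normal(size=(4, 8))
subnets = [((rng.random((5, 5)) < 0.4).astype(float),   # random adjacency
            rng.random((5, 4)))                         # random op encodings
           for _ in range(10)]

true_acc = rng.random(10)                               # stand-in for measured accuracy
pred = np.array([gcn_score(a, f, w) for a, f in subnets])
print("rank correlation:", round(float(spearman(pred, true_acc)), 3))
```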