Optimization of Topology-Aware Job Allocation on a High-Performance
Computing Cluster by Neural Simulated Annealing
- URL: http://arxiv.org/abs/2302.03517v1
- Date: Mon, 6 Feb 2023 03:13:03 GMT
- Authors: Zekang Lan, Yan Xu, Yingkun Huang, Dian Huang, Shengzhong Feng
- Abstract summary: The topology-aware job allocation problem (TJAP) decides how to dedicate nodes to specific applications.
In this paper, we study the window-based TJAP on a fat-tree network, aiming to minimize the cost of communication hops.
Two special allocation strategies are considered: the static continuity assignment strategy (SCAS) and the dynamic continuity assignment strategy (DCAS).
- Score: 4.215562786525106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jobs on high-performance computing (HPC) clusters can suffer significant
performance degradation due to inter-job network interference. The topology-aware
job allocation problem (TJAP) is the problem of deciding how to dedicate
nodes to specific applications so as to mitigate inter-job network interference. In
this paper, we study the window-based TJAP on a fat-tree network, aiming to
minimize the cost of communication hops, a defined inter-job interference
metric. The window-based scheduling approach periodically takes
the jobs in the queue and solves an assignment problem that maps jobs to the
available nodes. Two special allocation strategies are considered: the static
continuity assignment strategy (SCAS) and the dynamic continuity assignment
strategy (DCAS). For the SCAS, a 0-1 integer programming model is developed. For the
DCAS, an approach called neural simulated annealing (NSA) is proposed; it is an
extension of simulated annealing (SA) that learns a repair operator and employs
it in a guided heuristic search. The efficacy of NSA is
demonstrated in a computational study against SA and SCIP. The results of
numerical experiments indicate that both the model and the algorithm proposed in
this paper are effective.
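As a rough illustration of the kind of search the abstract describes, the sketch below runs plain simulated annealing over job-to-node assignments and scores them with a hop-based cost. It is not the paper's NSA: there is no learned repair operator, and the three-level fat-tree layout, the pairwise hop cost, and the names `hop_count`, `allocation_cost`, and `anneal` are all assumptions made for this example.

```python
import math
import random


def hop_count(a, b, nodes_per_leaf=4, leaves_per_pod=2):
    """Hops between two nodes in a simplified three-level fat-tree (assumed layout)."""
    if a == b:
        return 0
    if a // nodes_per_leaf == b // nodes_per_leaf:
        return 2  # both nodes hang off the same leaf switch
    pod_size = nodes_per_leaf * leaves_per_pod
    if a // pod_size == b // pod_size:
        return 4  # same pod, routed through an aggregation switch
    return 6  # different pods, routed through a core switch


def allocation_cost(alloc):
    """Sum of pairwise hop counts inside each job (a stand-in for the paper's metric)."""
    total = 0
    for nodes in alloc.values():
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                total += hop_count(nodes[i], nodes[j])
    return total


def decode(order, job_sizes):
    """Give each job the next consecutive slice of the node ordering."""
    alloc, start = {}, 0
    for job, size in job_sizes.items():
        alloc[job] = order[start:start + size]
        start += size
    return alloc


def anneal(job_sizes, num_nodes, steps=2000, t0=5.0, alpha=0.995, seed=0):
    """Plain simulated annealing with swap moves (hypothetical parameters)."""
    if sum(job_sizes.values()) > num_nodes:
        raise ValueError("jobs request more nodes than are available")
    rng = random.Random(seed)
    order = list(range(num_nodes))
    rng.shuffle(order)
    cur = allocation_cost(decode(order, job_sizes))
    best, best_order = cur, order[:]
    t = t0
    for _ in range(steps):
        # Swapping two positions keeps every node assigned at most once,
        # so no repair step is needed in this simplified setting.
        i, j = rng.sample(range(num_nodes), 2)
        order[i], order[j] = order[j], order[i]
        new = allocation_cost(decode(order, job_sizes))
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best:
                best, best_order = cur, order[:]
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
        t *= alpha  # geometric cooling
    return decode(best_order, job_sizes), best


if __name__ == "__main__":
    allocation, cost = anneal({"A": 4, "B": 4, "C": 2}, num_nodes=16)
    print(allocation, cost)
```

In a window-based scheduler, a routine like `anneal` would be called once per window on the jobs currently in the queue and the nodes currently free. The swap neighborhood here always yields a feasible assignment; a move set that can produce infeasible states is where a repair operator, learned or hand-written, would come in.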
Related papers
- MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion
Control in Real Networks [63.24965775030673]
We propose a novel Reinforcement Learning (RL) approach to design generic Congestion Control (CC) algorithms.
Our solution, MARLIN, uses the Soft Actor-Critic algorithm to maximize both entropy and return.
We trained MARLIN on a real network with varying background traffic patterns to overcome the sim-to-real mismatch.
(2023-02-02T18:27:20Z)
- Scheduling Inference Workloads on Distributed Edge Clusters with
Reinforcement Learning [11.007816552466952]
This paper focuses on the problem of scheduling inference queries on Deep Neural Networks in edge networks at short timescales.
By means of simulations, we analyze several policies in the realistic network settings and workloads of a large ISP.
We design ASET, a Reinforcement Learning based scheduling algorithm able to adapt its decisions according to the system conditions.
(2023-01-31T13:23:34Z)
- A Comprehensively Improved Hybrid Algorithm for Learning Bayesian
Networks: Multiple Compound Memory Erasing [0.0]
This paper presents a new hybrid algorithm, MCME (multiple compound memory erasing).
MCME retains the advantages of the first two methods, solves the shortcomings of the above CI tests, and makes innovations in the scoring function in the direction discrimination stage.
A large number of experiments show that MCME performs better than or similarly to some existing algorithms.
(2022-12-05T12:52:07Z)
- Task-Oriented Sensing, Computation, and Communication Integration for
Multi-Device Edge AI [108.08079323459822]
This paper studies a new multi-device edge artificial intelligence (AI) system, which jointly exploits AI model split inference and integrated sensing and communication (ISAC).
We measure the inference accuracy by adopting an approximate but tractable metric, namely discriminant gain.
(2022-07-03T06:57:07Z)
- Learning-based Measurement Scheduling for Loosely-Coupled Cooperative
Localization [3.616948583169635]
In cooperative localization, communicating mobile agents use inter-agent relative measurements to improve their dead-reckoning-based global localization.
Measurement scheduling enables an agent to decide which subset of available inter-agent relative measurements it should process when its computational resources are limited.
This paper proposes a measurement scheduling method for CL that follows the sequential greedy (SG) approach but reduces the communication and computation cost by using a neural network-based surrogate model as a proxy for the SG's merit function.
(2021-12-06T08:06:29Z)
- An actor-critic algorithm with policy gradients to solve the job shop
scheduling problem using deep double recurrent agents [1.3812010983144802]
We propose a deep reinforcement learning methodology for the job shop scheduling problem (JSSP).
The aim is to build a greedy-like heuristic able to learn on some distribution of JSSP instances that differ in the number of jobs and machines.
As expected, the model can generalize, to some extent, to larger problems or instances originated by a different distribution from the one used in training.
(2021-10-18T07:55:39Z)
- COPS: Controlled Pruning Before Training Starts [68.8204255655161]
State-of-the-art deep neural network (DNN) pruning techniques, applied one-shot before training starts, evaluate sparse architectures with the help of a single criterion -- called pruning score.
In this work we do not concentrate on a single pruning criterion, but provide a framework for combining arbitrary generalized synaptic scores (GSSs) to create more powerful pruning strategies.
(2021-07-27T08:48:01Z)
- Waypoint Planning Networks [66.72790309889432]
We propose a hybrid algorithm based on LSTMs with a local kernel - a classic algorithm such as A*, and a global kernel using a learned algorithm.
We compare WPN against A*, as well as related works including motion planning networks (MPNet) and value iteration networks (VIN).
It is shown that WPN's search space is considerably smaller than that of A*, while WPN is still able to generate near-optimal results.
(2021-05-01T18:02:01Z)
- Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments.
It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
(2021-02-12T09:33:00Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our algorithm requires far fewer communication rounds in theory.
Experiments on several datasets show the effectiveness of our algorithm and also confirm our theory.
(2020-05-05T18:08:23Z)
- Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays
in Distributed SGD [32.03967072200476]
We propose an algorithmic approach named Overlap Local-SGD (and its momentum variant).
We achieve this by adding an anchor model on each node.
After multiple local updates, locally trained models will be pulled back towards the anchor model rather than communicating with others.
(2020-02-21T20:33:49Z)