Related papers: Optimization of Topology-Aware Job Allocation on a High-Performance Computing Cluster by Neural Simulated Annealing

URL: http://arxiv.org/abs/2302.03517v1
Date: Mon, 6 Feb 2023 03:13:03 GMT
Title: Optimization of Topology-Aware Job Allocation on a High-Performance Computing Cluster by Neural Simulated Annealing
Authors: Zekang Lan, Yan Xu, Yingkun Huang, Dian Huang, Shengzhong Feng
Abstract summary: The topology-aware job allocation problem (TJAP) decides how to dedicate nodes to specific applications. In this paper, we study the window-based TJAP on a fat-tree network, aiming to minimize the cost of communication hop. Two special allocation strategies are considered: the static continuity assignment strategy (SCAS) and the dynamic continuity assignment strategy (DCAS).
Score: 4.215562786525106
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Jobs on high-performance computing (HPC) clusters can suffer significant performance degradation due to inter-job network interference. The topology-aware job allocation problem (TJAP) decides how to dedicate nodes to specific applications so as to mitigate this interference. In this paper, we study the window-based TJAP on a fat-tree network, aiming to minimize the cost of communication hop, a defined inter-job interference metric. The window-based approach to scheduling periodically takes the jobs in the queue and solves an assignment problem that maps them to the available nodes. Two special allocation strategies are considered: the static continuity assignment strategy (SCAS) and the dynamic continuity assignment strategy (DCAS). For the SCAS, a 0-1 integer programming model is developed. For the DCAS, we propose an approach called neural simulated annealing (NSA), an extension of simulated annealing (SA) that learns a repair operator and employs it in a guided heuristic search. The efficacy of NSA is demonstrated in a computational study against SA and SCIP. The results of numerical experiments indicate that both the model and the algorithm proposed in this paper are effective.
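To make the search baseline in the abstract concrete, here is a minimal sketch of plain simulated annealing applied to a toy allocation problem. It is not the paper's NSA: there is no learned repair operator, no scheduling window and no continuity constraint. The two-level fat-tree hop model (2 hops under the same leaf switch, 4 hops otherwise), the pairwise-hop cost, and the names hops, cost and anneal are all assumptions made for illustration, standing in for the paper's own interference metric.

```python
import math
import random
from itertools import combinations


def hops(a, b, nodes_per_leaf):
    """Hop count between nodes a and b on an assumed two-level fat-tree."""
    if a == b:
        return 0
    # Same leaf switch: up and down once. Otherwise go through a spine switch.
    return 2 if a // nodes_per_leaf == b // nodes_per_leaf else 4


def cost(alloc, nodes_per_leaf):
    """Sum of pairwise hop counts inside each job.

    alloc[node] is the job id placed on that node, or -1 if the node is free.
    This is a simplified stand-in for the paper's communication-hop metric.
    """
    by_job = {}
    for node, job in enumerate(alloc):
        if job >= 0:
            by_job.setdefault(job, []).append(node)
    return sum(
        hops(a, b, nodes_per_leaf)
        for nodes in by_job.values()
        for a, b in combinations(nodes, 2)
    )


def anneal(alloc, nodes_per_leaf, steps=5000, t0=5.0, alpha=0.999, seed=0):
    """Plain simulated annealing with a swap move and geometric cooling."""
    rng = random.Random(seed)
    cur = list(alloc)
    cur_cost = cost(cur, nodes_per_leaf)
    best, best_cost = list(cur), cur_cost
    t = t0
    for _ in range(steps):
        # Swapping two nodes keeps every job's node count unchanged.
        i, j = rng.sample(range(len(cur)), 2)
        cur[i], cur[j] = cur[j], cur[i]
        new_cost = cost(cur, nodes_per_leaf)
        delta = new_cost - cur_cost
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = list(cur), cur_cost
        else:
            cur[i], cur[j] = cur[j], cur[i]  # reject: undo the swap
        t *= alpha
    return best, best_cost


if __name__ == "__main__":
    # Two 4-node jobs interleaved across two leaf switches of 4 nodes each.
    start = [0, 1, 0, 1, 0, 1, 0, 1]
    print("initial cost:", cost(start, 4))
    print("annealed:", anneal(start, 4))
```

The swap move is chosen because it can never change how many nodes a job holds, so every candidate stays feasible without a repair step. A learned repair operator, as described for NSA, would presumably matter once moves can break constraints such as continuity.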

Related papers

Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters [24.845122459974466]
This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm. By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models, A-SRPT strategically assigns jobs to the available GPU. A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive "shortest-remaining-processing-time-first" strategy.
arXiv Detail & Related papers (2025-01-09T20:19:01Z)
Broad Critic Deep Actor Reinforcement Learning for Continuous Control [5.440090782797941]
A novel hybrid architecture for actor-critic reinforcement learning (RL) algorithms is introduced. The proposed architecture integrates the broad learning system (BLS) with deep neural networks (DNNs). The effectiveness of the proposed algorithm is evaluated by applying it to two classic continuous control tasks.
arXiv Detail & Related papers (2024-11-24T12:24:46Z)
MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion Control in Real Networks [63.24965775030673]
We propose a novel Reinforcement Learning (RL) approach to design generic Congestion Control (CC) algorithms. Our solution, MARLIN, uses the Soft Actor-Critic algorithm to maximize both entropy and return. We trained MARLIN on a real network with varying background traffic patterns to overcome the sim-to-real mismatch.
arXiv Detail & Related papers (2023-02-02T18:27:20Z)
Scheduling Inference Workloads on Distributed Edge Clusters with Reinforcement Learning [11.007816552466952]
This paper focuses on the problem of scheduling inference queries on Deep Neural Networks in edge networks at short timescales. By means of simulations, we analyze several policies in the realistic network settings and workloads of a large ISP. We design ASET, a Reinforcement Learning based scheduling algorithm able to adapt its decisions according to the system conditions.
arXiv Detail & Related papers (2023-01-31T13:23:34Z)
A Comprehensively Improved Hybrid Algorithm for Learning Bayesian Networks: Multiple Compound Memory Erasing [0.0]
This paper presents a new hybrid algorithm, MCME (multiple compound memory erasing). MCME retains the advantages of the first two methods, solves the shortcomings of the above CI tests, and makes innovations in the scoring function in the direction discrimination stage. A large number of experiments show that MCME performs better than or similarly to some existing algorithms.
arXiv Detail & Related papers (2022-12-05T12:52:07Z)
Task-Oriented Sensing, Computation, and Communication Integration for Multi-Device Edge AI [108.08079323459822]
This paper studies a new multi-device edge artificial-intelligence (AI) system, which jointly exploits AI model split inference and integrated sensing and communication (ISAC). We measure the inference accuracy by adopting an approximate but tractable metric, namely discriminant gain.
arXiv Detail & Related papers (2022-07-03T06:57:07Z)
Learning-based Measurement Scheduling for Loosely-Coupled Cooperative Localization [3.616948583169635]
In cooperative localization, communicating mobile agents use inter-agent relative measurements to improve their dead-reckoning-based global localization. Measurement scheduling enables an agent to decide which subset of available inter-agent relative measurements it should process when its computational resources are limited. This paper proposes a measurement scheduling for CL that follows the sequential computation approach but reduces the communication and cost by using a neural network-based surrogate model as a proxy for the SG's merit function.
arXiv Detail & Related papers (2021-12-06T08:06:29Z)
COPS: Controlled Pruning Before Training Starts [68.8204255655161]
State-of-the-art deep neural network (DNN) pruning techniques, applied one-shot before training starts, evaluate sparse architectures with the help of a single criterion -- called pruning score. In this work we do not concentrate on a single pruning criterion, but provide a framework for combining arbitrary GSSs to create more powerful pruning strategies.
arXiv Detail & Related papers (2021-07-27T08:48:01Z)
Waypoint Planning Networks [66.72790309889432]
We propose a hybrid algorithm based on LSTMs with a local kernel (a classic algorithm such as A*) and a global kernel using a learned algorithm. We compare WPN against A*, as well as related works including motion planning networks (MPNet) and value iteration networks (VIN). It is shown that WPN's search space is considerably smaller than that of A*, while it is still able to generate near-optimal results.
arXiv Detail & Related papers (2021-05-01T18:02:01Z)
Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments. It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z)
Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network. Our algorithm requires far fewer communication rounds. Our experiments on several datasets show the effectiveness of our algorithm and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD [32.03967072200476]
We propose an algorithmic approach named Overlap Local-SGD (and its momentum variant). We achieve this by adding an anchor model on each node. After multiple local updates, locally trained models will be pulled back towards the anchor model rather than communicating with others.
arXiv Detail & Related papers (2020-02-21T20:33:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.