Related papers: Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers

Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers

URL: http://arxiv.org/abs/2312.06546v1
Date: Mon, 11 Dec 2023 17:31:46 GMT
Title: Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers
Authors: Mohamed S. Halawa and Rebeca P. D\'iaz-Redondo and Ana Fern\'andez-Vilas
Abstract summary: Key Performance Indicators (KPIs) generate a huge number of monitoring tasks that give data about CPU usage, memory usage, network traffic, or other sensors that monitor hardware. The main contribution in this paper is to identify which metric/s (KPIs) is/are the most appropriate to identify/classify different types of jobs according to their behavior in the HPC system. We have concluded that (i. those metrics (KPIs) related to the Network (interface) traffic monitoring provide the best cohesion and separation to cluster HPC jobs, and (ii. hierarchical clustering algorithms are the most suitable for this task
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Performance analysis is an essential task in High-Performance Computing (HPC) systems and it is applied for different purposes such as anomaly detection, optimal resource allocation, and budget planning. HPC monitoring tasks generate a huge number of Key Performance Indicators (KPIs) to supervise the status of the jobs running in these systems. KPIs give data about CPU usage, memory usage, network (interface) traffic, or other sensors that monitor the hardware. Analyzing this data, it is possible to obtain insightful information about running jobs, such as their characteristics, performance, and failures. The main contribution in this paper is to identify which metric/s (KPIs) is/are the most appropriate to identify/classify different types of jobs according to their behavior in the HPC system. With this aim, we have applied different clustering techniques (partition and hierarchical clustering algorithms) using a real dataset from the Galician Computation Center (CESGA). We have concluded that (i) those metrics (KPIs) related to the Network (interface) traffic monitoring provide the best cohesion and separation to cluster HPC jobs, and (ii) hierarchical clustering algorithms are the most suitable for this task. Our approach was validated using a different real dataset from the same HPC center.

Related papers

Harnessing Data from Clustered LQR Systems: Personalized and Collaborative Policy Optimization [4.770545528276722]
We show how clustering can be used in data-driven control to learn personalized policies.<n>Our work is the first to reveal how clustering can be used in data-driven control to learn personalized policies.
arXiv Detail & Related papers (2025-11-21T18:45:53Z)
What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets [0.0]
Supervised machine learning techniques rely on labeled data to achieve high task performance.<n>This paper evaluates the structure of benign traffic in several common intrusion detection datasets.
arXiv Detail & Related papers (2025-09-11T15:55:21Z)
ClusterRCA: Network Failure Diagnosis in HPC Systems Using Multimodal Data [10.100878764617747]
This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data.<n>To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches.<n> Experiments on datasets collected by a top-tier global HPC device vendor show ClusterRCA achieves high accuracy in diagnosing network failure for HPC systems.
arXiv Detail & Related papers (2025-06-17T16:52:09Z)
Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN.<n>We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z)
A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data. We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
KPIs-Based Clustering and Visualization of HPC jobs: a Feature Reduction Approach [0.0]
HPC systems need to be constantly monitored to ensure their stability. The monitoring systems collect a tremendous amount of data about different parameters or Key Performance Indicators (KPIs), such as resource usage, IO waiting time, etc. A proper analysis of this data, usually stored as time series, can provide insight in choosing the right management strategies as well as the early detection of issues.
arXiv Detail & Related papers (2023-12-11T17:13:54Z)
PolicyClusterGCN: Identifying Efficient Clusters for Training Graph Convolutional Networks [23.437482392702627]
Graph convolutional networks (GCNs) have achieved huge success in several machine learning (ML) tasks on graph-structured data. We propose PolicyClusterGCN, an online RL framework that can identify good clusters for GCN training. We develop a novel Markov Decision Process (MDP) formulation that allows the policy network to predict importance" weights on the edges.
arXiv Detail & Related papers (2023-06-25T22:17:25Z)
ClusterNet: A Perception-Based Clustering Model for Scattered Data [16.326062082938215]
Cluster separation in scatterplots is a task that is typically tackled by widely used clustering techniques. We propose a learning strategy which directly operates on scattered data. We train ClusterNet, a point-based deep learning model, trained to reflect human perception of cluster separability.
arXiv Detail & Related papers (2023-04-27T13:41:12Z)
Task-Oriented Over-the-Air Computation for Multi-Device Edge AI [57.50247872182593]
6G networks for supporting edge AI features task-oriented techniques that focus on effective and efficient execution of AI task. Task-oriented over-the-air computation (AirComp) scheme is proposed in this paper for multi-device split-inference system.
arXiv Detail & Related papers (2022-11-02T16:35:14Z)
Task-Oriented Sensing, Computation, and Communication Integration for Multi-Device Edge AI [108.08079323459822]
This paper studies a new multi-intelligent edge artificial-latency (AI) system, which jointly exploits the AI model split inference and integrated sensing and communication (ISAC) We measure the inference accuracy by adopting an approximate but tractable metric, namely discriminant gain.
arXiv Detail & Related papers (2022-07-03T06:57:07Z)
Random projections and Kernelised Leave One Cluster Out Cross-Validation: Universal baselines and evaluation tools for supervised machine learning for materials properties [10.962094053749093]
leave one cluster out cross validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials. We present a thorough comparison between composition-based representations, and investigate how kernel approximation functions can be used to enhance LOCO-CV applications. We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception.
arXiv Detail & Related papers (2022-06-17T15:39:39Z)
Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning [83.66080019570461]
We propose two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty. We show that these metrics have higher correlations with normalized task solvability scores than a variety of alternatives. These metrics can also be used for fast and compute-efficient optimizations of key design parameters.
arXiv Detail & Related papers (2021-03-23T17:49:50Z)
Cross-Gradient Aggregation for Decentralized Learning from Non-IID data [34.23789472226752]
Decentralized learning enables a group of collaborative agents to learn models using a distributed dataset without the need for a central parameter server. We propose Cross-Gradient Aggregation (CGA), a novel decentralized learning algorithm. We show superior learning performance of CGA over existing state-of-the-art decentralized learning algorithms.
arXiv Detail & Related papers (2021-03-02T21:58:12Z)
DAC: Deep Autoencoder-based Clustering, a General Deep Learning Framework of Representation Learning [0.0]
We propose DAC, Deep Autoencoder-based Clustering, a data-driven framework to learn clustering representations using deep neuron networks. Experiment results show that our approach could effectively boost performance of the KMeans clustering algorithm on a variety of datasets.
arXiv Detail & Related papers (2021-02-15T11:31:00Z)
Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [119.88565565454378]
Unsupervised domain adaptation (UDA) is to learn classification models that make predictions for unlabeled data on a target domain. We propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one. Our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings.
arXiv Detail & Related papers (2020-12-08T08:52:00Z)
Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as MAML or Dif-MAML. We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML. Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.