Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers
- URL: http://arxiv.org/abs/2312.06546v1
- Date: Mon, 11 Dec 2023 17:31:46 GMT
- Title: Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers
- Authors: Mohamed S. Halawa, Rebeca P. Díaz-Redondo, and Ana Fernández-Vilas
- Abstract summary: HPC monitoring tasks generate a huge number of Key Performance Indicators (KPIs) that give data about CPU usage, memory usage, network traffic, or other sensors that monitor hardware.
The main contribution in this paper is to identify which metric/s (KPIs) is/are the most appropriate to identify/classify different types of jobs according to their behavior in the HPC system.
We have concluded that (i) those metrics (KPIs) related to the Network (interface) traffic monitoring provide the best cohesion and separation to cluster HPC jobs, and (ii) hierarchical clustering algorithms are the most suitable for this task.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performance analysis is an essential task in High-Performance Computing (HPC)
systems and it is applied for different purposes such as anomaly detection,
optimal resource allocation, and budget planning. HPC monitoring tasks generate
a huge number of Key Performance Indicators (KPIs) to supervise the status of
the jobs running in these systems. KPIs give data about CPU usage, memory
usage, network (interface) traffic, or other sensors that monitor the hardware.
Analyzing this data, it is possible to obtain insightful information about
running jobs, such as their characteristics, performance, and failures. The
main contribution in this paper is to identify which metric/s (KPIs) is/are the
most appropriate to identify/classify different types of jobs according to
their behavior in the HPC system. With this aim, we have applied different
clustering techniques (partition and hierarchical clustering algorithms) using
a real dataset from the Galician Computation Center (CESGA). We have concluded
that (i) those metrics (KPIs) related to the Network (interface) traffic
monitoring provide the best cohesion and separation to cluster HPC jobs, and
(ii) hierarchical clustering algorithms are the most suitable for this task.
Our approach was validated using a different real dataset from the same HPC
center.
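The abstract compares clusterings by their cohesion and separation but does not name the score used. As a minimal sketch, assuming the silhouette coefficient as that measure and average-linkage agglomerative clustering as the hierarchical method, the idea can be illustrated as follows. The job feature values and the function names `agglomerative` and `silhouette` are hypothetical and are not taken from the paper or the CESGA dataset.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, k):
    """Average-linkage agglomerative clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(
                euclidean(points[i], points[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            dist = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or dist < best[0]:
                best = (dist, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge b into a (a < b always holds)
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters.

    For each point: a = mean distance to its own cluster (cohesion),
    b = mean distance to the nearest other cluster (separation).
    """
    scores = []
    for i, p in enumerate(points):
        own = [
            euclidean(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(euclidean(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical per-job features, e.g. (mean network traffic, mean CPU usage).
jobs = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
        (5.00, 5.10), (5.20, 4.90), (4.90, 5.00)]
labels = agglomerative(jobs, 2)
score = silhouette(jobs, labels)
print(labels, round(score, 3))
```

A score close to 1 indicates compact, well-separated clusters. Repeating the computation for each candidate KPI subset and each clustering algorithm would give one way to rank them, under the assumptions stated above.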
Related papers
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- KPIs-Based Clustering and Visualization of HPC jobs: a Feature Reduction Approach [0.0]
HPC systems need to be constantly monitored to ensure their stability.
The monitoring systems collect a tremendous amount of data about different parameters or Key Performance Indicators (KPIs), such as resource usage, IO waiting time, etc.
A proper analysis of this data, usually stored as time series, can provide insight in choosing the right management strategies as well as the early detection of issues.
arXiv Detail & Related papers (2023-12-11T17:13:54Z)
- PolicyClusterGCN: Identifying Efficient Clusters for Training Graph Convolutional Networks [23.437482392702627]
Graph convolutional networks (GCNs) have achieved huge success in several machine learning (ML) tasks on graph-structured data.
We propose PolicyClusterGCN, an online RL framework that can identify good clusters for GCN training.
We develop a novel Markov Decision Process (MDP) formulation that allows the policy network to predict "importance" weights on the edges.
arXiv Detail & Related papers (2023-06-25T22:17:25Z)
- ClusterNet: A Perception-Based Clustering Model for Scattered Data [16.326062082938215]
Cluster separation in scatterplots is a task that is typically tackled by widely used clustering techniques.
We propose a learning strategy which directly operates on scattered data.
We train ClusterNet, a point-based deep learning model, to reflect human perception of cluster separability.
arXiv Detail & Related papers (2023-04-27T13:41:12Z)
- Task-Oriented Over-the-Air Computation for Multi-Device Edge AI [57.50247872182593]
6G networks supporting edge AI feature task-oriented techniques that focus on the effective and efficient execution of AI tasks.
A task-oriented over-the-air computation (AirComp) scheme is proposed in this paper for a multi-device split-inference system.
arXiv Detail & Related papers (2022-11-02T16:35:14Z)
- Task-Oriented Sensing, Computation, and Communication Integration for Multi-Device Edge AI [108.08079323459822]
This paper studies a new multi-device edge artificial-intelligence (AI) system, which jointly exploits AI model split inference and integrated sensing and communication (ISAC).
We measure the inference accuracy by adopting an approximate but tractable metric, namely discriminant gain.
arXiv Detail & Related papers (2022-07-03T06:57:07Z)
- Random projections and Kernelised Leave One Cluster Out Cross-Validation: Universal baselines and evaluation tools for supervised machine learning for materials properties [10.962094053749093]
Leave one cluster out cross-validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials.
We present a thorough comparison between composition-based representations, and investigate how kernel approximation functions can be used to enhance LOCO-CV applications.
We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception.
arXiv Detail & Related papers (2022-06-17T15:39:39Z)
- Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning [83.66080019570461]
We propose two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty.
We show that these metrics have higher correlations with normalized task solvability scores than a variety of alternatives.
These metrics can also be used for fast and compute-efficient optimizations of key design parameters.
arXiv Detail & Related papers (2021-03-23T17:49:50Z)
- Cross-Gradient Aggregation for Decentralized Learning from Non-IID data [34.23789472226752]
Decentralized learning enables a group of collaborative agents to learn models using a distributed dataset without the need for a central parameter server.
We propose Cross-Gradient Aggregation (CGA), a novel decentralized learning algorithm.
We show superior learning performance of CGA over existing state-of-the-art decentralized learning algorithms.
arXiv Detail & Related papers (2021-03-02T21:58:12Z)
- Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [119.88565565454378]
Unsupervised domain adaptation (UDA) is to learn classification models that make predictions for unlabeled data on a target domain.
We propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one.
Our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings.
arXiv Detail & Related papers (2020-12-08T08:52:00Z)
- Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as diffusion-based MAML or Dif-MAML.
We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML.
Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)