Related papers: ClusterRCA: Network Failure Diagnosis in HPC Systems Using Multimodal Data

URL: http://arxiv.org/abs/2506.20673v1
Date: Tue, 17 Jun 2025 16:52:09 GMT
Title: ClusterRCA: Network Failure Diagnosis in HPC Systems Using Multimodal Data
Authors: Yongqian Sun, Xijie Pan, Xiao Xiong, Lei Tao, Jiaju Wang, Shenglin Zhang, Yuan Yuan, Yuqi Li, Kunlin Jian,
Abstract summary: This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches. Experiments on datasets collected by a top-tier global HPC device vendor show ClusterRCA achieves high accuracy in diagnosing network failure for HPC systems.
Score: 10.100878764617747
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and lack of accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches. It constructs a failure graph from the output of the state classifier and then performs a customized random walk on that graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show ClusterRCA achieves high accuracy in diagnosing network failure for HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
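
Illustrative sketch (not from the paper): the abstract does not say how the random walk is customized, so the Python below only shows the general idea with a plain random walk with restart over a small weighted graph. The node names, edge weights, and the function name random_walk_with_restart are all hypothetical. The weights stand in for classifier outputs on NIC pairs, with each edge pointing toward the node suspected of causing the anomaly.

```python
from collections import defaultdict


def random_walk_with_restart(edges, restart=0.15, iters=100):
    """Score nodes by the stationary distribution of a random walk.

    edges: dict mapping (src, dst) -> non-negative weight. Here the weights
    are hypothetical anomaly scores, NOT values defined by ClusterRCA.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = defaultdict(float)
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    # Start from a uniform distribution over nodes.
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Restart mass is spread uniformly over all nodes.
        nxt = {n: restart / len(nodes) for n in nodes}
        # Nodes without outgoing edges spread their mass uniformly.
        for n in nodes:
            if out_weight[n] == 0:
                for m in nodes:
                    nxt[m] += (1 - restart) * score[n] / len(nodes)
        # Other nodes pass mass along edges in proportion to edge weight.
        for (src, dst), w in edges.items():
            if out_weight[src] > 0:
                nxt[dst] += (1 - restart) * score[src] * w / out_weight[src]
        score = nxt
    return score


# Hypothetical failure graph: most anomaly weight points at "node-c".
failure_edges = {
    ("node-a", "node-c"): 0.9,
    ("node-a", "node-b"): 0.1,
    ("node-b", "node-c"): 0.8,
}
scores = random_walk_with_restart(failure_edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph the walk concentrates on node-c, which would be reported as the most likely culprit. A real system would derive the edge weights and transition rules from its own classifier and topology.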

Related papers

Clustered Federated Learning for Generalizable FDIA Detection in Smart Grids with Heterogeneous Data [9.222461989780735]
False Data Injection Attacks (FDIAs) pose severe security risks to smart grids. Traditional centralized training approaches not only face privacy risks and data sharing constraints but also incur high transmission costs. This paper proposes Federated Cluster Average (FedClusAvg) to improve FDIA detection in Non-IID and resource-constrained environments.
arXiv Detail & Related papers (2025-07-20T15:10:43Z)
Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN. We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z)
Hypergraph-based multi-scale spatio-temporal graph convolution network for Time-Series anomaly detection [8.878898677348086]
Multi-dimensional time series anomaly detection technology plays an important role in many fields including aerospace, water treatment, cloud service providers, etc. It is becoming increasingly challenging to perform effective and accurate anomaly detection in high-dimensional and complex data sets. We propose a hypergraph-based spatio-temporal graph convolutional network model, STGCN_Hyper, which explicitly captures high-order, multi-hop correlations between multiple variables. Our model can flexibly learn the multi-scale time series features in the data and the dependencies between features, and outperforms most existing baseline models in terms of precision, recall, and F1-score on anomaly detection.
arXiv Detail & Related papers (2024-10-29T17:19:18Z)
Unsupervised Learning for Fault Detection of HVAC Systems: An OPTICS-based Approach for Terminal Air Handling Units [1.0878040851638]
This study introduces an unsupervised learning strategy to detect faults in terminal air handling units and their associated systems. The methodology involves pre-processing historical sensor data using Principal Component Analysis to streamline dimensions. Results showed that OPTICS consistently surpassed k-means in accuracy across seasons.
arXiv Detail & Related papers (2023-12-18T18:08:54Z)
Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers [0.0]
Key Performance Indicators (KPIs) generate a huge number of monitoring tasks that give data about CPU usage, memory usage, network traffic, or other sensors that monitor hardware. The main contribution of this paper is to identify which metrics (KPIs) are the most appropriate to identify and classify different types of jobs according to their behavior in the HPC system. We have concluded that (i) the metrics (KPIs) related to network (interface) traffic monitoring provide the best cohesion and separation to cluster HPC jobs, and (ii) hierarchical clustering algorithms are the most suitable for this task.
arXiv Detail & Related papers (2023-12-11T17:31:46Z)
NetRCA: An Effective Network Fault Cause Localization Algorithm [22.88986905436378]
Localizing root cause of network faults is crucial to network operation and maintenance. We propose a novel algorithm named NetRCA to deal with this problem. Experiments and analysis are conducted on the real-world dataset from ICASSP 2022 AIOps Challenge.
arXiv Detail & Related papers (2022-02-23T02:03:35Z)
Self-supervised Contrastive Attributed Graph Clustering [110.52694943592974]
We propose a novel attributed graph clustering network, namely Self-supervised Contrastive Attributed Graph Clustering (SCAGC). In SCAGC, a self-supervised contrastive loss that leverages inaccurate clustering labels is designed for node representation learning. For the OOS nodes, SCAGC can directly calculate their clustering labels.
arXiv Detail & Related papers (2021-10-15T03:25:28Z)
Attention-driven Graph Clustering Network [49.040136530379094]
We propose a novel deep clustering method named Attention-driven Graph Clustering Network (AGCN). AGCN exploits a heterogeneous-wise fusion module to dynamically fuse the node attribute feature and the topological graph feature. AGCN can jointly perform feature learning and cluster assignment in an unsupervised fashion.
arXiv Detail & Related papers (2021-08-12T02:30:38Z)
TELESTO: A Graph Neural Network Model for Anomaly Classification in Cloud Services [77.454688257702]
Machine learning (ML) and artificial intelligence (AI) are applied to IT system operation and maintenance. One direction aims at the recognition of re-occurring anomaly types to enable remediation automation. We propose a method that is invariant to dimensionality changes of given data.
arXiv Detail & Related papers (2021-02-25T14:24:49Z)
TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs). To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics. To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)
Contextual-Bandit Anomaly Detection for IoT Data in Distributed Hierarchical Edge Computing [65.78881372074983]
IoT devices can hardly afford complex deep neural network (DNN) models, and offloading anomaly detection tasks to the cloud incurs long delays. We propose and build a demo for an adaptive anomaly detection approach for distributed hierarchical edge computing (HEC) systems. We show that our proposed approach significantly reduces detection delay without sacrificing accuracy, as compared to offloading detection tasks to the cloud.
arXiv Detail & Related papers (2020-04-15T06:13:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.