A Machine Learning Approach to Online Fault Classification in HPC
Systems
- URL: http://arxiv.org/abs/2007.14241v1
- Date: Mon, 27 Jul 2020 15:36:56 GMT
- Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea
Bartolini, Andrea Borghesi
- Abstract summary: We propose a fault classification method for HPC systems based on machine learning.
The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner.
We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments.
- Score: 4.642153471124352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As High-Performance Computing (HPC) systems strive towards the exascale goal,
failure rates both at the hardware and software levels will increase
significantly. Thus, detecting and classifying faults in HPC systems as they
occur and initiating corrective actions before they can transform into failures
becomes essential for continued operation. Central to this objective is fault
injection, which is the deliberate triggering of faults in a system so as to
observe their behavior in a controlled environment. In this paper, we propose a
fault classification method for HPC systems based on machine learning. The
novelty of our approach rests with the fact that it can be operated on streamed
data in an online manner, thus opening the possibility to devise and enact
control actions on the target system in real-time. We introduce a high-level,
easy-to-use fault injection tool called FINJ, with a focus on the management of
complex experiments. In order to train and evaluate our machine learning
classifiers, we inject faults into an in-house experimental HPC system using
FINJ, and generate a fault dataset which we describe extensively. Both FINJ and
the dataset are publicly available to facilitate resiliency research in the HPC
systems field. Experimental results demonstrate that our approach allows almost
perfect classification accuracy to be reached for different fault types with
low computational overhead and minimal delay.
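The idea of classifying faults online from streamed data can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify their features or classifiers, so the window statistics, the nearest-centroid classifier, the label names ("healthy", "cpu_hog"), and the sample values below are all hypothetical and chosen only for illustration. The sketch keeps a fixed-size sliding window over a metric stream, summarises each window into a feature vector, and emits a label as soon as each new sample arrives.

```python
from collections import deque
from statistics import mean, pstdev
import math


def window_features(window):
    """Summarise one window of samples into a small feature vector."""
    return (mean(window), pstdev(window), min(window), max(window))


class NearestCentroid:
    """Tiny stand-in classifier (an assumption): one mean feature vector per label."""

    def fit(self, features, labels):
        grouped = {}
        for vec, label in zip(features, labels):
            grouped.setdefault(label, []).append(vec)
        # Average each feature column over all vectors with the same label.
        self.centroids = {
            label: tuple(mean(col) for col in zip(*vecs))
            for label, vecs in grouped.items()
        }
        return self

    def predict(self, vec):
        # Pick the label whose centroid is closest in Euclidean distance.
        return min(self.centroids,
                   key=lambda label: math.dist(vec, self.centroids[label]))


def classify_stream(samples, model, window_size=4):
    """Yield (sample_index, label) each time the sliding window is full."""
    window = deque(maxlen=window_size)
    for i, value in enumerate(samples):
        window.append(value)  # the oldest sample is dropped automatically
        if len(window) == window_size:
            yield i, model.predict(window_features(window))


# Made-up training windows of a single CPU-load metric (hypothetical values).
train_windows = [
    [10, 11, 10, 11],  # healthy: low, steady load
    [12, 11, 12, 11],
    [90, 95, 92, 97],  # "cpu_hog": sustained high load
    [94, 96, 91, 93],
]
train_labels = ["healthy", "healthy", "cpu_hog", "cpu_hog"]
model = NearestCentroid().fit(
    [window_features(w) for w in train_windows], train_labels
)

# Simulated stream: a fault-like load increase begins at index 4.
stream = [10, 11, 12, 11, 93, 95, 94, 96]
predictions = list(classify_stream(stream, model))
for index, label in predictions:
    print(index, label)
```

Because each prediction depends only on the most recent window, the per-sample cost stays constant, which is the property that makes this style of classification usable on live data. A real deployment would use many metrics per node and a stronger classifier.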
Related papers
- Targeted Cause Discovery with Data-Driven Learning [66.86881771339145]
We propose a novel machine learning approach for inferring causal variables of a target variable from observations.
We employ a neural network trained to identify causality through supervised learning on simulated data.
Empirical results demonstrate the effectiveness of our method in identifying causal relationships within large-scale gene regulatory networks.
(2024-08-29T02:21:11Z)
- Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive analysis of the characterization of adversarial inputs, through the lens of formal verification.
We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations.
Our analysis empirically demonstrates how adversarial inputs can affect the safety of a given DRL system with respect to such perturbations.
(2024-02-07T21:58:40Z)
- Effective Intrusion Detection in Heterogeneous Internet-of-Things Networks via Ensemble Knowledge Distillation-based Federated Learning [52.6706505729803]
We introduce Federated Learning (FL) to collaboratively train a decentralized shared model of Intrusion Detection Systems (IDS).
FLEKD enables a more flexible aggregation method than conventional model fusion techniques.
Experiment results show that the proposed approach outperforms local training and traditional FL in terms of both speed and performance.
(2024-01-22T14:16:37Z)
- Unsupervised Learning for Fault Detection of HVAC Systems: An OPTICS-based Approach for Terminal Air Handling Units [1.0878040851638]
This study introduces an unsupervised learning strategy to detect faults in terminal air handling units and their associated systems.
The methodology involves pre-processing historical sensor data using Principal Component Analysis to streamline dimensions.
Results showed that OPTICS consistently surpassed k-means in accuracy across seasons.
(2023-12-18T18:08:54Z)
- Controlling dynamical systems to complex target states using machine learning: next-generation vs. classical reservoir computing [68.8204255655161]
Controlling nonlinear dynamical systems using machine learning makes it possible to drive systems not only into simple behavior like periodicity but also into more complex arbitrary dynamics.
We show first that classical reservoir computing excels at this task.
In a next step, we compare those results based on different amounts of training data to an alternative setup, where next-generation reservoir computing is used instead.
It turns out that while delivering comparable performance for usual amounts of training data, next-generation RC significantly outperforms in situations where only very limited data is available.
(2023-07-14T07:05:17Z)
- A hybrid feature learning approach based on convolutional kernels for ATM fault prediction using event-log data [5.859431341476405]
We present a predictive model based on a convolutional kernel (MiniROCKET and HYDRA) to extract features from event-log data.
The proposed methodology is applied to a significant real-world collected dataset.
The model was integrated into a container-based decision support system to support operators in the timely maintenance of ATMs.
(2023-05-17T08:55:53Z)
- Interactive System-wise Anomaly Detection [66.3766756452743]
Anomaly detection plays a fundamental role in various applications.
It is challenging for existing methods to handle the scenarios where the instances are systems whose characteristics are not readily observed as data.
We develop an end-to-end approach which includes an encoder-decoder module that learns system embeddings.
(2023-04-21T02:20:24Z)
- Online Dictionary Learning Based Fault and Cyber Attack Detection for Power Systems [4.657875410615595]
This paper deals with the event and intrusion detection problem by leveraging a stream data mining classifier.
We first build a dictionary by learning higher-level features from unlabeled data.
Then, the labeled data are represented as sparse linear combinations of learned dictionary atoms.
We capitalize on those sparse codes to train the online classifier along with efficient change detectors.
(2021-08-24T23:17:58Z)
- Detection of Dataset Shifts in Learning-Enabled Cyber-Physical Systems using Variational Autoencoder for Regression [1.5039745292757671]
We propose an approach to detect the dataset shifts effectively for regression problems.
Our approach is based on the inductive conformal anomaly detection and utilizes a variational autoencoder for regression model.
We demonstrate our approach by using an advanced emergency braking system implemented in an open-source simulator for self-driving cars.
(2021-04-14T03:46:37Z)
- TELESTO: A Graph Neural Network Model for Anomaly Classification in Cloud Services [77.454688257702]
Machine learning (ML) and artificial intelligence (AI) are applied to IT system operation and maintenance.
One direction aims at the recognition of re-occurring anomaly types to enable remediation automation.
We propose a method that is invariant to dimensionality changes of given data.
(2021-02-25T14:24:49Z)
- Assurance Monitoring of Cyber-Physical Systems with Machine Learning Components [2.1320960069210484]
We investigate how to use the conformal prediction framework for assurance monitoring of Cyber-Physical Systems.
In order to handle high-dimensional inputs in real-time, we compute nonconformity scores using embedding representations of the learned models.
By leveraging conformal prediction, the approach provides well-calibrated confidence and can allow monitoring that ensures a bounded small error rate.
(2020-01-14T19:34:51Z)