Related papers: Ensemble Method for System Failure Detection Using Large-Scale Telemetry Data

Ensemble Method for System Failure Detection Using Large-Scale Telemetry Data

URL: http://arxiv.org/abs/2407.00048v1
Date: Fri, 7 Jun 2024 06:35:17 GMT
Title: Ensemble Method for System Failure Detection Using Large-Scale Telemetry Data
Authors: Priyanka Mudgal, Rita H. Wouhaybi,
Abstract summary: This research paper presents an in-depth analysis of extensive system telemetry data, proposing an ensemble methodology for detecting system failures. The proposed ensemble technique integrates a diverse set of algorithms, including Long Short-Term Memory (LSTM) networks, isolation forests, one-class support vector machines (OCSVM), and local outlier factors (LOF) Experimental evaluations demonstrate the remarkable efficacy of our models, achieving a notable detection rate in identifying system failures.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The growing reliance on computer systems, particularly personal computers (PCs), necessitates heightened reliability to uphold user satisfaction. This research paper presents an in-depth analysis of extensive system telemetry data, proposing an ensemble methodology for detecting system failures. Our approach entails scrutinizing various parameters of system metrics, encompassing CPU utilization, memory utilization, disk activity, CPU temperature, and pertinent system metadata such as system age, usage patterns, core count, and processor type. The proposed ensemble technique integrates a diverse set of algorithms, including Long Short-Term Memory (LSTM) networks, isolation forests, one-class support vector machines (OCSVM), and local outlier factors (LOF), to effectively discern system failures. Specifically, the LSTM network with other machine learning techniques is trained on Intel Computing Improvement Program (ICIP) telemetry software data to distinguish between normal and failed system patterns. Experimental evaluations demonstrate the remarkable efficacy of our models, achieving a notable detection rate in identifying system failures. Our research contributes to advancing the field of system reliability and offers practical insights for enhancing user experience in computing environments.

Related papers

A quantitative framework for evaluating architectural patterns in ML systems [49.1574468325115]
This study proposes a framework for quantitative assessment of architectural patterns in ML systems. We focus on scalability and performance metrics for cost-effective CPU-based inference.
arXiv Detail & Related papers (2025-01-20T15:30:09Z)
Effective Intrusion Detection in Heterogeneous Internet-of-Things Networks via Ensemble Knowledge Distillation-based Federated Learning [52.6706505729803]
We introduce Federated Learning (FL) to collaboratively train a decentralized shared model of Intrusion Detection Systems (IDS) FLEKD enables a more flexible aggregation method than conventional model fusion techniques. Experiment results show that the proposed approach outperforms local training and traditional FL in terms of both speed and performance.
arXiv Detail & Related papers (2024-01-22T14:16:37Z)
Random resistive memory-based deep extreme point learning machine for unified visual processing [67.51600474104171]
We propose a novel hardware-software co-design, random resistive memory-based deep extreme point learning machine (DEPLM) Our co-design system achieves huge energy efficiency improvements and training cost reduction when compared to conventional systems.
arXiv Detail & Related papers (2023-12-14T09:46:16Z)
A Multi-Level, Multi-Scale Visual Analytics Approach to Assessment of Multifidelity HPC Systems [17.246865176910045]
Hardware system events and behaviors are crucial to improving the robustness and reliability of these systems. In this work, we aim to build a holistic analytical system that helps make sense of such massive data. This end-to-end log analysis system, coupled with visual analytics support, allows users to glean and promptly extract supercomputer usage and error patterns.
arXiv Detail & Related papers (2023-06-15T19:23:50Z)
FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems [61.335229621081346]
Federated Learning (FL) has become a viable technique for realizing privacy-enhancing distributed deep learning on the network edge. In this paper, we propose FLEdge, which complements existing FL benchmarks by enabling a systematic evaluation of client capabilities.
arXiv Detail & Related papers (2023-06-08T13:11:20Z)
Interactive System-wise Anomaly Detection [66.3766756452743]
Anomaly detection plays a fundamental role in various applications. It is challenging for existing methods to handle the scenarios where the instances are systems whose characteristics are not readily observed as data. We develop an end-to-end approach which includes an encoder-decoder module that learns system embeddings.
arXiv Detail & Related papers (2023-04-21T02:20:24Z)
MLPerfTM HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems [32.621917787044396]
We introduceerf HPC, a benchmark suite of scientific machine learning training applications driven by the MLCommonsTM Association. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior.
arXiv Detail & Related papers (2021-10-21T20:30:12Z)
An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols. We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z)
Multi Agent System for Machine Learning Under Uncertainty in Cyber Physical Manufacturing System [78.60415450507706]
Recent advancements in predictive machine learning has led to its application in various use cases in manufacturing. Most research focused on maximising predictive accuracy without addressing the uncertainty associated with it. In this paper, we determine the sources of uncertainty in machine learning and establish the success criteria of a machine learning system to function well under uncertainty.
arXiv Detail & Related papers (2021-07-28T10:28:05Z)
Monitoring and Diagnosability of Perception Systems [21.25149064251918]
We propose a mathematical model for runtime monitoring and fault detection and identification in perception systems. We demonstrate our monitoring system, dubbed PerSyS, in realistic simulations using the LGSVL self-driving simulator and the Apollo Auto autonomy software stack.
arXiv Detail & Related papers (2020-11-11T23:03:14Z)
A Comparative Study of AI-based Intrusion Detection Techniques in Critical Infrastructures [4.8041243535151645]
We present a comparative study of Artificial Intelligence (AI)-driven intrusion detection systems for wirelessly connected sensors that track crucial applications. Specifically, we present an in-depth analysis of the use of machine learning, deep learning and reinforcement learning solutions to recognize intrusive behavior in the collected traffic. Results present the performance metrics for three different IDSs namely the Adaptively Supervised and Clustered Hybrid IDS, Boltzmann Machine-based Clustered IDS and Q-learning based IDS.
arXiv Detail & Related papers (2020-07-24T20:55:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.