Related papers: TFBEST: Dual-Aspect Transformer with Learnable Positional Encoding for Failure Prediction

TFBEST: Dual-Aspect Transformer with Learnable Positional Encoding for Failure Prediction

URL: http://arxiv.org/abs/2309.02641v1
Date: Wed, 6 Sep 2023 01:03:14 GMT
Title: TFBEST: Dual-Aspect Transformer with Learnable Positional Encoding for Failure Prediction
Authors: Rohan Mohapatra and Saptarshi Sengupta
Abstract summary: We propose a Temporal-fusion Bi-encoder Self-attention Transformer (TFBEST) for predicting failures in hard-drives. It is an encoder-decoder based deep learning technique that enhances the context gained from understanding health statistics. Experiments on HDD data show that our method significantly outperforms the state-of-the-art RUL prediction methods.
Score: 1.223779595809275
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Hard Disk Drive (HDD) failures in datacenters are costly - from catastrophic data loss to a question of goodwill, stakeholders want to avoid it like the plague. An important tool in proactively monitoring against HDD failure is timely estimation of the Remaining Useful Life (RUL). To this end, the Self-Monitoring, Analysis and Reporting Technology employed within HDDs (S.M.A.R.T.) provide critical logs for long-term maintenance of the security and dependability of these essential data storage devices. Data-driven predictive models in the past have used these S.M.A.R.T. logs and CNN/RNN based architectures heavily. However, they have suffered significantly in providing a confidence interval around the predicted RUL values as well as in processing very long sequences of logs. In addition, some of these approaches, such as those based on LSTMs, are inherently slow to train and have tedious feature engineering overheads. To overcome these challenges, in this work we propose a novel transformer architecture - a Temporal-fusion Bi-encoder Self-attention Transformer (TFBEST) for predicting failures in hard-drives. It is an encoder-decoder based deep learning technique that enhances the context gained from understanding health statistics sequences and predicts a sequence of the number of days remaining before a disk potentially fails. In this paper, we also provide a novel confidence margin statistic that can help manufacturers replace a hard-drive within a time frame. Experiments on Seagate HDD data show that our method significantly outperforms the state-of-the-art RUL prediction methods during testing over the exhaustive 10-year data from Backblaze (2013-present). Although validated on HDD failure prediction, the TFBEST architecture is well-suited for other prognostics applications and may be adapted for allied regression problems.

Related papers

Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening [51.34904967046097]
Selective Synaptic Dampening (SSD) is a fast, performant, and does not require long-term storage of the training data. We present a novel two-step, post hoc, retrain-free approach to machine unlearning which is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z)
CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting [50.23240107430597]
We design a special Transformer, i.e., Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of CI type Transformer in time series forecasting. First, CARD introduces a channel-aligned attention structure that allows it to capture both temporal correlations among signals. Second, in order to efficiently utilize the multi-scale knowledge, we design a token blend module to generate tokens with different resolutions. Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue.
arXiv Detail & Related papers (2023-05-20T05:16:31Z)
Large-scale End-of-Life Prediction of Hard Disks in Distributed Datacenters [0.0]
Large-scale predictive analyses are performed using severely skewed health statistics data. We present an encoder-decoder LSTM model where the context gained from understanding health statistics sequences aid in predicting an output sequence of the number of days remaining before a disk potentially fails.
arXiv Detail & Related papers (2023-03-15T21:55:07Z)
DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model [12.335763358698564]
We propose DeepFT to proactively avoid system overloads and their adverse effects. DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system. It offers a highly scalable solution as the model size scales by only 3 and 1 percent per unit increase in the number of active tasks and hosts.
arXiv Detail & Related papers (2022-12-02T16:51:58Z)
GNN4REL: Graph Neural Networks for Predicting Circuit Reliability Degradation [7.650966670809372]
We employ graph neural networks (GNNs) to accurately estimate the impact of process variations and device aging on the delay of any path within a circuit. GNN4REL is trained on a FinFET technology model that is calibrated against industrial 14nm measurement data. We successfully estimate delay degradations of all paths -- notably within seconds -- with a mean absolute error down to 0.01 percentage points.
arXiv Detail & Related papers (2022-08-04T20:09:12Z)
LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak Supervision [63.08516384181491]
We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts. Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect. Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.
arXiv Detail & Related papers (2021-11-02T15:16:08Z)
Remaining Useful Life Estimation of Hard Disk Drives using Bidirectional LSTM Networks [0.0]
We introduce methods of extracting meaningful attributes associated with operational failure and of pre-processing health statistics data. We use a Bidirectional LSTM with a multi-day look back period to learn the temporal progression of health indicators and baseline them against vanilla LSTM and Random Forest models. Our approach can predict the occurrence of disk failure with an accuracy of 96.4% considering test data 60 days before failure.
arXiv Detail & Related papers (2021-09-11T19:26:07Z)
Uncertainty-aware Remaining Useful Life predictor [57.74855412811814]
Remaining Useful Life (RUL) estimation is the problem of inferring how long a certain industrial asset can be expected to operate. In this work, we consider Deep Gaussian Processes (DGPs) as possible solutions to the aforementioned limitations. The performance of the algorithms is evaluated on the N-CMAPSS dataset from NASA for aircraft engines.
arXiv Detail & Related papers (2021-04-08T08:50:44Z)
Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users. We propose a framework for anomaly detection in log data, as a major troubleshooting source of system information.
arXiv Detail & Related papers (2021-02-23T09:17:05Z)
Online detection of failures generated by storage simulator [2.3859858429583665]
We create a Go-based (golang) package for simulating the behavior of modern storage infrastructure. The package's flexible structure allows us to create a model of a real-world storage system with a number of components. To discover failures in the time series distribution generated by the simulator, we modified a change point detection algorithm that works in online mode.
arXiv Detail & Related papers (2021-01-18T14:56:53Z)
TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs) To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics. To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.