Large-scale End-of-Life Prediction of Hard Disks in Distributed
Datacenters
- URL: http://arxiv.org/abs/2303.08955v2
- Date: Mon, 20 Mar 2023 22:35:49 GMT
- Authors: Rohan Mohapatra, Austin Coursey and Saptarshi Sengupta
- Abstract summary: Large-scale predictive analyses are performed using severely skewed health statistics data.
We present an encoder-decoder LSTM model where the context gained from understanding health-statistics sequences aids in predicting an output sequence of the number of days remaining before a disk potentially fails.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On a daily basis, data centers process huge volumes of data backed by the
proliferation of inexpensive hard disks. Data stored on these disks serve a
range of critical functional needs, from finance and healthcare to aerospace.
As such, premature disk failure and consequent loss of data can be
catastrophic. To mitigate the risk of failures, cloud storage providers perform
condition-based monitoring and replace hard disks before they fail. By
estimating the remaining useful life of hard disk drives, one can predict the
time-to-failure of a particular device and replace it at the right time,
ensuring maximum utilization whilst reducing operational costs. In this work,
large-scale predictive analyses are performed using severely skewed health
statistics data by incorporating customized feature engineering and a suite of
sequence learners. Past work suggests that LSTMs are an excellent approach to
predicting remaining useful life. To this end, we present an encoder-decoder
LSTM model where the context gained from understanding health statistics
sequences aids in predicting an output sequence of the number of days remaining
before a disk potentially fails. The models developed in this work are trained
and tested across an exhaustive set of all of the 10 years of S.M.A.R.T. health
data in circulation from Backblaze and on a wide variety of disk instances. This
work closes the knowledge gap on what full-scale training achieves on thousands of
devices and advances the state-of-the-art by providing tangible metrics for
evaluation and generalization for practitioners looking to extend their
workflow to all years of health data in circulation across disk manufacturers.
The encoder-decoder LSTM posted an RMSE of 0.83 during training and 0.86 during
testing over the exhaustive 10-year data while being able to generalize
competitively over other drives from the Seagate family.
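As a concrete illustration of the sequence-to-sequence framing described in the abstract, the sketch below builds lookback windows of synthetic S.M.A.R.T.-style health statistics and aligned remaining-useful-life (RUL) target sequences. The window lengths, feature count, and function name are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def make_seq2seq_dataset(health, fail_day, lookback=14, horizon=7):
    """Slice one disk's health-statistics series into encoder inputs
    (a lookback window of daily features) and decoder targets (RUL in
    days over the next `horizon` days)."""
    n_days, n_feats = health.shape
    X, Y = [], []
    for t in range(lookback, n_days - horizon + 1):
        X.append(health[t - lookback:t])                       # past health window
        Y.append([max(fail_day - d, 0) for d in range(t, t + horizon)])
    return np.array(X), np.array(Y)

# Toy example: 30 days of 5 SMART-like features for a disk failing on day 29
rng = np.random.default_rng(0)
health = rng.normal(size=(30, 5))
X, Y = make_seq2seq_dataset(health, fail_day=29)

print(X.shape)  # (10, 14, 5)  -> 10 samples of 14-day, 5-feature windows
print(Y.shape)  # (10, 7)      -> 10 aligned 7-day RUL target sequences
print(Y[0])     # [15 14 13 12 11 10  9]
```

Pairs of this shape are what an encoder-decoder sequence learner consumes: the encoder summarizes the health window, and the decoder emits the day-by-day countdown.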
Related papers
- Using Data Redundancy Techniques to Detect and Correct Errors in Logical Data [0.0]
We study the RAID scheme used with disk arrays and adapt it for use with logical data.
We demonstrate robust performance in recovering arbitrary faults in large archive files only using a small fraction of redundant data.
arXiv Detail & Related papers (2025-03-20T06:07:13Z) - Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z) - TFBEST: Dual-Aspect Transformer with Learnable Positional Encoding for
Failure Prediction [1.223779595809275]
We propose a Temporal-fusion Bi-encoder Self-attention Transformer (TFBEST) for predicting failures in hard-drives.
It is an encoder-decoder based deep learning technique that enhances the context gained from understanding health statistics.
Experiments on HDD data show that our method significantly outperforms the state-of-the-art RUL prediction methods.
arXiv Detail & Related papers (2023-09-06T01:03:14Z) - Fast Machine Unlearning Without Retraining Through Selective Synaptic
Dampening [51.34904967046097]
We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z) - Machine Learning Force Fields with Data Cost Aware Training [94.78998399180519]
Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation.
Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels.
We propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data.
arXiv Detail & Related papers (2023-06-05T04:34:54Z) - Enterprise Disk Drive Scrubbing Based on Mondrian Conformal Predictors [1.290382979353427]
Scrubbing the entire storage array at once can adversely impact system performance.
We propose a selective disk scrubbing method that enhances the overall reliability and power efficiency in data centers.
By scrubbing just 22.7% of the total storage disks, we can achieve optimized energy consumption and reduce the carbon footprint of the data center.
arXiv Detail & Related papers (2023-06-01T04:11:22Z) - Remaining Useful Life Estimation of Hard Disk Drives using Bidirectional
LSTM Networks [0.0]
We introduce methods of extracting meaningful attributes associated with operational failure and of pre-processing health statistics data.
We use a Bidirectional LSTM with a multi-day look back period to learn the temporal progression of health indicators and baseline them against vanilla LSTM and Random Forest models.
Our approach can predict the occurrence of disk failure with an accuracy of 96.4% on test data from 60 days before failure.
arXiv Detail & Related papers (2021-09-11T19:26:07Z) - Online detection of failures generated by storage simulator [2.3859858429583665]
We create a Go-based (golang) package for simulating the behavior of modern storage infrastructure.
The package's flexible structure allows us to create a model of a real-world storage system with a number of components.
To discover failures in the time series distribution generated by the simulator, we modified a change point detection algorithm that works in online mode.
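The paper's modified change-point algorithm is not specified in this summary; as a generic illustration of online change-point detection over a simulated stream, a textbook one-sided CUSUM detector might look like this (the drift and threshold values are illustrative assumptions):

```python
import random

class CusumDetector:
    """One-sided CUSUM: flags a change point when the cumulative
    positive deviation from a reference mean exceeds a threshold."""

    def __init__(self, mean, drift=0.5, threshold=8.0):
        self.mean = mean          # expected level of the in-control stream
        self.drift = drift        # slack: small deviations are ignored
        self.threshold = threshold
        self.cusum = 0.0

    def update(self, x):
        """Consume one observation; return True if a change is detected."""
        self.cusum = max(0.0, self.cusum + (x - self.mean) - self.drift)
        if self.cusum > self.threshold:
            self.cusum = 0.0      # reset after an alarm
            return True
        return False

# Stream whose mean shifts upward at index 50, mimicking a degrading component
random.seed(1)
stream = [random.gauss(0, 1) for _ in range(50)] + \
         [random.gauss(3, 1) for _ in range(50)]

det = CusumDetector(mean=0.0)
alarms = [i for i, x in enumerate(stream) if det.update(x)]
print(alarms[0])  # first alarm fires shortly after the shift at i=50
```

Because the detector keeps only a single running statistic, it processes each observation in constant time, which is what makes it suitable for online use against a live simulator.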
arXiv Detail & Related papers (2021-01-18T14:56:53Z) - The Life and Death of SSDs and HDDs: Similarities, Differences, and
Prediction Models [1.6795461001108098]
We present a comparative study of hard disk drives (HDDs) and solid state drives (SSDs) that constitute typical storage in data centers.
We characterize the workload conditions that lead to failures and illustrate that their root causes differ from common expectation.
We develop several machine learning failure prediction models that are shown to be surprisingly accurate.
arXiv Detail & Related papers (2020-12-22T21:50:32Z) - TadGAN: Time Series Anomaly Detection Using Generative Adversarial
Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs).
To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics.
To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
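TadGAN's exact scoring functions are not given in this summary; as a hedged sketch of the general idea, one common technique scores each point of a time series by its reconstruction error and flags points whose error is far from the norm (the threshold of 2 standard deviations is an illustrative assumption):

```python
import statistics

def anomaly_scores(series, reconstruction):
    """Point-wise reconstruction error, z-scored across the series."""
    errors = [abs(a - b) for a, b in zip(series, reconstruction)]
    mu = statistics.mean(errors)
    sigma = statistics.pstdev(errors) or 1.0   # guard against zero spread
    return [(e - mu) / sigma for e in errors]

# A smooth reconstruction misses the spike at index 5
series         = [1.0, 1.1, 0.9, 1.0, 1.1, 9.0, 1.0, 0.9, 1.1, 1.0]
reconstruction = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

scores = anomaly_scores(series, reconstruction)
flagged = [i for i, s in enumerate(scores) if s > 2.0]
print(flagged)  # [5]
```

In a GAN-based setup the reconstruction would come from passing the window through the generator; here a fixed reconstruction stands in for it so the scoring step is isolated.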
arXiv Detail & Related papers (2020-09-16T15:52:04Z) - Superiority of Simplicity: A Lightweight Model for Network Device
Workload Prediction [58.98112070128482]
We propose a lightweight solution for series prediction based on historic observations.
It consists of a heterogeneous ensemble method composed of two models - a neural network and a mean predictor.
It achieves an overall $R^2$ score of 0.10 on the available FedCSIS 2020 challenge dataset.
arXiv Detail & Related papers (2020-07-07T15:44:16Z) - Data Mining with Big Data in Intrusion Detection Systems: A Systematic
Literature Review [68.15472610671748]
Cloud computing has become a powerful and indispensable technology for complex, high performance and scalable computation.
The rapid rate and volume of data creation has begun to pose significant challenges for data management and security.
The design and deployment of intrusion detection systems (IDS) in the big data setting has, therefore, become a topic of importance.
arXiv Detail & Related papers (2020-05-23T20:57:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.