Large-scale memory failure prediction using mcelog-based Data Mining and
Machine Learning
- URL: http://arxiv.org/abs/2105.04547v1
- Date: Sat, 24 Apr 2021 11:38:05 GMT
- Title: Large-scale memory failure prediction using mcelog-based Data Mining and
Machine Learning
- Authors: Chengdong Yao
- Abstract summary: In the data center, unexpected downtime caused by memory failures can lead to a decline in the stability of the server.
This paper compares and summarizes several commonly used techniques and the improvements they bring.
The proposed single model placed 15th in the 2nd Alibaba Cloud AIOps Competition.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In the data center, unexpected downtime caused by memory failures can lead to
a decline in the stability of the server and even the entire information
technology infrastructure, which harms the business. Accurately predicting
memory failures in advance has therefore become one of the most important
problems studied in the data center. However, memory failure prediction in a
production system must cope with technical problems such as heavy data noise and
extreme imbalance between positive and negative samples, while also ensuring the
long-term stability of the algorithm. This paper compares and summarizes several
commonly used techniques and the improvements they bring. The proposed single
model placed 15th in the 2nd Alibaba Cloud AIOps Competition, held as part of the
25th Pacific-Asia Conference on Knowledge Discovery and Data Mining.
Related papers
- Towards Robust Stability Prediction in Smart Grids: GAN-based Approach under Data Constraints and Adversarial Challenges [53.2306792009435]
We introduce a novel framework to detect instability in smart grids by employing only stable data.
It relies on a Generative Adversarial Network (GAN) where the generator is trained to create instability data that are used along with stable data to train the discriminator.
Our solution, tested on a dataset composed of real-world stable and unstable samples, achieves accuracy up to 97.5% in predicting grid stability and up to 98.9% in detecting adversarial attacks.
arXiv Detail & Related papers (2025-01-27T20:48:25Z)
- Bridging Smart Meter Gaps: A Benchmark of Statistical, Machine Learning and Time Series Foundation Models for Data Imputation [0.0]
Gaps in time series data in smart grids can bias consumption analyses and hinder reliable predictions.
Generative Artificial Intelligence offers promising solutions that may outperform traditional statistical methods.
arXiv Detail & Related papers (2025-01-13T12:41:27Z)
- Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning [59.29849532966454]
We propose Pseudo-Probability Unlearning (PPU), a novel method that enables models to forget data in a privacy-preserving manner.
Our method achieves over 20% improvements in forgetting error compared to the state-of-the-art.
arXiv Detail & Related papers (2024-11-04T21:27:06Z)
- Digital Twin-Assisted Data-Driven Optimization for Reliable Edge Caching in Wireless Networks [60.54852710216738]
We introduce a novel digital twin-assisted optimization framework, called D-REC, to ensure reliable caching in nextG wireless networks.
By incorporating reliability modules into a constrained decision process, D-REC can adaptively adjust actions, rewards, and states to comply with advantageous constraints.
arXiv Detail & Related papers (2024-06-29T02:40:28Z)
- F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data [65.6499834212641]
We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm.
By considering domain similarities through task-specific metadata, our model improved generalization, where the excess risk decreases as the number of training tasks increases.
Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.
arXiv Detail & Related papers (2024-06-23T21:28:50Z)
- Computationally and Memory-Efficient Robust Predictive Analytics Using Big Data [0.0]
This study navigates through the challenges of data uncertainties, storage limitations, and predictive data-driven modeling using big data.
We utilize Robust Principal Component Analysis (RPCA) for effective noise reduction and outlier elimination, and Optimal Sensor Placement (OSP) for efficient data compression and storage.
arXiv Detail & Related papers (2024-03-27T22:39:08Z)
- Diffusion-based Time Series Data Imputation for Microsoft 365 [35.16965409097466]
We focus on enhancing data quality through data imputation by the proposed Diffusion+, a sample-efficient diffusion model.
Our experiments and application practice show that our model contributes to improving the performance of the downstream failure prediction task.
arXiv Detail & Related papers (2023-08-03T10:25:17Z)
- Towards Learned Predictability of Storage Systems [0.0]
Storage systems have become a fundamental building block of datacenters.
Despite the growing popularity of and interest in storage, designing and implementing reliable storage systems remains challenging.
To move towards predictability of storage systems, various mechanisms and field studies have been proposed in the past few years.
Based on three representative research works, we discuss where and how machine learning should be applied in this field.
arXiv Detail & Related papers (2023-07-30T17:53:08Z)
- Measuring and Mitigating Local Instability in Deep Neural Networks [23.342675028217762]
We study how the predictions of a model change, even when it is retrained on the same data, as a consequence of stochasticity in the training process.
For Natural Language Understanding (NLU) tasks, we find instability in predictions for a significant fraction of queries.
We propose new data-centric methods that exploit our local stability estimates.
arXiv Detail & Related papers (2023-05-18T00:34:15Z)
- Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced Dataset and Benchmark [62.997667081978825]
The paper introduces a new dataset to assess the performance of machine learning algorithms in the prediction of the seriousness of injury in a traffic accident.
The dataset is created by aggregating publicly available datasets from the UK Department for Transport.
arXiv Detail & Related papers (2022-05-20T21:15:26Z)
- DRAM Failure Prediction in AIOps: Empirical Evaluation, Challenges and Opportunities [17.21846133804582]
This paper presents a comprehensive empirical evaluation of diverse machine learning techniques for DRAM failure prediction.
We first formulate the problem as a multi-class classification task and exhaustively evaluate seven popular/state-of-the-art classifiers on both the individual and multiple data sources.
We then formulate the problem as an unsupervised anomaly detection task and evaluate three state-of-the-art anomaly detectors.
arXiv Detail & Related papers (2021-04-30T15:20:22Z)
- Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users.
We propose a framework for anomaly detection in log data, as a major troubleshooting source of system information.
arXiv Detail & Related papers (2021-02-23T09:17:05Z)
- Improving Uncertainty Calibration via Prior Augmented Data [56.88185136509654]
Neural networks have proven successful at learning from complex data distributions by acting as universal function approximators.
They are often overconfident in their predictions, which leads to inaccurate and miscalibrated probabilistic predictions.
We propose a solution by seeking out regions of feature space where the model is unjustifiably overconfident, and conditionally raising the entropy of those predictions towards that of the prior distribution of the labels.
arXiv Detail & Related papers (2021-02-22T07:02:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.