Exploring Error Bits for Memory Failure Prediction: An In-Depth
Correlative Study
- URL: http://arxiv.org/abs/2312.02855v2
- Date: Mon, 18 Dec 2023 15:30:26 GMT
- Title: Exploring Error Bits for Memory Failure Prediction: An In-Depth
Correlative Study
- Authors: Qiao Yu, Wengui Zhang, Jorge Cardoso and Odej Kao
- Abstract summary: We present a comprehensive study on the correlation between CEs and UEs.
Our analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence.
Our approach effectively reduces the number of virtual machine interruptions caused by UEs by approximately 59%.
- Score: 5.292618442300404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In large-scale datacenters, memory failure is a common cause of server
crashes, with Uncorrectable Errors (UEs) being a major indicator of Dual Inline
Memory Module (DIMM) defects. Existing approaches primarily focus on predicting
UEs using Correctable Errors (CEs), without fully considering the information
provided by error bits. However, error bit patterns have a strong correlation
with the occurrence of UEs. In this paper, we present a comprehensive study on
the correlation between CEs and UEs, specifically emphasizing the importance of
spatio-temporal error bit information. Our analysis reveals a strong
correlation between spatio-temporal error bits and UE occurrence. Through
evaluations using real-world datasets, we demonstrate that our approach
significantly improves prediction performance by 15% in F1-score compared to
the state-of-the-art algorithms. Overall, our approach effectively reduces the
number of virtual machine interruptions caused by UEs by approximately 59%.
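The abstract's core idea is to predict UEs from spatio-temporal error-bit information in the CE history. As a minimal sketch of what such features might look like, the toy function below counts the distinct error bits and affected rows seen within a trailing time window; the `CE` record layout, field names, and thresholds are all assumptions for illustration, not the paper's actual feature set.

```python
from collections import namedtuple

# Hypothetical CE record: timestamp (seconds), DRAM row, and the set of
# corrupted bit positions reported for that correctable error.
CE = namedtuple("CE", ["ts", "row", "bits"])

def spatio_temporal_features(ces, window_s=3600):
    """Count distinct error bits and affected rows within the trailing
    time window -- a toy proxy for spatio-temporal error-bit indicators."""
    if not ces:
        return {"bit_count": 0, "row_count": 0}
    latest = max(ce.ts for ce in ces)
    recent = [ce for ce in ces if latest - ce.ts <= window_s]
    bits, rows = set(), set()
    for ce in recent:
        bits.update(ce.bits)
        rows.add(ce.row)
    return {"bit_count": len(bits), "row_count": len(rows)}

# Three CEs within one hour touching several bits and rows: a wider
# spread would feed a predictor as a signal of elevated UE risk.
events = [CE(0, 12, {3}), CE(1800, 12, {3, 5}), CE(3000, 40, {7})]
print(spatio_temporal_features(events))
```

A real predictor would combine many such window statistics (per bank, per column, per DQ pin) as model inputs.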
Related papers
- Investigating Memory Failure Prediction Across CPU Architectures [8.477622236186695]
We investigate the correlation between Correctable Errors (CEs) and Uncorrectable Errors (UEs) across different CPU architectures.
Our analysis identifies unique patterns of memory failure associated with each processor platform.
We conduct memory failure prediction on different processor platforms, achieving up to 15% improvement in F1-score compared to existing algorithms.
arXiv Detail & Related papers (2024-06-08T05:10:23Z)
- Average Calibration Error: A Differentiable Loss for Improved Reliability in Image Segmentation [17.263160921956445]
We propose to use marginal L1 average calibration error (mL1-ACE) as a novel auxiliary loss function to improve pixel-wise calibration without compromising segmentation quality.
We show that this loss, despite using hard binning, is directly differentiable, bypassing the need for approximate but differentiable surrogate or soft binning approaches.
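To make the calibration-error idea concrete, here is a plain (non-differentiable) sketch of hard-binned L1 average calibration error: average, over non-empty bins, of |mean confidence − accuracy|. The function name and binning details are illustrative assumptions; the paper's mL1-ACE is a differentiable marginal variant used as an auxiliary training loss.

```python
def l1_ace(confidences, correct, n_bins=10):
    """Average over non-empty confidence bins of |mean confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # hard binning
        bins[idx].append((c, ok))
    errs = []
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)   # mean confidence in bin
            acc = sum(ok for _, ok in b) / len(b)  # empirical accuracy in bin
            errs.append(abs(conf - acc))
    return sum(errs) / len(errs)

# Overconfident high bin and near-calibrated mid bin both contribute.
print(l1_ace([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0]))
```

The paper's contribution is showing that the hard-binned version can still be differentiated directly, avoiding soft-binning surrogates.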
arXiv Detail & Related papers (2024-03-11T14:31:03Z)
- BEC: Bit-Level Static Analysis for Reliability against Soft Errors [0.26107298043931204]
We propose a bit-level error coalescing (BEC) static program analysis to understand and improve program reliability against soft errors.
BEC analysis tracks each bit corruption in the register file and classifies the effect of the corruption by its semantics at compile time.
The proposed method is generic and not limited to a specific computer architecture.
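One intuition behind bit-level classification is that some bit corruptions are masked by later operations and can never reach the program output. The sketch below shows this for a single masking instruction (`and rd, rs, imm`): bits of the source register cleared by the immediate are dead, so a soft-error flip there is benign. This is a toy single-instruction illustration, not the BEC analysis itself, and the function names are assumptions.

```python
def masked_bits(mask, width=32):
    """Bit positions of the source register that an AND with `mask`
    clears -- a flip in these bits cannot affect the result."""
    return {b for b in range(width) if not (mask >> b) & 1}

def flip_is_benign(bit, mask):
    """Classify a single-bit flip in the source of 'and rd, rs, mask'."""
    return bit in masked_bits(mask)

# and r1, r0, 0xFF keeps only the low byte of r0:
print(flip_is_benign(12, 0xFF))  # bit 12 is masked out -> benign
print(flip_is_benign(3, 0xFF))   # bit 3 reaches the result -> not benign
```

A full analysis like BEC would propagate such per-bit liveness through the whole program at compile time.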
arXiv Detail & Related papers (2024-01-11T09:03:47Z)
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
- Fast and Accurate Error Simulation for CNNs against Soft Errors [64.54260986994163]
We present a framework for the reliability analysis of Convolutional Neural Networks (CNNs) via an error simulation engine.
These error models are defined based on the corruption patterns of the output of the CNN operators induced by faults.
We show that our methodology achieves about 99% accuracy in reproducing fault effects w.r.t. SASSIFI, and a speedup ranging from 44x up to 63x w.r.t. FI, which only implements a limited set of error models.
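The core idea of error-model-driven simulation is to corrupt an operator's output according to a probabilistic error model instead of injecting faults into the hardware. Below is a minimal sketch under assumed names: `inject` applies a caller-supplied corruption pattern to one element of an operator's output with a given probability. This illustrates the mechanism only, not the paper's engine or its fault models.

```python
import random

def inject(output, corrupt, rate, rng):
    """With probability `rate`, apply the corruption pattern `corrupt`
    to one randomly chosen element of a CNN operator's output."""
    out = list(output)
    if rng.random() < rate:
        idx = rng.randrange(len(out))
        out[idx] = corrupt(out[idx])
    return out

# Zero out one activation with certainty (rate=1.0), reproducibly seeded.
faulty = inject([1.0, 2.0, 3.0, 4.0], lambda v: 0.0, 1.0, random.Random(0))
print(faulty)
```

Real error models would encode the corruption patterns observed for each operator type under actual fault injection, which is what makes simulation both fast and faithful.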
arXiv Detail & Related papers (2022-06-04T19:45:02Z)
- Local Learning Matters: Rethinking Data Heterogeneity in Federated Learning [61.488646649045215]
Federated learning (FL) is a promising strategy for performing privacy-preserving, distributed learning with a network of clients (i.e., edge devices).
arXiv Detail & Related papers (2021-11-28T19:03:39Z) - Tightening the Approximation Error of Adversarial Risk with Auto Loss
Function Search [12.263913626161155]
A common type of evaluation is to approximate the adversarial risk of a model as a robustness indicator.
We propose AutoLoss-AR, the first method for searching for loss functions that tighten this approximation error.
The results demonstrate the effectiveness of the proposed methods.
arXiv Detail & Related papers (2021-11-09T11:47:43Z)
- Discriminative-Generative Dual Memory Video Anomaly Detection [81.09977516403411]
Recently, researchers have tried to use a few anomalies for video anomaly detection (VAD) instead of only normal data during the training process.
We propose a DiscRiminative-gEnerative duAl Memory (DREAM) anomaly detection model to take advantage of a few anomalies and solve data imbalance.
arXiv Detail & Related papers (2021-04-29T15:49:01Z)
- Collaborative Boundary-aware Context Encoding Networks for Error Map Prediction [65.44752447868626]
We propose collaborative boundary-aware context encoding networks, called AEP-Net, for the error map prediction task.
Specifically, we propose a collaborative feature transformation branch for better feature fusion between images and masks, and precise localization of error regions.
The AEP-Net achieves average DSCs of 0.8358 and 0.8164 on the error prediction task, and shows a high Pearson correlation coefficient of 0.9873.
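The DSC reported above is the Dice similarity coefficient, a standard overlap metric for segmentation masks: 2*|intersection| / (|pred| + |target|). A minimal sketch over flat binary masks:

```python
def dice(pred, target):
    """Dice similarity coefficient over binary masks:
    2 * |pred AND target| / (|pred| + |target|)."""
    inter = sum(p and t for p, t in zip(pred, target))
    return 2 * inter / (sum(pred) + sum(target))

# Half of each mask overlaps the other -> DSC = 0.5.
print(dice([1, 1, 0, 0], [1, 0, 1, 0]))
```

DSC ranges from 0 (no overlap) to 1 (identical masks), so values above 0.8, as reported for AEP-Net, indicate substantial agreement with the ground-truth error maps.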
arXiv Detail & Related papers (2020-06-25T12:42:01Z)
- An Investigation of Why Overparameterization Exacerbates Spurious Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior.
We show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt.
arXiv Detail & Related papers (2020-05-09T01:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.