Investigating Memory Failure Prediction Across CPU Architectures
- URL: http://arxiv.org/abs/2406.05354v1
- Date: Sat, 8 Jun 2024 05:10:23 GMT
- Title: Investigating Memory Failure Prediction Across CPU Architectures
- Authors: Qiao Yu, Wengui Zhang, Min Zhou, Jialiang Yu, Zhenli Sheng, Jasmin Bogatinovski, Jorge Cardoso, Odej Kao
- Abstract summary: We investigate the correlation between Correctable Errors (CEs) and Uncorrectable Errors (UEs) across different CPU architectures.
Our analysis identifies unique patterns of memory failure associated with each processor platform.
We conduct memory failure prediction on different processor platforms, achieving up to 15% improvement in F1-score compared to an existing algorithm.
- Score: 8.477622236186695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) highlight critical malfunctions in Dual Inline Memory Modules (DIMMs). Existing approaches primarily utilize Correctable Errors (CEs) to predict UEs, yet they typically neglect how these errors vary between different CPU architectures, especially in terms of Error Correction Code (ECC) applicability. In this paper, we investigate the correlation between CEs and UEs across different CPU architectures, including X86 and ARM. Our analysis identifies unique patterns of memory failure associated with each processor platform. Leveraging Machine Learning (ML) techniques on production datasets, we conduct memory failure prediction on different processor platforms, achieving up to 15% improvement in F1-score compared to an existing algorithm. Finally, an MLOps (Machine Learning Operations) framework is provided to consistently improve failure prediction in the production environment.
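The listing ships no code, so the following is a minimal sketch of the general CE-based UE prediction setup the abstract describes, built with scikit-learn on synthetic data. The per-DIMM features, the labels, and the RandomForest model are illustrative assumptions, not the authors' production pipeline.

```python
# Hedged sketch of CE-based UE prediction; all features, data, and the
# model choice are illustrative assumptions, not the paper's pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_dimms = 5000

# Hypothetical per-DIMM features aggregated over an observation window:
X = np.column_stack([
    rng.poisson(3.0, n_dimms),    # CE count in the window
    rng.integers(0, 8, n_dimms),  # distinct DRAM rows with errors
    rng.integers(0, 8, n_dimms),  # distinct DRAM columns with errors
    rng.random(n_dimms),          # CE burstiness score
])
# Synthetic label: DIMMs with many, spatially spread CEs are more
# likely to develop a UE (a stand-in for real production labels).
p = 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] + 0.4 * X[:, 1] - 4.0)))
y = rng.random(n_dimms) < p

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print("F1-score:", f1_score(y_te, clf.predict(X_te)))
```

The sketch reports F1-score rather than accuracy because that is the metric the paper uses to compare platforms.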
Related papers
- Kernel Approximation using Analog In-Memory Computing [3.5231018007564203]
Kernel functions are vital ingredients of several machine learning algorithms, but often incur significant memory and computational costs.
We introduce an approach to kernel approximation in machine learning algorithms suitable for mixed-signal Analog In-Memory Computing (AIMC) architectures. A generic software sketch of kernel approximation follows this entry.
arXiv Detail & Related papers (2024-11-05T16:18:47Z)
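As background for the kernel-approximation entry above, here is a plain-software sketch of the classic random Fourier features method (Rahimi and Recht). The paper's contribution lies in mapping such approximations onto mixed-signal AIMC hardware; this NumPy version does not model that hardware at all.

```python
# Random Fourier features: z(x) @ z(y) approximates the RBF kernel
# exp(-gamma * ||x - y||^2). A generic software sketch, not the
# paper's AIMC implementation.
import numpy as np

def rff_features(X, n_features=4096, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(5, 3))
Z = rff_features(X)
exact = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
print(np.max(np.abs(Z @ Z.T - exact)))  # small approximation error
```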
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study [5.292618442300404]
We present a comprehensive study on the correlation between CEs and UEs.
Our analysis reveals a strong correlation between large-temporal error bits and UE occurrence.
Our approach effectively reduces the number of virtual machine interruptions caused by UEs by approximately 59%. A toy correlation sketch follows this entry.
arXiv Detail & Related papers (2023-12-05T16:11:52Z)
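To make the correlation claim in the entry above concrete, this toy sketch computes a point-biserial correlation between a synthetic error-bit statistic and UE labels. Both the feature and the labels are placeholders; the study itself works on production telemetry.

```python
# Toy correlation between an error-bit statistic and UE occurrence;
# the data here are synthetic placeholders, not production telemetry.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)
n = 1000
error_bits = rng.poisson(2.0, n)  # error bits per DIMM per window
# Synthetic UE labels: more error bits, higher UE probability.
ue = rng.random(n) < 1.0 / (1.0 + np.exp(4.0 - error_bits))

r, p = pointbiserialr(ue.astype(int), error_bits)
print(f"point-biserial r = {r:.2f}, p = {p:.3g}")
```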
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms. A structural sketch of the decomposition follows this entry.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
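The sketch below mirrors only the structure of the two-step decomposition described in the entry above: a small fixed-size primitive for the computational core, plus high-level outer loops over blocks. It is plain NumPy for illustration, not the paper's JIT-generated Tensor Processing Primitives.

```python
# Structure-only sketch of primitive-plus-loops kernel design; the
# block size and NumPy micro-kernel are illustrative choices.
import numpy as np

BLK = 32  # block size; an arbitrary choice for this sketch

def block_matmul_primitive(A_blk, B_blk, C_blk):
    # The "computational core": one small fixed-size matmul.
    C_blk += A_blk @ B_blk

def matmul(A, B):
    # The "logical loops": iterate over blocks, delegating all
    # arithmetic to the primitive above.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, BLK):
        for j in range(0, N, BLK):
            for k in range(0, K, BLK):
                block_matmul_primitive(A[i:i + BLK, k:k + BLK],
                                       B[k:k + BLK, j:j + BLK],
                                       C[i:i + BLK, j:j + BLK])
    return C

A = np.ones((64, 64))
B = np.ones((64, 64))
print(np.allclose(matmul(A, B), A @ B))  # True
```

Separating the loops from the primitive is what lets the same logical loop nest be retargeted across CPU platforms, which is the portability argument the entry makes.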
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Incremental Online Learning Algorithms Comparison for Gesture and Visual Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs. A minimal online-learning example follows this entry.
arXiv Detail & Related papers (2022-09-01T17:05:20Z)
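As a generic stand-in for the incremental algorithms compared in the entry above, this sketch updates a scikit-learn SGDClassifier one batch at a time via partial_fit on a synthetic sensor stream. The data and the model are assumptions for illustration, not the paper's MCU deployments.

```python
# Incremental (online) learning on a synthetic stream; a generic
# illustration, not the paper's MCU-resident algorithms.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

for step in range(100):           # simulate a stream of sensor batches
    X = rng.normal(size=(16, 8))  # e.g. accelerometer feature windows
    y = (X[:, 0] + 0.1 * rng.normal(size=16) > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)  # one incremental update

print("weights learned online:", clf.coef_.round(2))
```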
- An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System [9.429605859159023]
Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound.
Memory-centric computing systems, with processing-in-memory capabilities, can alleviate this data movement bottleneck.
We implement several representative classic ML algorithms on a real-world general-purpose PIM architecture.
arXiv Detail & Related papers (2022-07-16T09:39:53Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices. A toy blockwise-update sketch follows this entry.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
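The following toy least-squares problem shows the block-coordinate idea in its simplest, synchronous form: each hypothetical device owns one block of coordinates, and each step updates only that block. The paper's algorithm is asynchronous and incremental; this sketch conveys only the blockwise update.

```python
# Cyclic block-coordinate gradient descent on least squares; a
# synchronous toy version of the blockwise updates, not the paper's
# asynchronous algorithm.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_devices = 200, 20, 4
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.01 * rng.normal(size=n)

blocks = np.array_split(np.arange(d), n_devices)  # coordinates per device
x = np.zeros(d)
lr = 1.0 / np.linalg.norm(A, 2) ** 2  # step size from the Lipschitz bound

for it in range(500):
    blk = blocks[it % n_devices]  # the next device's coordinate block
    grad = A.T @ (A @ x - b)      # full gradient (toy setting)
    x[blk] -= lr * grad[blk]      # update only that device's block

print("residual:", np.linalg.norm(A @ x - b))
```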
- Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning [53.73083199055093]
We show that attention-based architectures (e.g., Transformers) are fairly robust to distribution shifts.
Our experiments show that replacing convolutional networks with Transformers can greatly reduce catastrophic forgetting of previous devices.
arXiv Detail & Related papers (2021-06-10T21:04:18Z)
- Diagonal Memory Optimisation for Machine Learning on Micro-controllers [21.222568055417717]
Microcontrollers and low-power CPUs are increasingly being used to perform inference with machine learning models.
The small amount of RAM available on these targets limits the size of the models that can be executed.
A diagonal memory optimisation technique is described and shown to achieve memory savings of up to 34.5% when applied to eleven common models. A generic buffer-reuse sketch follows this entry.
arXiv Detail & Related papers (2020-10-04T19:45:55Z)
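The paper's diagonal schedule is more elaborate, but the underlying idea, letting outputs overwrite inputs once each input element has been consumed, can be shown with a minimal in-place example. This is a generic illustration of buffer reuse, not the paper's specific technique.

```python
# Generic in-place buffer reuse: element-wise ops can write outputs
# over their inputs, avoiding a second activation buffer on
# memory-constrained targets. Not the paper's diagonal schedule.
import numpy as np

def relu_in_place(buf):
    np.maximum(buf, 0.0, out=buf)  # output overwrites the input buffer
    return buf

activations = np.array([-1.5, 0.3, -0.2, 2.0])
relu_in_place(activations)  # no second array allocated
print(activations)          # [0.  0.3 0.  2. ]
```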
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.