Reproduction Research of FSA-Benchmark
- URL: http://arxiv.org/abs/2501.14739v3
- Date: Tue, 04 Feb 2025 20:42:40 GMT
- Title: Reproduction Research of FSA-Benchmark
- Authors: Joshua Ludolf, Yesmin Reyna-Hernandez, Matthew Trevino,
- Abstract summary: Fail-slow disks experience a gradual decline in performance before ultimately failing. Unlike outright disk failures, fail-slow conditions can go undetected for prolonged periods, leading to considerable impacts on system performance and user experience.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the current landscape of big data, the reliability and performance of storage systems are essential to the success of various applications and services. As data volumes continue to grow exponentially, the complexity and scale of the storage infrastructures needed to manage this data also increase. A significant challenge faced by data centers and storage systems is the detection and management of fail-slow disks, which experience a gradual decline in performance before ultimately failing. Unlike outright disk failures, fail-slow conditions can go undetected for prolonged periods, leading to considerable impacts on system performance and user experience.
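To make the fail-slow notion concrete, the sketch below shows one simple peer-comparison heuristic: a disk whose recent median I/O latency stays well above the fleet-wide median is flagged as fail-slow. This is a minimal illustration under assumed parameters (the `WINDOW` length, `SLOWDOWN_FACTOR` threshold, and class name are hypothetical), not the detection method evaluated by FSA-Benchmark.

```python
from collections import deque
from statistics import median

# Hypothetical peer-comparison heuristic for fail-slow detection.
# A disk is flagged when its median I/O latency over a sliding window
# exceeds the fleet-wide median by SLOWDOWN_FACTOR.
WINDOW = 60            # number of recent latency samples per disk (assumed)
SLOWDOWN_FACTOR = 3.0  # relative slowdown treated as "fail-slow" (assumed)

class FailSlowDetector:
    def __init__(self, disk_ids):
        self.history = {d: deque(maxlen=WINDOW) for d in disk_ids}

    def record(self, disk_id, latency_ms):
        """Record one I/O latency sample for a disk."""
        self.history[disk_id].append(latency_ms)

    def suspects(self):
        """Return disks whose recent latency is persistently above peers."""
        fleet = [l for h in self.history.values() for l in h]
        if not fleet:
            return []
        fleet_median = median(fleet)
        return [
            disk_id
            for disk_id, h in self.history.items()
            if len(h) == WINDOW and median(h) > SLOWDOWN_FACTOR * fleet_median
        ]

# Example: disk "d2" gradually degrades relative to its peers.
detector = FailSlowDetector(["d0", "d1", "d2"])
for t in range(WINDOW):
    detector.record("d0", 5.0)
    detector.record("d1", 5.2)
    detector.record("d2", 5.0 + 0.5 * t)   # slowly rising latency
print(detector.suspects())                  # -> ['d2']
```

A practical detector would also need to separate workload-induced slowdowns from genuine device degradation, which is what makes fail-slow detection difficult in real deployments.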
Related papers
- Evaluating Fault Tolerance and Scalability in Distributed File Systems: A Case Study of GFS, HDFS, and MinIO [0.9307293959047378]
Distributed File Systems (DFS) are essential for managing vast datasets across multiple servers, offering benefits in scalability, fault tolerance, and data accessibility.
This paper presents a comprehensive evaluation of three prominent DFSs - Google File System (GFS), Hadoop Distributed File System (HDFS), and MinIO.
Through detailed analysis, the paper assesses how these systems handle data redundancy, server failures, and client access protocols to ensure reliability in dynamic, large-scale environments.
arXiv Detail & Related papers (2025-02-04T03:52:45Z) - Towards Resource-Efficient Federated Learning in Industrial IoT for Multivariate Time Series Analysis [50.18156030818883]
Anomalies and missing data constitute a thorny problem in industrial applications.
Deep learning-enabled anomaly detection has emerged as a critical direction.
The data collected on edge devices contain privacy-sensitive user information.
arXiv Detail & Related papers (2024-11-06T15:38:31Z) - Digital Twin-Assisted Data-Driven Optimization for Reliable Edge Caching in Wireless Networks [60.54852710216738]
We introduce a novel digital twin-assisted optimization framework, called D-REC, to ensure reliable caching in nextG wireless networks.
By incorporating reliability modules into a constrained decision process, D-REC can adaptively adjust actions, rewards, and states to comply with advantageous constraints.
arXiv Detail & Related papers (2024-06-29T02:40:28Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - Design and Implementation of an Automated Disaster-recovery System for a Kubernetes Cluster Using LSTM [0.0]
This study introduces a system structure that integrates management platforms with backup and restoration tools.
The experimental results show that this system executes the restoration process within 15 s without human intervention, enabling rapid recovery.
arXiv Detail & Related papers (2024-02-05T12:00:31Z) - Towards Learned Predictability of Storage Systems [0.0]
Storage systems have become a fundamental building block of datacenters.
Despite the growing popularity and interests in storage, designing and implementing reliable storage systems remains challenging.
To move towards predictability of storage systems, various mechanisms and field studies have been proposed in the past few years.
Based on three representative research works, we discuss where and how machine learning should be applied in this field.
arXiv Detail & Related papers (2023-07-30T17:53:08Z) - Large-scale End-of-Life Prediction of Hard Disks in Distributed Datacenters [0.0]
Large-scale predictive analyses are performed using severely skewed health statistics data.
We present an encoder-decoder LSTM model in which the context gained from understanding health-statistics sequences aids in predicting an output sequence of the number of days remaining before a disk potentially fails.
arXiv Detail & Related papers (2023-03-15T21:55:07Z) - Challenges and Solutions to Build a Data Pipeline to Identify Anomalies in Enterprise System Performance [3.037408957267527]
We discuss challenges to harness data to operate our ML-based anomaly detection system.
We demonstrate that by addressing these data challenges, we not only improve the accuracy of our performance anomaly detection model by 30%, but also ensure that the model's performance does not degrade over time.
arXiv Detail & Related papers (2021-12-13T22:30:07Z) - Remaining Useful Life Estimation of Hard Disk Drives using Bidirectional LSTM Networks [0.0]
We introduce methods of extracting meaningful attributes associated with operational failure and of pre-processing health statistics data.
We use a Bidirectional LSTM with a multi-day look-back period to learn the temporal progression of health indicators and baseline it against vanilla LSTM and Random Forest models (a minimal illustrative sketch of this kind of model follows the list).
Our approach predicts the occurrence of disk failure with an accuracy of 96.4% on test data from 60 days before failure.
arXiv Detail & Related papers (2021-09-11T19:26:07Z) - An Analysis of Distributed Systems Syllabi With a Focus on Performance-Related Topics [65.86247008403002]
We analyze a dataset of 51 current (2019-2020) Distributed Systems syllabi from top Computer Science programs.
We study the scale of the infrastructure mentioned in DS courses, from small client-server systems to cloud-scale, peer-to-peer, global-scale systems.
arXiv Detail & Related papers (2021-03-02T16:49:09Z) - Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users.
We propose a framework for anomaly detection in log data, as a major troubleshooting source of system information.
arXiv Detail & Related papers (2021-02-23T09:17:05Z) - Data Mining with Big Data in Intrusion Detection Systems: A Systematic Literature Review [68.15472610671748]
Cloud computing has become a powerful and indispensable technology for complex, high performance and scalable computation.
The rapid rate and volume of data creation has begun to pose significant challenges for data management and security.
The design and deployment of intrusion detection systems (IDS) in the big data setting has, therefore, become a topic of importance.
arXiv Detail & Related papers (2020-05-23T20:57:12Z)
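Several of the entries above cast disk health as a sequence-modeling problem; the sketch below illustrates the Bidirectional LSTM idea referenced in the remaining-useful-life entry. It is a hypothetical minimal example: the feature count, look-back window, hidden size, and class name are assumptions, not values taken from any of the cited papers.

```python
import torch
import torch.nn as nn

# Hypothetical setup: each disk contributes a window of LOOKBACK_DAYS daily
# snapshots of NUM_FEATURES health attributes (e.g. SMART counters), and the
# label is whether the disk fails within the prediction horizon.
NUM_FEATURES = 12      # assumed number of health attributes per day
LOOKBACK_DAYS = 30     # assumed multi-day look-back window
HIDDEN_SIZE = 64       # assumed LSTM hidden width

class BiLSTMFailurePredictor(nn.Module):
    """Bidirectional LSTM mapping a health-statistics window to a failure
    probability. Illustrative only; not the architecture of the cited papers."""

    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=NUM_FEATURES,
            hidden_size=HIDDEN_SIZE,
            batch_first=True,
            bidirectional=True,
        )
        self.head = nn.Linear(2 * HIDDEN_SIZE, 1)  # 2x for both directions

    def forward(self, x):
        # x: (batch, LOOKBACK_DAYS, NUM_FEATURES)
        _, (h_n, _) = self.lstm(x)
        # h_n: (2, batch, HIDDEN_SIZE); concatenate forward and backward states
        feats = torch.cat([h_n[0], h_n[1]], dim=-1)
        return torch.sigmoid(self.head(feats)).squeeze(-1)

# Example forward pass on a random batch of 4 disk windows.
model = BiLSTMFailurePredictor()
windows = torch.randn(4, LOOKBACK_DAYS, NUM_FEATURES)
failure_prob = model(windows)
print(failure_prob.shape)  # torch.Size([4])
```

In practice such a model would be trained with a binary cross-entropy loss over labeled windows; the cited works report baselining sequence models of this kind against vanilla LSTM and Random Forest alternatives.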