Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis
- URL: http://arxiv.org/abs/2407.01710v1
- Date: Thu, 27 Jun 2024 10:25:37 GMT
- Title: Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis
- Authors: Shenglin Zhang, Sibo Xia, Wenzhao Fan, Binpeng Shi, Xiao Xiong, Zhenyu Zhong, Minghua Ma, Yongqian Sun, Dan Pei,
- Abstract summary: Independent deployment, decentralization, and frequent dynamic interactions introduce cascading failures.
These issues severely impact operation efficiency and user experience.
This survey provides a comprehensive review and primary analysis of 94 papers from 2003 to the present.
- Score: 10.92325792850306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern microservice systems have gained widespread adoption due to their high scalability, flexibility, and extensibility. However, the characteristics of independent deployment, decentralization, and frequent dynamic interactions also introduce the risk of cascading failures, making it challenging to achieve accurate failure diagnosis and rapid system recovery. These issues severely impact operation efficiency and user experience. Recognizing the crucial role of failure diagnosis in enhancing the stability and reliability of microservice systems, researchers have conducted extensive studies and achieved a series of significant outcomes. This survey provides a comprehensive review and primary analysis of 94 papers from 2003 to the present, including an overview of the fundamental concepts, a research framework, and problem statements. These insights aim to help researchers understand the latest research progress in failure diagnosis. Publicly available datasets, toolkits, and evaluation metrics are also compiled to assist practitioners in selecting and validating various techniques, providing a foundation to advance the domain beyond current practices.
Related papers
- Design-Reality Gap Analysis of Health Information Systems Failure [0.0]
This study investigates the factors contributing to the failure of Health Information Systems in a public hospital in South Africa.
Findings highlight several factors contributing to HIS failures, including system capacity constraints, inadequate IT risk management, and critical skills gaps.
This study underscores the importance of addressing design-reality gaps to improve HIS outcomes in public healthcare settings.
arXiv Detail & Related papers (2024-11-05T15:31:40Z) - Beyond One-Time Validation: A Framework for Adaptive Validation of Prognostic and Diagnostic AI-based Medical Devices [55.319842359034546]
Existing approaches often fall short in addressing the complexity of practically deploying these devices.
The presented framework emphasizes the importance of repeating validation and fine-tuning during deployment.
It is positioned within the current US and EU regulatory landscapes.
arXiv Detail & Related papers (2024-09-07T11:13:52Z) - TVDiag: A Task-oriented and View-invariant Failure Diagnosis Framework with Multimodal Data [11.373761837547852]
Microservice-based systems often suffer from reliability issues due to their intricate interactions and expanding scale.
Traditional failure diagnosis methods that use single-modal data can hardly cover all failure scenarios due to the restricted information.
We propose textitTVDiag, a multimodal failure diagnosis framework for locating culprit microservice instances and identifying their failure types.
arXiv Detail & Related papers (2024-07-29T05:26:57Z) - A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends [12.814440316872748]
This survey aims to provide a comprehensive, structured review of root cause analysis (RCA) techniques.
It explores methodologies that include metrics, traces, logs, and multi-model data.
arXiv Detail & Related papers (2024-07-23T11:02:49Z) - Comparative Benchmarking of Failure Detection Methods in Medical Image Segmentation: Unveiling the Role of Confidence Aggregation [0.723226140060364]
This paper introduces a comprehensive benchmarking framework aimed at evaluating failure detection methodologies within medical image segmentation.
We identify the strengths and limitations of current failure detection metrics, advocating for the risk-coverage analysis as a holistic evaluation approach.
arXiv Detail & Related papers (2024-06-05T14:36:33Z) - A Foundational Framework and Methodology for Personalized Early and
Timely Diagnosis [84.6348989654916]
We propose the first foundational framework for early and timely diagnosis.
It builds on decision-theoretic approaches to outline the diagnosis process.
It integrates machine learning and statistical methodology for estimating the optimal personalized diagnostic path.
arXiv Detail & Related papers (2023-11-26T14:42:31Z) - Validating polyp and instrument segmentation methods in colonoscopy through Medico 2020 and MedAI 2021 Challenges [58.32937972322058]
"Medico automatic polyp segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image (MedAI 2021)" competitions.
We present a comprehensive summary and analyze each contribution, highlight the strength of the best-performing methods, and discuss the possibility of clinical translations of such methods into the clinic.
arXiv Detail & Related papers (2023-07-30T16:08:45Z) - Understanding metric-related pitfalls in image analysis validation [59.15220116166561]
This work provides the first comprehensive common point of access to information on pitfalls related to validation metrics in image analysis.
Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy.
arXiv Detail & Related papers (2023-02-03T14:57:40Z) - A Domain-Agnostic Approach for Characterization of Lifelong Learning
Systems [128.63953314853327]
"Lifelong Learning" systems are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability.
We show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems.
arXiv Detail & Related papers (2023-01-18T21:58:54Z) - Human readable network troubleshooting based on anomaly detection and
feature scoring [11.593495085674343]
We present a system based on (i) unsupervised learning methods for detecting anomalies in the time domain, (ii) an attention mechanism to rank features in the feature space and (iii) an expert knowledge module.
We thoroughly evaluate the performance of the full system and of its individual building blocks.
arXiv Detail & Related papers (2021-08-26T14:20:36Z) - Clinical Outcome Prediction from Admission Notes using Self-Supervised
Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks.
Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets.
We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.