Related papers: TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems

URL: http://arxiv.org/abs/2310.18740v1
Date: Sat, 28 Oct 2023 15:49:00 GMT
Title: TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Authors: Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu, Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
Abstract summary: Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. This paper proposes TraceDiag, an end-to-end RCA framework that addresses the challenges for large-scale microservice systems.
Score: 44.53009495726297
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale, as they usually comprise hundreds of components, leading to significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses the challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph that automatically eliminates redundant components, thereby significantly improving the RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in the Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA.
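Illustration (not from the paper): the abstract describes pruning the service dependency graph before running root cause analysis on what remains. The sketch below shows that general shape only. It swaps in a hand-written threshold rule where TraceDiag uses a policy learned with reinforcement learning, and a simple "deepest slow component" ranking where TraceDiag uses a causal method. The service names, latency figures, threshold, and the function names `prune_graph` and `rank_candidates` are all hypothetical.

```python
from collections import deque

def prune_graph(edges, latency_ms, threshold_ms, entry):
    """Keep components reachable from `entry` whose latency meets the threshold.

    Hypothetical stand-in for a learned pruning policy: a single
    interpretable rule (latency >= threshold_ms) decides what stays.
    """
    kept = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in kept and latency_ms.get(child, 0.0) >= threshold_ms:
                kept.add(child)
                queue.append(child)
    # Rebuild the adjacency lists using only the surviving components.
    return {n: [c for c in edges.get(n, []) if c in kept] for n in kept}

def rank_candidates(pruned, latency_ms):
    """Rank leaf components of the pruned graph by latency, slowest first.

    A simple heuristic, not the causal method mentioned in the abstract.
    """
    leaves = [n for n, children in pruned.items() if not children]
    return sorted(leaves, key=lambda n: latency_ms.get(n, 0.0), reverse=True)

# Hypothetical call graph: caller -> list of callees.
edges = {
    "frontend": ["auth", "mailbox"],
    "auth": ["db"],
    "mailbox": ["store", "cache"],
    "store": [],
    "cache": [],
    "db": [],
}
# Hypothetical per-component latency observed during the incident.
latency_ms = {
    "frontend": 900.0, "auth": 20.0, "mailbox": 850.0,
    "store": 800.0, "cache": 5.0, "db": 15.0,
}

pruned = prune_graph(edges, latency_ms, threshold_ms=100.0, entry="frontend")
print(rank_candidates(pruned, latency_ms))  # prints ['store']
```

Pruning first means the ranking step only inspects three of the six components in this toy graph. On a graph with hundreds of components, that reduction is the efficiency gain the abstract refers to.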

Related papers

Causal AI-based Root Cause Identification: Research to Practice at Scale [2.455633941531165]
We have developed a novel causality-based Root Cause Identification (RCI) algorithm that emphasizes causation over correlation. This paper highlights Instana's advanced failure diagnosis capabilities, discussing both the theoretical underpinnings and practical implementations of the RCI algorithm.
arXiv Detail & Related papers (2025-02-25T14:20:33Z)
AI-in-the-Loop Sensing and Communication Joint Design for Edge Intelligence [65.29835430845893]
We propose a framework that enhances edge intelligence through AI-in-the-loop joint sensing and communication. A key contribution of our work is establishing an explicit relationship between validation loss and the system's tunable parameters. We show that our framework reduces communication energy consumption by up to 77 percent and sensing costs measured by the number of samples by up to 52 percent.
arXiv Detail & Related papers (2025-02-14T14:56:58Z)
RCAEval: A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data [13.68949728404533]
Root cause analysis (RCA) for microservice systems has gained significant attention in recent years. There is still no standard benchmark that includes large-scale datasets and supports comprehensive evaluation environments. We introduce RCAEval, an open-source benchmark that provides datasets and an evaluation environment for RCA in microservice systems.
arXiv Detail & Related papers (2024-12-22T13:30:02Z)
Online Multi-modal Root Cause Analysis [61.94987309148539]
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems. Existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems. We introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization.
arXiv Detail & Related papers (2024-10-13T21:47:36Z)
CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems [22.00860661894853]
We propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data. CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization.
arXiv Detail & Related papers (2024-06-28T07:46:51Z)
LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis [32.816594249593955]
Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. We introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities. We evaluate the quality of LEMMA-RCA by testing the performance of eight baseline methods on this dataset.
arXiv Detail & Related papers (2024-06-08T07:00:31Z)
Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
DANet: Enhancing Small Object Detection through an Efficient Deformable Attention Network [0.0]
We propose a comprehensive strategy by synergizing Faster R-CNN with cutting-edge methods. By combining Faster R-CNN with Feature Pyramid Network, we enable the model to handle multi-scale features intrinsic to manufacturing environments. A Deformable Net is used that contorts and conforms to the geometric variations of defects, bringing precision in detecting even minuscule and complex features.
arXiv Detail & Related papers (2023-10-09T14:54:37Z)
Automatic Root Cause Analysis via Large Language Models for Cloud Incidents [51.94361026233668]
We introduce RCACopilot, an on-call system empowered by a large language model for automating root cause analysis of cloud incidents. RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft.
arXiv Detail & Related papers (2023-05-25T06:44:50Z)
Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data. Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process. We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
LoRD-Net: Unfolded Deep Detection Network with Low-Resolution Receivers [104.01415343139901]
We propose a deep detector entitled LoRD-Net for recovering information symbols from one-bit measurements. LoRD-Net has a task-based architecture dedicated to recovering the underlying signal of interest. We evaluate the proposed receiver architecture for one-bit signal recovery in wireless communications.
arXiv Detail & Related papers (2021-02-05T04:26:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.