TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on
Large-Scale Microservice Systems
- URL: http://arxiv.org/abs/2310.18740v1
- Date: Sat, 28 Oct 2023 15:49:00 GMT
- Title: TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on
Large-Scale Microservice Systems
- Authors: Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu,
Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan,
Qingwei Lin, Dongmei Zhang
- Abstract summary: Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems.
This paper proposes TraceDiag, an end-to-end RCA framework that addresses the challenges for large-scale microservice systems.
- Score: 44.53009495726297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the
reliability of microservice systems. However, performing RCA on modern
microservice systems can be challenging due to their large scale, as they
usually comprise hundreds of components, leading to significant human effort. This
paper proposes TraceDiag, an end-to-end RCA framework that addresses the
challenges for large-scale microservice systems. It leverages reinforcement
learning to learn a pruning policy for the service dependency graph to
automatically eliminate redundant components, thereby significantly improving
the RCA efficiency. The learned pruning policy is interpretable and fully
adaptive to new RCA instances. With the pruned graph, a causal-based method can
be executed with high accuracy and efficiency. The proposed TraceDiag framework
is evaluated on real data traces collected from the Microsoft Exchange system,
and demonstrates superior performance compared to state-of-the-art RCA
approaches. Notably, TraceDiag has been integrated as a critical component in
the Microsoft M365 Exchange, resulting in a significant improvement in the
system's reliability and a considerable reduction in the human effort required
for RCA.
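As a rough illustration of the pruning idea described in the abstract, the sketch below removes low-relevance services from a service dependency graph before any causal analysis would run. It is not TraceDiag's algorithm: the abstract does not specify the learned policy, so a fixed anomaly-score threshold stands in for it here. The function `prune_dependency_graph`, the service names, the scores, and the threshold are all hypothetical.

```python
from collections import deque


def prune_dependency_graph(edges, anomaly_score, entry, threshold):
    """Return the sub-graph reachable from `entry` through kept services.

    Assumption: a service is kept when its anomaly score is at least
    `threshold`. This simple rule is a stand-in for a learned pruning
    policy, which the abstract does not describe in detail.
    """
    kept = {entry}
    queue = deque([entry])
    while queue:
        service = queue.popleft()
        for callee in edges.get(service, []):
            # Skip services already kept or judged irrelevant by the rule.
            if callee not in kept and anomaly_score.get(callee, 0.0) >= threshold:
                kept.add(callee)
                queue.append(callee)
    # Keep only edges whose both endpoints survived pruning.
    return {s: [c for c in edges.get(s, []) if c in kept] for s in kept}


# Hypothetical call graph: caller -> list of callees.
edges = {
    "frontend": ["auth", "mailbox"],
    "auth": ["directory"],
    "mailbox": ["storage", "search"],
    "storage": [],
    "search": [],
    "directory": [],
}
# Hypothetical per-service anomaly scores in [0, 1].
anomaly_score = {
    "frontend": 0.9,
    "auth": 0.1,
    "mailbox": 0.8,
    "storage": 0.7,
    "search": 0.2,
    "directory": 0.95,
}

pruned = prune_dependency_graph(edges, anomaly_score, "frontend", 0.5)
print(sorted(pruned))  # ['frontend', 'mailbox', 'storage']
```

Because the rule is a readable condition on a per-service score, each pruning decision can be explained directly, which is the kind of interpretability the abstract refers to. Note that `directory` is dropped despite its high score, since its only caller was pruned; whether that behavior is desirable depends on the policy actually used.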
Related papers
- Online Multi-modal Root Cause Analysis [61.94987309148539]
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems.
Existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems.
We introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization.
arXiv Detail & Related papers (2024-10-13T21:47:36Z)
- CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems [22.00860661894853]
We propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data.
CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization.
arXiv Detail & Related papers (2024-06-28T07:46:51Z)
- LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis [32.816594249593955]
Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems.
We introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities.
We evaluate the quality of LEMMA-RCA by testing the performance of eight baseline methods on this dataset.
arXiv Detail & Related papers (2024-06-08T07:00:31Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- DANet: Enhancing Small Object Detection through an Efficient Deformable Attention Network [0.0]
We propose a comprehensive strategy by synergizing Faster R-CNN with cutting-edge methods.
By combining Faster R-CNN with Feature Pyramid Network, we enable the model to handle multi-scale features intrinsic to manufacturing environments.
A deformable network is used that adapts to the geometric variations of defects, enabling precise detection of even minuscule and complex features.
arXiv Detail & Related papers (2023-10-09T14:54:37Z)
- Automatic Root Cause Analysis via Large Language Models for Cloud Incidents [51.94361026233668]
We introduce RCACopilot, an on-call system empowered by a large language model for automating root cause analysis of cloud incidents.
RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative.
We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft.
arXiv Detail & Related papers (2023-05-25T06:44:50Z)
- Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
- LoRD-Net: Unfolded Deep Detection Network with Low-Resolution Receivers [104.01415343139901]
We propose a deep detector entitled LoRD-Net for recovering information symbols from one-bit measurements.
LoRD-Net has a task-based architecture dedicated to recovering the underlying signal of interest.
We evaluate the proposed receiver architecture for one-bit signal recovery in wireless communications.
arXiv Detail & Related papers (2021-02-05T04:26:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.