GALA: Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis?
- URL: http://arxiv.org/abs/2508.12472v1
- Date: Sun, 17 Aug 2025 19:12:05 GMT
- Title: GALA: Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis?
- Authors: Yifang Tian, Yaming Liu, Zichun Chong, Zihang Huang, Hans-Arno Jacobsen,
- Abstract summary: This paper introduces GALA, a novel framework for root cause analysis in microservice systems.<n> evaluated on an open-source benchmark, GALA achieves substantial improvements over state-of-the-art methods.<n>We show that GALA bridges the gap between automated failure diagnosis and practical incident resolution.
- Score: 9.394057684388027
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Root cause analysis (RCA) in microservice systems is challenging, requiring on-call engineers to rapidly diagnose failures across heterogeneous telemetry such as metrics, logs, and traces. Traditional RCA methods often focus on single modalities or merely rank suspect services, falling short of providing actionable diagnostic insights with remediation guidance. This paper introduces GALA, a novel multi-modal framework that combines statistical causal inference with LLM-driven iterative reasoning for enhanced RCA. Evaluated on an open-source benchmark, GALA achieves substantial improvements over state-of-the-art methods of up to 42.22% accuracy. Our novel human-guided LLM evaluation score shows GALA generates significantly more causally sound and actionable diagnostic outputs than existing methods. Through comprehensive experiments and a case study, we show that GALA bridges the gap between automated failure diagnosis and practical incident resolution by providing both accurate root cause identification and human-interpretable remediation guidance.
Related papers
- Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis [5.532586951580959]
We present a focused empirical evaluation that isolates an LLM's reasoning behavior.<n>We produce a labeled taxonomy of 16 common RCA reasoning failures and use an LLM-as-a-Judge for annotation.
arXiv Detail & Related papers (2026-01-29T18:23:26Z) - Hypothesize-Then-Verify: Speculative Root Cause Analysis for Microservices with Pathwise Parallelism [19.31110304702373]
SpecRCA is a speculative root cause analysis framework that adopts a textithypothesize-then-verify paradigm.<n>Preliminary experiments on the AIOps 2022 demonstrate that SpecRCA achieves superior accuracy and efficiency compared to existing approaches.
arXiv Detail & Related papers (2026-01-06T05:58:25Z) - End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning [52.12425911708585]
Deep-DxSearch is an agentic RAG system trained end-to-end with reinforcement learning (RL)<n>In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources.<n> Experiments demonstrate that our end-to-end RL training framework consistently outperforms prompt-engineering and training-free RAG approaches.
arXiv Detail & Related papers (2025-08-21T17:42:47Z) - Can Large Language Models Help Experimental Design for Causal Discovery? [94.66802142727883]
Large Language Model Guided Intervention Targeting (LeGIT) is a robust framework that effectively incorporates LLMs to augment existing numerical approaches for the intervention targeting in causal discovery.<n>LeGIT demonstrates significant improvements and robustness over existing methods and even surpasses humans.
arXiv Detail & Related papers (2025-03-03T03:43:05Z) - Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis [19.357332854860665]
A contemporary trend involves harnessing Large Language Models (LLMs) as automated agents for Root Cause Analysis (RCA)<n>We propose Flow-of-Action, a pioneering Standard Operation Procedure ( SOP) enhanced multi-agent system.<n>Compared to the ReAct method's 35.50% accuracy, our Flow-of-Action method achieves 64.01%, meeting the accuracy requirements for RCA in real-world systems.
arXiv Detail & Related papers (2025-02-12T09:07:25Z) - MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools.<n>Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses.<n>We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z) - PORCA: Root Cause Analysis with Partially Observed Data [15.007249208547885]
Root Cause Analysis (RCA) aims at identifying the underlying causes of system faults by uncovering and analyzing the causal structure from complex systems.
Previous studies implicitly assume a full observation of the system, which neglect the effect of partial observation.
We propose PORCA, a novel RCA framework which can explore reliable root causes under both unobserved confounders and unobserved heterogeneity.
arXiv Detail & Related papers (2024-07-08T12:31:12Z) - BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection [11.627235799040388]
We propose an end-to-end approach that integrates anomaly detection and root cause analysis.
BarO consistently outperforms state-of-the-art approaches in both anomaly detection and root cause analysis.
arXiv Detail & Related papers (2024-05-15T13:32:59Z) - Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z) - Causal Disentanglement Hidden Markov Model for Fault Diagnosis [55.90917958154425]
We propose a Causal Disentanglement Hidden Markov model (CDHM) to learn the causality in the bearing fault mechanism.
Specifically, we make full use of the time-series data and progressively disentangle the vibration signal into fault-relevant and fault-irrelevant factors.
To expand the scope of the application, we adopt unsupervised domain adaptation to transfer the learned disentangled representations to other working environments.
arXiv Detail & Related papers (2023-08-06T05:58:45Z) - Disentangled Causal Graph Learning for Online Unsupervised Root Cause
Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z) - Causal Inference-Based Root Cause Analysis for Online Service Systems
with Intervention Recognition [11.067832313491449]
In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition.
We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA)
The performance on a real-world dataset shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
arXiv Detail & Related papers (2022-06-13T01:45:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.