Related papers: Exploring LLM-based Agents for Root Cause Analysis

Exploring LLM-based Agents for Root Cause Analysis

URL: http://arxiv.org/abs/2403.04123v1
Date: Thu, 7 Mar 2024 00:44:01 GMT
Title: Exploring LLM-based Agents for Root Cause Analysis
Authors: Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, Saravan Rajmohan
Abstract summary: Root cause analysis (RCA) is a critical part of the incident management process. Large Language Models (LLMs) have been used to perform RCA, but are not able to collect additional diagnostic information. We present an evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft.
Score: 17.053079105858497
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The growing complexity of cloud based software systems has resulted in incident management becoming an integral part of the software development lifecycle. Root cause analysis (RCA), a critical part of the incident management process, is a demanding task for on-call engineers, requiring deep domain knowledge and extensive experience with a team's specific services. Automation of RCA can result in significant savings of time, and ease the burden of incident management on on-call engineers. Recently, researchers have utilized Large Language Models (LLMs) to perform RCA, and have demonstrated promising results. However, these approaches are not able to dynamically collect additional diagnostic information such as incident related logs, metrics or databases, severely restricting their ability to diagnose root causes. In this work, we explore the use of LLM based agents for RCA to address this limitation. We present a thorough empirical evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft. Results show that ReAct performs competitively with strong retrieval and reasoning baselines, but with highly increased factual accuracy. We then extend this evaluation by incorporating discussions associated with incident reports as additional inputs for the models, which surprisingly does not yield significant performance improvements. Lastly, we conduct a case study with a team at Microsoft to equip the ReAct agent with tools that give it access to external diagnostic services that are used by the team for manual RCA. Our results show how agents can overcome the limitations of prior work, and practical considerations for implementing such a system in practice.

Related papers

Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM [13.293736787442414]
We introduce SynergyRCA, an innovative tool for root cause analysis.<n> SynergyRCA constructs a StateGraph to capture spatial and temporal relationships.<n>It can identify root causes in an average time of about two minutes and achieves an impressive precision of approximately 0.90.
arXiv Detail & Related papers (2025-06-03T06:09:13Z)
TAMO:Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data [33.5606443790794]
Large language models (LLMs) have made breakthroughs in contextual inference and domain knowledge integration. We propose a tool-assisted LLM agent with multi-modality observation data, namely TAMO, for fine-grained root cause analysis.
arXiv Detail & Related papers (2025-04-29T06:50:48Z)
Causal AI-based Root Cause Identification: Research to Practice at Scale [2.455633941531165]
We have developed a novel causality-based Root Cause Identification (RCI) algorithm that emphasizes causation over correlation. This paper highlights Instana's advanced failure diagnosis capabilities, discussing both the theoretical underpinnings and practical implementations of the RCI algorithm.
arXiv Detail & Related papers (2025-02-25T14:20:33Z)
AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs. Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
Anwendung von Causal-Discovery-Algorithmen zur Root-Cause-Analyse in der Fahrzeugmontage [0.2995925627097048]
Root Cause Analysis (RCA) is a quality management method that aims to systematically investigate and identify the cause-and-effect relationships of problems. In modern production processes, large amounts of data are collected. This publication demonstrates the application of Causal Discovery Algorithms (CDA) on data from the assembly of a leading automotive manufacturer.
arXiv Detail & Related papers (2024-07-23T11:22:33Z)
Root Cause Analysis In Microservice Using Neural Granger Causal Discovery [12.35924469567586]
We propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning. RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery. In addition, RUN incorporates Pagerank with a vector to efficiently recommend the top-k root causes.
arXiv Detail & Related papers (2024-02-02T04:43:06Z)
RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models [46.476439550746136]
Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently. We present RCAgent, a tool-augmented LLM autonomous agent framework for practical and privacy-aware industrial RCA usage. Running on an internally deployed model rather than GPT families, RCAgent is capable of free-form data collection and comprehensive analysis with tools.
arXiv Detail & Related papers (2023-10-25T03:53:31Z)
PyRCA: A Library for Metric-based Root Cause Analysis [66.72542200701807]
PyRCA is an open-source machine learning library of Root Cause Analysis (RCA) for Artificial Intelligence for IT Operations (AIOps) It provides a holistic framework to uncover the complicated metric causal dependencies and automatically locate root causes of incidents.
arXiv Detail & Related papers (2023-06-20T09:55:10Z)
Automatic Root Cause Analysis via Large Language Models for Cloud Incidents [51.94361026233668]
We introduce RCACopilot, an on-call system empowered by a large language model for automating root cause analysis of cloud incidents. RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft.
arXiv Detail & Related papers (2023-05-25T06:44:50Z)
Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes. In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context. We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
arXiv Detail & Related papers (2022-02-17T02:44:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.