Exploring LLM-based Agents for Root Cause Analysis
- URL: http://arxiv.org/abs/2403.04123v1
- Date: Thu, 7 Mar 2024 00:44:01 GMT
- Title: Exploring LLM-based Agents for Root Cause Analysis
- Authors: Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, Saravan Rajmohan
- Abstract summary: Root cause analysis (RCA) is a critical part of the incident management process.
Large Language Models (LLMs) have been used to perform RCA, but are not able to collect additional diagnostic information.
We present an evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft.
- Score: 17.053079105858497
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The growing complexity of cloud-based software systems has resulted in
incident management becoming an integral part of the software development
lifecycle. Root cause analysis (RCA), a critical part of the incident
management process, is a demanding task for on-call engineers, requiring deep
domain knowledge and extensive experience with a team's specific services.
Automation of RCA can result in significant savings of time, and ease the
burden of incident management on on-call engineers. Recently, researchers have
utilized Large Language Models (LLMs) to perform RCA, and have demonstrated
promising results. However, these approaches are not able to dynamically
collect additional diagnostic information such as incident-related logs,
metrics or databases, severely restricting their ability to diagnose root
causes. In this work, we explore the use of LLM-based agents for RCA to address
this limitation. We present a thorough empirical evaluation of a ReAct agent
equipped with retrieval tools, on an out-of-distribution dataset of production
incidents collected at Microsoft. Results show that ReAct performs
competitively with strong retrieval and reasoning baselines, but with
substantially higher factual accuracy. We then extend this evaluation by incorporating
discussions associated with incident reports as additional inputs for the
models, which surprisingly does not yield significant performance improvements.
Lastly, we conduct a case study with a team at Microsoft to equip the ReAct
agent with tools that give it access to external diagnostic services that are
used by the team for manual RCA. Our results show how agents can overcome the
limitations of prior work, and highlight practical considerations for deploying
such a system.
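The agent setup described in the abstract (a ReAct loop that alternates reasoning with calls to retrieval tools) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the incident records, the `search_incidents` tool, and the `scripted_policy` function standing in for the LLM are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical historical incidents (assumption: invented for illustration).
HISTORY = [
    {"id": "INC-1", "title": "Checkout API latency spike",
     "root_cause": "connection pool exhaustion"},
    {"id": "INC-2", "title": "Login failures after deploy",
     "root_cause": "expired TLS certificate"},
]


def search_incidents(query: str) -> str:
    """Toy retrieval tool: pick the past incident with most shared title words."""
    words = set(query.lower().split())

    def overlap(incident: dict) -> int:
        return len(words & set(incident["title"].lower().split()))

    best = max(HISTORY, key=overlap)
    if overlap(best) == 0:
        return "no similar incident found"
    return f'{best["id"]}: {best["title"]} (root cause: {best["root_cause"]})'


# Tool registry the agent can call by name.
TOOLS: Dict[str, Callable[[str], str]] = {"search_incidents": search_incidents}


@dataclass
class Step:
    thought: str        # free-text reasoning
    action: str         # tool name, or "finish"
    action_input: str   # tool argument, or the final answer


def scripted_policy(incident: str, scratchpad: List[Tuple[Step, str]]) -> Step:
    """Stand-in for the LLM (assumption): a real agent would prompt a model
    with the incident and the scratchpad of prior steps and observations."""
    if not scratchpad:
        return Step("Look for similar past incidents.", "search_incidents", incident)
    last_observation = scratchpad[-1][1]
    if "root cause: " in last_observation:
        cause = last_observation.split("root cause: ")[1].rstrip(")")
        return Step("A similar incident suggests a cause.", "finish", cause)
    return Step("No supporting evidence was retrieved.", "finish", "unknown")


def react_rca(incident: str, policy=scripted_policy, max_steps: int = 5) -> str:
    """Run the thought -> action -> observation loop until the policy finishes."""
    scratchpad: List[Tuple[Step, str]] = []
    for _ in range(max_steps):
        step = policy(incident, scratchpad)
        if step.action == "finish":
            return step.action_input
        observation = TOOLS[step.action](step.action_input)
        scratchpad.append((step, observation))
    return "unknown"
```

Calling `react_rca("Latency spike on checkout API")` retrieves the toy record INC-1 and returns its recorded cause. In a real deployment the policy would be an LLM call and the tools would query actual incident stores or diagnostic services.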
Related papers
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- Anwendung von Causal-Discovery-Algorithmen zur Root-Cause-Analyse in der Fahrzeugmontage (Application of Causal Discovery Algorithms to Root Cause Analysis in Vehicle Assembly) [0.2995925627097048]
Root Cause Analysis (RCA) is a quality management method that aims to systematically investigate and identify the cause-and-effect relationships of problems.
In modern production processes, large amounts of data are collected.
This publication demonstrates the application of Causal Discovery Algorithms (CDA) on data from the assembly of a leading automotive manufacturer.
arXiv Detail & Related papers (2024-07-23T11:22:33Z)
- Root Cause Analysis In Microservice Using Neural Granger Causal Discovery [12.35924469567586]
We propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning.
RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery.
In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes.
arXiv Detail & Related papers (2024-02-02T04:43:06Z)
- RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models [46.476439550746136]
Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently.
We present RCAgent, a tool-augmented LLM autonomous agent framework for practical and privacy-aware industrial RCA usage.
Running on an internally deployed model rather than GPT families, RCAgent is capable of free-form data collection and comprehensive analysis with tools.
arXiv Detail & Related papers (2023-10-25T03:53:31Z)
- PyRCA: A Library for Metric-based Root Cause Analysis [66.72542200701807]
PyRCA is an open-source machine learning library for Root Cause Analysis (RCA) in Artificial Intelligence for IT Operations (AIOps).
It provides a holistic framework to uncover the complicated metric causal dependencies and automatically locate root causes of incidents.
arXiv Detail & Related papers (2023-06-20T09:55:10Z)
- Automatic Root Cause Analysis via Large Language Models for Cloud Incidents [51.94361026233668]
We introduce RCACopilot, an on-call system empowered by a large language model for automating root cause analysis of cloud incidents.
RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative.
We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft.
arXiv Detail & Related papers (2023-05-25T06:44:50Z)
- Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes.
In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
- Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior.
The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
arXiv Detail & Related papers (2022-02-17T02:44:05Z)
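The RUN entry above mentions recommending top-k root causes with PageRank and a vector. One common way to do this kind of ranking is personalized PageRank by power iteration, sketched below. This is an illustrative sketch and not RUN's code: the service graph, the edge direction (from a service to the services it depends on), and the anomaly weights used as the personalization vector are all assumptions.

```python
from typing import Dict, List, Tuple


def personalized_pagerank(
    edges: List[Tuple[str, str]],
    personalization: Dict[str, float],
    damping: float = 0.85,
    iters: int = 100,
) -> Dict[str, float]:
    """Power iteration for PageRank whose restart distribution is the
    (normalized) personalization vector instead of the uniform one."""
    nodes = sorted({n for edge in edges for n in edge} | set(personalization))
    out_links: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    total = sum(personalization.values())
    restart = {n: personalization.get(n, 0.0) / total for n in nodes}

    rank = dict(restart)
    for _ in range(iters):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = damping * rank[n] / len(out_links[n])
                for m in out_links[n]:
                    new_rank[m] += share
            else:
                # Dangling node: send its mass back through the restart vector.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * restart[m]
        rank = new_rank
    return rank


def top_k(rank: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-ranked nodes, best first."""
    return sorted(rank, key=rank.get, reverse=True)[:k]


# Hypothetical dependency graph: an edge points from a service to a dependency.
EDGES = [("frontend", "api"), ("api", "db"), ("frontend", "cache")]
# Hypothetical anomaly weights used as the personalization vector.
ANOMALY = {"frontend": 0.2, "api": 0.3, "db": 0.5}

scores = personalized_pagerank(EDGES, ANOMALY)
candidates = top_k(scores, 2)
```

With these toy inputs the ranking favors `db`, which is both the most anomalous node and the deepest dependency, followed by `api`.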