Related papers: X-lifecycle Learning for Cloud Incident Management using LLMs

X-lifecycle Learning for Cloud Incident Management using LLMs

URL: http://arxiv.org/abs/2404.03662v1
Date: Thu, 15 Feb 2024 06:19:02 GMT
Title: X-lifecycle Learning for Cloud Incident Management using LLMs
Authors: Drishti Goel, Fiza Husain, Aditya Singh, Supriyo Ghosh, Anjaly Parayil, Chetan Bansal, Xuchao Zhang, Saravan Rajmohan
Abstract summary: Incident management for large cloud services is a complex and tedious process. Recent advancements in large language models [LLMs] have created opportunities to automatically generate contextual recommendations. In this paper, we demonstrate that augmenting additional contextual data from different stages of the SDLC improves performance on two incident management tasks.
Score: 18.076347758182067
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Incident management for large cloud services is a complex and tedious process and requires a significant amount of manual effort from on-call engineers (OCEs). OCEs typically leverage data from different stages of the software development lifecycle [SDLC] (e.g., code, configuration, monitor data, service properties, service dependencies, troubleshooting documents, etc.) to generate insights for detecting, root-causing, and mitigating incidents. Recent advancements in large language models [LLMs] (e.g., ChatGPT, GPT-4, Gemini) have created opportunities to automatically generate contextual recommendations for OCEs, helping them quickly identify and mitigate critical issues. However, existing research typically takes a siloed view, solving a single task in incident management by leveraging data from a single stage of the SDLC. In this paper, we demonstrate that augmenting additional contextual data from different stages of the SDLC improves the performance of two critically important and practically challenging tasks: (1) automatically generating root cause recommendations for dependency failure related incidents, and (2) identifying the ontology of service monitors used for automatically detecting incidents. Using a dataset of 353 incidents and 260 monitors from Microsoft, we demonstrate that augmenting contextual information from different stages of the SDLC improves performance over state-of-the-art methods.

Related papers

TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data [33.5606443790794]
Large language models (LLMs) have made breakthroughs in contextual inference and domain knowledge integration. We propose a tool-assisted LLM agent with multi-modality observation data, namely TAMO, for fine-grained root cause analysis.
arXiv Detail & Related papers (2025-04-29T06:50:48Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Using Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL), Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations. These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z)
Exploring LLM-based Agents for Root Cause Analysis [17.053079105858497]
Root cause analysis (RCA) is a critical part of the incident management process. Large Language Models (LLMs) have been used to perform RCA, but are not able to collect additional diagnostic information. We present an evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft.
arXiv Detail & Related papers (2024-03-07T00:44:01Z)
Dependency Aware Incident Linking in Large Cloud Systems [8.797638977934646]
We propose the dependency-aware incident linking (DiLink) framework to improve the accuracy and coverage of incident links. We also propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes.
arXiv Detail & Related papers (2024-02-05T13:54:11Z)
DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge. Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
ESRO: Experience Assisted Service Reliability against Outages [2.647000585570866]
We build a diagnostic service called ESRO that recommends root causes and remediation for failures. We evaluate our model on several cloud service outages of a large enterprise over the course of 2 years.
arXiv Detail & Related papers (2023-09-13T18:04:52Z)
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent [123.75169211547149]
We propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools. AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
arXiv Detail & Related papers (2023-06-13T20:50:22Z)
Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models [18.46643617658214]
On-call engineers require a significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x. We do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents.
arXiv Detail & Related papers (2023-01-10T05:41:40Z)
Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes. In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users. We propose a framework for anomaly detection in log data, as a major troubleshooting source of system information.
arXiv Detail & Related papers (2021-02-23T09:17:05Z)
Anomaly Detection in Video via Self-Supervised and Multi-Task Learning [113.81927544121625]
Anomaly detection in video is a challenging computer vision problem. In this paper, we approach anomalous event detection in video through self-supervised and multi-task learning at the object level.
arXiv Detail & Related papers (2020-11-15T10:21:28Z)
Neural Knowledge Extraction From Cloud Service Incidents [13.86595381172654]
SoftNER is a framework for unsupervised knowledge extraction from service incidents. We build a novel multi-task learning based BiLSTM-CRF model. We show that the unsupervised machine learning based approach has a high precision of 0.96.
arXiv Detail & Related papers (2020-07-10T17:33:07Z)
Data Mining with Big Data in Intrusion Detection Systems: A Systematic Literature Review [68.15472610671748]
Cloud computing has become a powerful and indispensable technology for complex, high-performance, and scalable computation. The rapid rate and volume of data creation have begun to pose significant challenges for data management and security. The design and deployment of intrusion detection systems (IDS) in the big data setting has, therefore, become a topic of importance.
arXiv Detail & Related papers (2020-05-23T20:57:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.