Dependency Aware Incident Linking in Large Cloud Systems
- URL: http://arxiv.org/abs/2403.18639v1
- Date: Mon, 5 Feb 2024 13:54:11 GMT
- Title: Dependency Aware Incident Linking in Large Cloud Systems
- Authors: Supriyo Ghosh, Karish Grover, Jimmy Wong, Chetan Bansal, Rakesh Namineni, Mohit Verma, Saravan Rajmohan
- Abstract summary: We propose the dependency-aware incident linking (DiLink) framework to improve the accuracy and coverage of incident links.
We also propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes.
- Score: 8.797638977934646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant reliability efforts, large-scale cloud services inevitably experience production incidents that can significantly impact service availability and customer satisfaction. Worse, in many cases one incident can lead to multiple downstream failures due to cascading effects that create several related incidents across different dependent services. Oftentimes, On-call Engineers (OCEs) examine these incidents in silos, which leads to a significant amount of manual toil and increases the overall time-to-mitigate of incidents. Therefore, developing efficient incident linking models is of paramount importance for grouping related incidents into clusters so as to quickly resolve major outages and reduce on-call fatigue. Existing incident linking methods mostly leverage textual and contextual information of incidents (e.g., title, description, severity, impacted components), thus failing to leverage the inter-dependencies between services. In this paper, we propose the dependency-aware incident linking (DiLink) framework, which leverages both textual and service dependency graph information to improve the accuracy and coverage of incident links not only coming from the same service, but also from different services and workloads. Furthermore, we propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes. Extensive experimental results on real-world incidents from 5 workloads of Microsoft demonstrate that our alignment method has an F1-score of 0.96 (14% gain over current state-of-the-art methods). We are also in the process of deploying this solution across 610 services from these 5 workloads to continuously support OCEs, improving incident management and reducing manual toil.
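The abstract names Orthogonal Procrustes as the tool for aligning textual and graph embeddings but does not spell out the computation. The following is a minimal sketch of the generic technique, not the DiLink implementation: given two embedding matrices with matching rows, it finds the orthogonal matrix R that minimizes the Frobenius norm of (source @ R - target) via an SVD of the cross-covariance. The function name, the array names, the dimensions, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def orthogonal_procrustes(source, target):
    """Return the orthogonal R minimizing ||source @ R - target||_F."""
    # d x d cross-covariance between the two embedding spaces.
    cross = source.T @ target
    # If cross = U S V^T, the optimal orthogonal map is U V^T.
    u, _, vt = np.linalg.svd(cross)
    return u @ vt


# Synthetic stand-ins (assumption): 50 incidents, 8-dimensional embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(50, 8))
# Build a random orthogonal matrix and use it to create a "graph" space
# that is an exact rotation/reflection of the "text" space.
true_map, _ = np.linalg.qr(rng.normal(size=(8, 8)))
graph_emb = text_emb @ true_map

r = orthogonal_procrustes(text_emb, graph_emb)
aligned = text_emb @ r
residual = np.linalg.norm(aligned - graph_emb)
```

Because R is constrained to be orthogonal, the map preserves distances and angles within the source space, so neighborhoods among the textual embeddings are unchanged after alignment. In this synthetic setup the two spaces differ only by an orthogonal transform, so the residual should be close to zero; real embeddings from different modalities would leave a nonzero residual.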
Related papers
- Learning Traffic Crashes as Language: Datasets, Benchmarks, and What-if Causal Analyses [76.59021017301127]
We propose a large-scale traffic crash language dataset, named CrashEvent, summarizing 19,340 real-world crash reports.
We formulate crash event feature learning as a novel text reasoning problem and fine-tune various large language models (LLMs) to predict detailed accident outcomes.
Our experimental results show that our LLM-based approach not only predicts the severity of accidents but also classifies different types of accidents and predicts injury outcomes.
arXiv Detail & Related papers (2024-06-16T03:10:16Z)
- Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models [53.50543146583101]
Fine-tuning large language models on small datasets can enhance their performance on specific downstream tasks.
Malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors.
We propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data.
arXiv Detail & Related papers (2024-06-12T18:33:11Z)
- Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides [39.29715168284971]
Service teams compile troubleshooting knowledge into Troubleshooting Guides (TSGs) accessible to on-call engineers (OCEs).
TSGs are often unstructured and incomplete, which requires manual interpretation by OCEs, leading to on-call fatigue and decreased productivity.
We propose Nissist which leverages TSGs and incident mitigation histories to provide proactive suggestions, reducing human intervention.
arXiv Detail & Related papers (2024-02-27T14:14:23Z)
- X-lifecycle Learning for Cloud Incident Management using LLMs [18.076347758182067]
Incident management for large cloud services is a complex and tedious process.
Recent advancements in large language models (LLMs) have created opportunities to automatically generate contextual recommendations.
In this paper, we demonstrate that augmenting with additional contextual data from different stages of the SDLC improves performance.
arXiv Detail & Related papers (2024-02-15T06:19:02Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes.
In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
- Graph-based Incident Aggregation for Large-Scale Online Service Systems [33.70557954446136]
We propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures.
A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations.
The proposed framework is evaluated with real-world incident data collected from a large-scale online service system of Huawei Cloud.
arXiv Detail & Related papers (2021-08-27T08:48:55Z)
- DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services [5.418912231064684]
We introduce DeepTriage, an intelligent incident transfer service combining machine learning techniques.
For highly impacted incidents, DeepTriage achieves an F1 score of 76.3% to 91.3%.
DeepTriage has been deployed in Azure since October 2017 and is used by thousands of teams daily.
arXiv Detail & Related papers (2020-11-25T03:10:11Z)
- Joint Constrained Learning for Event-Event Relation Extraction [94.3499255880101]
We propose a joint constrained learning framework for modeling event-event relations.
Specifically, the framework enforces logical constraints within and across multiple temporal and subevent relations.
We show that our joint constrained learning approach effectively compensates for the lack of jointly labeled data.
arXiv Detail & Related papers (2020-10-13T22:45:28Z)
- Neural Knowledge Extraction From Cloud Service Incidents [13.86595381172654]
SoftNER is a framework for unsupervised knowledge extraction from service incidents.
We build a novel multi-task learning based BiLSTM-CRF model.
We show that the unsupervised machine learning based approach has a high precision of 0.96.
arXiv Detail & Related papers (2020-07-10T17:33:07Z)
- Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
arXiv Detail & Related papers (2020-04-10T06:31:30Z)