Related papers: Dependency Aware Incident Linking in Large Cloud Systems

URL: http://arxiv.org/abs/2403.18639v1
Date: Mon, 5 Feb 2024 13:54:11 GMT
Title: Dependency Aware Incident Linking in Large Cloud Systems
Authors: Supriyo Ghosh, Karish Grover, Jimmy Wong, Chetan Bansal, Rakesh Namineni, Mohit Verma, Saravan Rajmohan
Abstract summary: We propose the dependency-aware incident linking (DiLink) framework to improve the accuracy and coverage of incident links. We also propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes.
Score: 8.797638977934646
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite significant reliability efforts, large-scale cloud services inevitably experience production incidents that can significantly impact service availability and customer satisfaction. Worse, in many cases one incident can lead to multiple downstream failures due to cascading effects, which create several related incidents across different dependent services. Often, on-call engineers (OCEs) examine these incidents in silos, which leads to a significant amount of manual toil and increases the overall time to mitigate incidents. Therefore, developing efficient incident linking models is of paramount importance for grouping related incidents into clusters so as to quickly resolve major outages and reduce on-call fatigue. Existing incident linking methods mostly leverage textual and contextual information of incidents (e.g., title, description, severity, impacted components), thus failing to leverage the inter-dependencies between services. In this paper, we propose the dependency-aware incident linking (DiLink) framework, which leverages both textual and service dependency graph information to improve the accuracy and coverage of incident links, not only within the same service but also across different services and workloads. Furthermore, we propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes. Extensive experimental results on real-world incidents from 5 workloads of Microsoft demonstrate that our alignment method has an F1-score of 0.96 (14% gain over current state-of-the-art methods). We are also in the process of deploying this solution across 610 services from these 5 workloads to continuously support OCEs in improving incident management and reducing manual toil.
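The abstract names Orthogonal Procrustes as the tool for aligning textual and graph embeddings but gives no details of how DiLink applies it. The following is only a minimal sketch of the generic Orthogonal Procrustes solution via SVD, not the authors' implementation. The function name `procrustes_rotation`, the embedding shapes, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def procrustes_rotation(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F.

    Classic closed-form solution: take the SVD of source^T target = U S V^T
    and set R = U V^T.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Synthetic stand-ins (assumptions): 50 incidents, 8-dimensional embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(50, 8))                   # hypothetical textual embeddings
true_rot, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # a random orthogonal matrix
graph_emb = text_emb @ true_rot                       # hypothetical graph embeddings

# Learn the rotation that maps the textual space onto the graph space.
R = procrustes_rotation(text_emb, graph_emb)
aligned = text_emb @ R

# Frobenius-norm residual between aligned textual and graph embeddings.
residual = np.linalg.norm(aligned - graph_emb)
```

In this noise-free toy setup the recovered `R` matches `true_rot` up to floating-point error, so `residual` is close to zero. With real embeddings the two spaces differ by more than a rotation, and the residual measures how much structure the orthogonal map fails to capture.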

Related papers

Just-in-time Episodic Feedback Hinter: Leveraging Offline Knowledge to Improve LLM Agents Adaptation [77.90555621662345]
We present JEF Hinter, an agentic system that distills offline traces into compact, context-aware hints. A zooming mechanism highlights decisive steps in long trajectories, capturing both strategies and pitfalls. Experiments on MiniWoB++, WorkArena-L1, and WebArena-Lite show that JEF Hinter consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-10-05T21:34:42Z)
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets [64.96967819446553]
This paper investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. High similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Low similarity between these two types of datasets yields substantially more robust models and thus reduces harmfulness score by up to 10.33%.
arXiv Detail & Related papers (2025-06-05T17:59:55Z)
Breaking Focus: Contextual Distraction Curse in Large Language Models [68.4534308805202]
We investigate a critical vulnerability in Large Language Models (LLMs). This phenomenon arises when models fail to maintain consistent performance on questions modified with semantically coherent but irrelevant context. We propose an efficient tree-based search methodology to automatically generate CDV examples.
arXiv Detail & Related papers (2025-02-03T18:43:36Z)
Using Causality for Enhanced Prediction of Web Traffic Time Series [36.39678202395453]
We propose an effective neural network module, CCMPlus, designed to extract causal relationship features across services. Our method surpasses state-of-the-art approaches in Mean Squared Error (MSE) and Mean Absolute Error (MAE) for predicting service traffic time series.
arXiv Detail & Related papers (2025-02-02T00:36:40Z)
Federated Granger Causality Learning for Interdependent Clients with State Space Representation [0.6499759302108926]
We develop a federated approach to learning Granger causality. We propose augmenting the client models with the Granger causality information learned by the server. We also study the convergence of the framework to a centralized oracle model.
arXiv Detail & Related papers (2025-01-23T18:04:21Z)
QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory [66.01597794579568]
We introduce information bottleneck theory (IB) to model the problem. We propose a cross-attention-based approach to approximate mutual information in IB. Our method achieves a 25% increase in compression rate compared to the state-of-the-art.
arXiv Detail & Related papers (2024-08-20T02:44:45Z)
Learning Traffic Crashes as Language: Datasets, Benchmarks, and What-if Causal Analyses [76.59021017301127]
We propose a large-scale traffic crash language dataset, named CrashEvent, summarizing 19,340 real-world crash reports. We further formulate crash event feature learning as a novel text reasoning problem and fine-tune various large language models (LLMs) to predict detailed accident outcomes. Our experimental results show that our LLM-based approach not only predicts the severity of accidents but also classifies different types of accidents and predicts injury outcomes.
arXiv Detail & Related papers (2024-06-16T03:10:16Z)
Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models [53.50543146583101]
Fine-tuning large language models on small datasets can enhance their performance on specific downstream tasks. Malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors. We propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data.
arXiv Detail & Related papers (2024-06-12T18:33:11Z)
Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides [39.29715168284971]
Service teams compile troubleshooting knowledge into Troubleshooting Guides (TSGs) accessible to on-call engineers (OCEs). TSGs are often unstructured and incomplete, which requires manual interpretation by OCEs, leading to on-call fatigue and decreased productivity. We propose Nissist, which leverages TSGs and incident mitigation histories to provide proactive suggestions, reducing human intervention.
arXiv Detail & Related papers (2024-02-27T14:14:23Z)
X-lifecycle Learning for Cloud Incident Management using LLMs [18.076347758182067]
Incident management for large cloud services is a complex and tedious process. Recent advancements in large language models (LLMs) have created opportunities to automatically generate contextual recommendations. In this paper, we demonstrate that augmenting additional contextual data from different stages of the SDLC improves performance.
arXiv Detail & Related papers (2024-02-15T06:19:02Z)
FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment. We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes. In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
Graph-based Incident Aggregation for Large-Scale Online Service Systems [33.70557954446136]
We propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures. A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations. The proposed framework is evaluated with real-world incident data collected from a large-scale online service system of Huawei Cloud.
arXiv Detail & Related papers (2021-08-27T08:48:55Z)
DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services [5.418912231064684]
We introduce DeepTriage, an intelligent incident transfer service combining machine learning techniques. For highly impacted incidents, DeepTriage achieves an F1 score between 76.3% and 91.3%. DeepTriage has been deployed in Azure since October 2017 and is used by thousands of teams daily.
arXiv Detail & Related papers (2020-11-25T03:10:11Z)
Joint Constrained Learning for Event-Event Relation Extraction [94.3499255880101]
We propose a joint constrained learning framework for modeling event-event relations. Specifically, the framework enforces logical constraints within and across multiple temporal and subevent relations. We show that our joint constrained learning approach effectively compensates for the lack of jointly labeled data.
arXiv Detail & Related papers (2020-10-13T22:45:28Z)
Neural Knowledge Extraction From Cloud Service Incidents [13.86595381172654]
SoftNER is a framework for unsupervised knowledge extraction from service incidents. We build a novel multi-task learning based BiLSTM-CRF model. We show that the unsupervised machine learning based approach has a high precision of 0.96.
arXiv Detail & Related papers (2020-07-10T17:33:07Z)
Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input. In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities. We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
arXiv Detail & Related papers (2020-04-10T06:31:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.