DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services
- URL: http://arxiv.org/abs/2012.03665v1
- Date: Wed, 25 Nov 2020 03:10:11 GMT
- Title: DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services
- Authors: Phuong Pham, Vivek Jain, Lukas Dauterman, Justin Ormont, Navendu Jain
- Abstract summary: We introduce DeepTriage, an intelligent incident transfer service combining machine learning techniques.
For highly impacted incidents, DeepTriage achieves F1 scores from 76.3% to 91.3%.
DeepTriage has been deployed in Azure since October 2017 and is used by thousands of teams daily.
- Score: 5.418912231064684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As cloud services grow and generate high revenues, the cost of
downtime in these services is rising significantly. To reduce loss
and service downtime, a critical primary step is to execute incident triage,
the process of assigning a service incident to the correct responsible team, in
a timely manner. An incorrect assignment risks additional incident reroutings
and increases the incident's time to mitigate by 10x. However, automated incident triage
in large cloud services faces many challenges: (1) a highly imbalanced incident
distribution from a large number of teams, (2) wide variety in formats of input
data or data sources, (3) scaling to meet production-grade requirements, and
(4) gaining engineers' trust in using machine learning recommendations. To
address these challenges, we introduce DeepTriage, an intelligent incident
transfer service combining multiple machine learning techniques (gradient
boosted classifiers, clustering methods, and deep neural networks) in an
ensemble to recommend the responsible team to triage an incident. Experimental
results on real incidents in Microsoft Azure show that our service achieves
an F1 score of 82.9%. For highly impacted incidents, DeepTriage achieves F1 scores
from 76.3% to 91.3%. We have applied best practices and state-of-the-art
frameworks to scale DeepTriage to handle incident routing for all cloud
services. DeepTriage has been deployed in Azure since October 2017 and is used
by thousands of teams daily.
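
As a rough illustration of the ensemble idea described in the abstract, the sketch below combines per-team probabilities from several models by weighted averaging and recommends a team only when the combined confidence clears a threshold. It is a minimal sketch under stated assumptions, not the DeepTriage implementation: the function names, the weights, the team names, and the `min_confidence` threshold are all hypothetical, and the abstract does not say how the models are actually combined.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model output: maps a team name to a predicted probability.
Scores = Dict[str, float]


def combine_scores(model_scores: List[Scores], weights: List[float]) -> Scores:
    """Weighted average of per-team probabilities across models (assumed scheme)."""
    if not model_scores or len(model_scores) != len(weights):
        raise ValueError("need one weight per model and at least one model")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    combined: Scores = {}
    for scores, weight in zip(model_scores, weights):
        for team, prob in scores.items():
            combined[team] = combined.get(team, 0.0) + weight * prob / total_weight
    return combined


def recommend_team(
    model_scores: List[Scores],
    weights: List[float],
    min_confidence: float = 0.5,  # hypothetical threshold, not from the paper
) -> Optional[Tuple[str, float]]:
    """Return (team, confidence), or None when the ensemble is not confident."""
    combined = combine_scores(model_scores, weights)
    team, score = max(combined.items(), key=lambda item: item[1])
    return (team, score) if score >= min_confidence else None


if __name__ == "__main__":
    # Made-up outputs standing in for a boosted classifier, a clustering
    # method, and a neural network.
    boosted = {"storage": 0.7, "network": 0.3}
    clustering = {"storage": 0.5, "network": 0.5}
    neural = {"storage": 0.9, "network": 0.1}
    print(recommend_team([boosted, clustering, neural], weights=[1.0, 1.0, 2.0]))
```

Returning `None` below the threshold is one possible way to avoid showing low-confidence recommendations to engineers; whether DeepTriage does anything similar is not stated in the abstract.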
Related papers
- X-lifecycle Learning for Cloud Incident Management using LLMs [18.076347758182067]
Incident management for large cloud services is a complex and tedious process.
Recent advancements in large language models [LLMs] created opportunities to automatically generate contextual recommendations.
In this paper, we demonstrate that augmenting additional contextual data from different stages of SDLC improves the performance.
arXiv Detail & Related papers (2024-02-15T06:19:02Z)
- Dependency Aware Incident Linking in Large Cloud Systems [8.797638977934646]
We propose the dependency-aware incident linking (DiLink) framework to improve the accuracy and coverage of incident links.
We also propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes.
arXiv Detail & Related papers (2024-02-05T13:54:11Z)
- Towards General and Efficient Online Tuning for Spark [55.30868031221838]
We present a general and efficient Spark tuning framework that can deal with the three issues simultaneously.
We have implemented this framework as an independent cloud service, and applied it to the data platform in Tencent.
arXiv Detail & Related papers (2023-09-05T02:16:45Z)
- Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models [18.46643617658214]
On-call engineers require a significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents.
Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x.
We do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents.
arXiv Detail & Related papers (2023-01-10T05:41:40Z)
- Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes.
In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
- Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z)
- Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior.
The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
arXiv Detail & Related papers (2022-02-17T02:44:05Z)
- On Improving Deep Learning Trace Analysis with System Call Arguments [1.3299507495084417]
Kernel traces are sequences of low-level events comprising a name and multiple arguments.
We introduce a general approach to learning a representation of the event names along with their arguments using both embedding and encoding.
arXiv Detail & Related papers (2021-03-11T19:26:34Z)
- Neural Knowledge Extraction From Cloud Service Incidents [13.86595381172654]
SoftNER is a framework for unsupervised knowledge extraction from service incidents.
We build a novel multi-task learning based BiLSTM-CRF model.
We show that the unsupervised machine learning based approach has a high precision of 0.96.
arXiv Detail & Related papers (2020-07-10T17:33:07Z)