Related papers: DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services

DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services

URL: http://arxiv.org/abs/2012.03665v1
Date: Wed, 25 Nov 2020 03:10:11 GMT
Title: DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services
Authors: Phuong Pham, Vivek Jain, Lukas Dauterman, Justin Ormont, Navendu Jain
Abstract summary: We introduce DeepTriage, an intelligent incident transfer service combining machine learning techniques. For highly impacted incidents, DeepTriage achieves F1 score from 76.3% - 91.3%. DeepTriage has been deployed in Azure since October 2017 and is used by thousands of teams daily.
Score: 5.418912231064684
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As cloud services are growing and generating high revenues, the cost of downtime in these services is becoming significantly expensive. To reduce loss and service downtime, a critical primary step is to execute incident triage, the process of assigning a service incident to the correct responsible team, in a timely manner. An incorrect assignment risks additional incident reroutings and increases its time to mitigate by 10x. However, automated incident triage in large cloud services faces many challenges: (1) a highly imbalanced incident distribution from a large number of teams, (2) wide variety in formats of input data or data sources, (3) scaling to meet production-grade requirements, and (4) gaining engineers' trust in using machine learning recommendations. To address these challenges, we introduce DeepTriage, an intelligent incident transfer service combining multiple machine learning techniques - gradient boosted classifiers, clustering methods, and deep neural networks - in an ensemble to recommend the responsible team to triage an incident. Experimental results on real incidents in Microsoft Azure show that our service achieves 82.9% F1 score. For highly impacted incidents, DeepTriage achieves F1 score from 76.3% - 91.3%. We have applied best practices and state-of-the-art frameworks to scale DeepTriage to handle incident routing for all cloud services. DeepTriage has been deployed in Azure since October 2017 and is used by thousands of teams daily.

Related papers

RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure [49.88201789074532]
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning.<n>We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure.
arXiv Detail & Related papers (2025-12-27T11:14:23Z)
X-lifecycle Learning for Cloud Incident Management using LLMs [18.076347758182067]
Incident management for large cloud services is a complex and tedious process. Recent advancements in large language models [LLMs] created opportunities to automatically generate contextual recommendations. In this paper, we demonstrate that augmenting additional contextual data from different stages of SDLC improves the performance.
arXiv Detail & Related papers (2024-02-15T06:19:02Z)
Dependency Aware Incident Linking in Large Cloud Systems [8.797638977934646]
We propose dependency-aware incident linking (DiLink) framework to improve the accuracy and coverage of incident links. We also propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes.
arXiv Detail & Related papers (2024-02-05T13:54:11Z)
Towards General and Efficient Online Tuning for Spark [55.30868031221838]
We present a general and efficient Spark tuning framework that can deal with the three issues simultaneously. We have implemented this framework as an independent cloud service, and applied it to the data platform in Tencent.
arXiv Detail & Related papers (2023-09-05T02:16:45Z)
Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models [18.46643617658214]
On-call engineers require significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence has resulted in state-of-the-art large language models like GPT-3.x. We do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and production incidents.
arXiv Detail & Related papers (2023-01-10T05:41:40Z)
Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization. Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference. This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training a RL policy in an edge computing digital twin environment. We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes. In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines. We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z)
Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context. We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
arXiv Detail & Related papers (2022-02-17T02:44:05Z)
On Improving Deep Learning Trace Analysis with System Call Arguments [1.3299507495084417]
Kernel traces are sequences of low-level events comprising a name and multiple arguments. We introduce a general approach to learning a representation of the event names along with their arguments using both embedding and encoding.
arXiv Detail & Related papers (2021-03-11T19:26:34Z)
Neural Knowledge Extraction From Cloud Service Incidents [13.86595381172654]
SoftNER is a framework for unsupervised knowledge extraction from service incidents. We build a novel multi-task learning based BiLSTM-CRF model. We show that the unsupervised machine learning based approach has a high precision of 0.96.
arXiv Detail & Related papers (2020-07-10T17:33:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.