Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides
- URL: http://arxiv.org/abs/2402.17531v2
- Date: Fri, 10 May 2024 11:57:46 GMT
- Title: Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides
- Authors: Kaikai An, Fangkai Yang, Junting Lu, Liqun Li, Zhixing Ren, Hao Huang, Lu Wang, Pu Zhao, Yu Kang, Hua Ding, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
- Abstract summary: Service teams compile troubleshooting knowledge into Troubleshooting Guides (TSGs) accessible to on-call engineers (OCEs).
TSGs are often unstructured and incomplete, which requires manual interpretation by OCEs, leading to on-call fatigue and decreased productivity.
We propose Nissist, which leverages TSGs and incident mitigation histories to provide proactive suggestions, reducing human intervention.
- Score: 39.29715168284971
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective incident management is pivotal for the smooth operation of enterprise-level cloud services. In order to expedite incident mitigation, service teams compile troubleshooting knowledge into Troubleshooting Guides (TSGs) accessible to on-call engineers (OCEs). While automated pipelines are enabled to resolve the most frequent and easy incidents, there still exist complex incidents that require OCEs' intervention. However, TSGs are often unstructured and incomplete, which requires manual interpretation by OCEs, leading to on-call fatigue and decreased productivity, especially among new-hire OCEs. In this work, we propose Nissist, which leverages TSGs and incident mitigation histories to provide proactive suggestions, reducing human intervention. Leveraging Large Language Models (LLMs), Nissist extracts insights from unstructured TSGs and historical incident mitigation discussions, forming a comprehensive knowledge base. Its multi-agent system design enhances proficiency in precisely discerning user queries, retrieving relevant information, and delivering systematic plans consecutively. Through our user case and experiment, we demonstrate that Nissist significantly reduces Time to Mitigate (TTM) in incident mitigation, alleviating operational burdens on OCEs and improving service reliability. Our demo is available at https://aka.ms/nissist_demo.
Related papers
- Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation [49.36436704082436]
How-to questions are integral to decision-making processes and require dynamic, step-by-step answers.
We propose Thread, a novel data organization paradigm aimed at enabling current systems to handle how-to questions more effectively.
arXiv Detail & Related papers (2024-06-19T09:14:41Z) - X-lifecycle Learning for Cloud Incident Management using LLMs [18.076347758182067]
Incident management for large cloud services is a complex and tedious process.
Recent advancements in large language models (LLMs) have created opportunities to automatically generate contextual recommendations.
In this paper, we demonstrate that augmenting additional contextual data from different stages of SDLC improves the performance.
arXiv Detail & Related papers (2024-02-15T06:19:02Z) - Dependency Aware Incident Linking in Large Cloud Systems [8.797638977934646]
We propose the dependency-aware incident linking (DiLink) framework to improve the accuracy and coverage of incident links.
We also propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes.
arXiv Detail & Related papers (2024-02-05T13:54:11Z) - Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks [46.94537691205153]
We introduce the Soft-Landing (SoLa) strategy to bridge the transferability gap between the pretrained encoder and the downstream tasks.
Our method effectively alleviates the task discrepancy problem with remarkable computational efficiency.
arXiv Detail & Related papers (2022-11-11T06:27:22Z) - AutoTSG: Learning and Synthesis for Incident Troubleshooting [6.297939852772734]
We conduct a large-scale empirical study of over 4K TSGs mapped to thousands of incidents.
We find that TSGs are widely used and help significantly reduce mitigation efforts.
We propose AutoTSG -- a novel framework for automation of TSGs to executable workflows by combining machine learning and program synthesis.
arXiv Detail & Related papers (2022-05-26T16:05:11Z) - Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes.
In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z) - Reducing Catastrophic Forgetting in Self Organizing Maps with Internally-Induced Generative Replay [67.50637511633212]
A lifelong learning agent is able to continually learn from potentially infinite streams of pattern sensory data.
One major historic difficulty in building agents that adapt is that neural systems struggle to retain previously-acquired knowledge when learning from new samples.
This problem is known as catastrophic forgetting (interference) and remains an unsolved problem in the domain of machine learning to this day.
arXiv Detail & Related papers (2021-12-09T07:11:14Z) - Graph-based Incident Aggregation for Large-Scale Online Service Systems [33.70557954446136]
We propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures.
A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations.
The proposed framework is evaluated with real-world incident data collected from a large-scale online service system of Huawei Cloud.
arXiv Detail & Related papers (2021-08-27T08:48:55Z) - Inspect, Understand, Overcome: A Survey of Practical Methods for AI Safety [54.478842696269304]
The use of deep neural networks (DNNs) in safety-critical applications is challenging due to numerous model-inherent shortcomings.
In recent years, a zoo of state-of-the-art techniques aiming to address these safety concerns has emerged.
Our paper addresses both machine learning experts and safety engineers.
arXiv Detail & Related papers (2021-04-29T09:54:54Z) - Neural Knowledge Extraction From Cloud Service Incidents [13.86595381172654]
SoftNER is a framework for unsupervised knowledge extraction from service incidents.
We build a novel multi-task learning based BiLSTM-CRF model.
We show that the unsupervised machine learning based approach has a high precision of 0.96.
arXiv Detail & Related papers (2020-07-10T17:33:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.