Assess and Summarize: Improve Outage Understanding with Large Language
Models
- URL: http://arxiv.org/abs/2305.18084v1
- Date: Mon, 29 May 2023 13:36:19 GMT
- Title: Assess and Summarize: Improve Outage Understanding with Large Language
Models
- Authors: Pengxiang Jin, Shenglin Zhang, Minghua Ma, Haozhe Li, Yu Kang, Liqun
Li, Yudong Liu, Bo Qiao, Chaoyun Zhang, Pu Zhao, Shilin He, Federica Sarro,
Yingnong Dang, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
- Abstract summary: We present and empirically validate a novel approach (dubbed Oasis) to help on-call engineers assess and summarize cloud outages.
Oasis automatically assesses the impact scope of outages and produces human-readable summaries.
Results show that Oasis can summarize outages effectively and efficiently, and have led Microsoft to deploy its first prototype.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cloud systems have become increasingly popular in recent years due to their
flexibility and scalability. Each time cloud computing applications and
services hosted on the cloud are affected by a cloud outage, users can
experience slow response times, connection issues or total service disruption,
resulting in a significant negative business impact. Outages usually
comprise several concurrent events and source causes, so
understanding the context of an outage is a challenging yet crucial first
step toward mitigating and resolving it. In current practice, on-call
engineers with in-depth domain knowledge have to manually assess and summarize
outages when they happen, which is time-consuming and labor-intensive. In this
paper, we first present a large-scale empirical study investigating the way
on-call engineers currently deal with cloud outages at Microsoft, and then
present and empirically validate a novel approach (dubbed Oasis) to help the
engineers in this task. Oasis is able to automatically assess the impact scope
of outages as well as to produce human-readable summarization. Specifically,
Oasis first assesses the impact scope of an outage by aggregating relevant
incidents via multiple techniques. Then, it generates a human-readable summary
by leveraging fine-tuned large language models like GPT-3.x. The impact
assessment component of Oasis was introduced at Microsoft over three years ago
and is now widely adopted, while the outage summarization component has been
introduced only recently. In this article, we present the results of an empirical
evaluation we carried out on 18 real-world cloud systems as well as a
human-based evaluation with outage owners. The results show that Oasis can
effectively and efficiently summarize outages, and they have led Microsoft to
deploy its first prototype, which is currently under experimental adoption by
some of the incident teams.
Related papers
- Deep Learning-based 3D Point Cloud Classification: A Systematic Survey
and Outlook [12.014972829130764]
This paper introduces point cloud acquisition, characteristics, and challenges.
We review 3D data representations, storage formats, and commonly used datasets for point cloud classification.
arXiv Detail & Related papers (2023-11-05T09:28:43Z)
- Cloud-Native Computing: A Survey from the Perspective of Services [41.25934971576225]
Cloud-native computing is the most influential development principle for web applications.
This paper surveys key issues during the life-cycle of cloud-native applications from the perspective of services.
arXiv Detail & Related papers (2023-06-26T03:32:35Z)
- A Survey of Label-Efficient Deep Learning for 3D Point Clouds [109.07889215814589]
This paper presents the first comprehensive survey of label-efficient learning of point clouds.
We propose a taxonomy that organizes label-efficient learning methods based on the data prerequisites provided by different types of labels.
For each approach, we outline the problem setup and provide an extensive literature review that showcases relevant progress and challenges.
arXiv Detail & Related papers (2023-05-31T12:54:51Z)
- Recommending Root-Cause and Mitigation Steps for Cloud Incidents using
Large Language Models [18.46643617658214]
On-call engineers require a significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents.
Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x.
We do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents.
arXiv Detail & Related papers (2023-01-10T05:41:40Z)
- Measuring the Carbon Intensity of AI in Cloud Instances [91.28501520271972]
We provide a framework for measuring software carbon intensity, and propose to measure operational carbon emissions.
We evaluate a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform.
arXiv Detail & Related papers (2022-06-10T17:04:04Z)
- A Survey on Machine Learning for Geo-Distributed Cloud Data Center
Management [4.226118870861363]
Cloud service providers have been distributing data centers globally to reduce operating costs and improve quality of service.
Such large scale and complex orchestration of software workload and hardware resources remains a difficult problem to solve efficiently.
We review the state-of-the-art Machine Learning techniques for the cloud data center management problem.
arXiv Detail & Related papers (2022-05-17T03:14:54Z)
- Unsupervised Point Cloud Representation Learning with Deep Neural
Networks: A Survey [104.71816962689296]
Unsupervised point cloud representation learning has attracted increasing attention due to the constraint in large-scale point cloud labelling.
This paper provides a comprehensive review of unsupervised point cloud representation learning using deep neural networks.
arXiv Detail & Related papers (2022-02-28T07:46:05Z)
- Edge-Cloud Polarization and Collaboration: A Comprehensive Survey [61.05059817550049]
We conduct a systematic review for both cloud and edge AI.
We are the first to set up the collaborative learning mechanism for cloud and edge modeling.
We discuss potentials and practical experiences of some on-going advanced edge AI topics.
arXiv Detail & Related papers (2021-11-11T05:58:23Z)
- Auto-Split: A General Framework of Collaborative Edge-Cloud AI [49.750972428032355]
This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype of Huawei Cloud.
To the best of our knowledge, there is no existing industry product that provides the capability of Deep Neural Network (DNN) splitting.
arXiv Detail & Related papers (2021-08-30T08:03:29Z)
- Anomaly Detection in a Large-scale Cloud Platform [9.283888139549067]
Cloud computing is ubiquitous: more and more companies are moving their workloads into the cloud.
Service providers need to monitor the quality of their ever-growing offerings effectively.
We designed and implemented an automated monitoring system for the IBM Cloud Platform.
arXiv Detail & Related papers (2020-10-21T12:58:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.