Automated Root Causing of Cloud Incidents using In-Context Learning with
GPT-4
- URL: http://arxiv.org/abs/2401.13810v1
- Date: Wed, 24 Jan 2024 21:02:07 GMT
- Title: Automated Root Causing of Cloud Incidents using In-Context Learning with
GPT-4
- Authors: Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu
Kang, Saravan Rajmohan
- Abstract summary: Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud services.
GPT-4 model's immense size presents challenges when trying to fine-tune it on user data.
We propose an in-context learning approach for automated root causing, which eliminates the need for fine-tuning.
- Score: 23.856839017006386
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis
process for cloud services, requiring on-call engineers to identify the primary
issues and implement corrective actions to prevent future recurrences.
Improving the incident RCA process is vital for minimizing service downtime,
customer impact and manual toil. Recent advances in artificial intelligence
have introduced state-of-the-art Large Language Models (LLMs) like GPT-4, which
have proven effective in tackling various AIOps problems, ranging from code
authoring to incident management. Nonetheless, the GPT-4 model's immense size
presents challenges when trying to fine-tune it on user data because of the
significant GPU resource demand and the necessity for continuous model
fine-tuning with the emergence of new data. To address the high cost of
fine-tuning LLMs, we propose an in-context learning approach for automated root
causing, which eliminates the need for fine-tuning. We conduct an extensive study
over 100,000 production incidents, comparing several large language models
using multiple metrics. The results reveal that our in-context learning
approach outperforms the previous fine-tuned large language models such as
GPT-3 by an average of 24.8% across all metrics, with an impressive 49.7%
improvement over the zero-shot model. Moreover, human evaluation involving
actual incident owners demonstrates its superiority over the fine-tuned model,
achieving a 43.5% improvement in correctness and an 8.7% enhancement in
readability. The impressive results demonstrate the viability of utilizing a
vanilla GPT model for the RCA task, thereby avoiding the high computational and
maintenance costs associated with a fine-tuned model.
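To make the paper's in-context learning idea more concrete, below is a minimal Python sketch of a retrieve-then-prompt workflow for RCA: fetch a few similar historical incidents, assemble them and their known root causes into a few-shot prompt, and send that prompt to a vanilla GPT-4 model instead of fine-tuning one. The toy bag-of-words retrieval, prompt template, and incident data are illustrative assumptions rather than the authors' exact pipeline, and the final GPT-4 call is left as a placeholder.

```python
# Minimal sketch of retrieval-augmented in-context prompting for incident RCA.
# The retrieval step, prompt template, and example data are assumptions made
# for illustration; they are not the paper's actual implementation.

from dataclasses import dataclass
from typing import List
from collections import Counter
import math


@dataclass
class Incident:
    title: str
    summary: str
    root_cause: str = ""  # known for historical incidents, empty for a new one


def bag_of_words(text: str) -> Counter:
    """Toy lexical 'embedding'; a real system would use a semantic encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_similar(new: Incident, history: List[Incident], k: int = 3) -> List[Incident]:
    """Pick the k most similar past incidents to use as in-context examples."""
    query = bag_of_words(new.title + " " + new.summary)
    ranked = sorted(history,
                    key=lambda h: cosine(query, bag_of_words(h.title + " " + h.summary)),
                    reverse=True)
    return ranked[:k]


def build_prompt(new: Incident, examples: List[Incident]) -> str:
    """Few-shot prompt: retrieved incidents with their root causes, then the new incident."""
    parts = ["You are an on-call engineer. Infer the most likely root cause.\n"]
    for ex in examples:
        parts.append(f"Incident: {ex.title}\nSummary: {ex.summary}\nRoot cause: {ex.root_cause}\n")
    parts.append(f"Incident: {new.title}\nSummary: {new.summary}\nRoot cause:")
    return "\n".join(parts)


if __name__ == "__main__":
    history = [
        Incident("API latency spike", "p99 latency doubled after deployment", "Bad config rollout"),
        Incident("Storage errors", "Blob writes failing in one region", "Quota exhaustion on storage account"),
        Incident("Login failures", "Auth tokens rejected intermittently", "Expired signing certificate"),
    ]
    new = Incident("Checkout latency spike", "p95 latency tripled after the evening deployment")
    prompt = build_prompt(new, retrieve_similar(new, history, k=2))
    print(prompt)  # This prompt would then be sent to a vanilla GPT-4 model (no fine-tuning).
```

The design point that matches the abstract is that all task adaptation happens in the prompt: new incidents only require refreshing the retrieval index of past incidents, not retraining or fine-tuning the model.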
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration [74.09687562334682]
We introduce a novel training data attribution method called Debias and Denoise Attribution (DDA).
Our method significantly outperforms existing approaches, achieving an averaged AUC of 91.64%.
DDA exhibits strong generality and scalability across various sources and different-scale models like LLaMA2, QWEN2, and Mistral.
arXiv Detail & Related papers (2024-10-02T07:14:26Z) - A Comprehensive Evaluation of Large Language Models on Mental Illnesses [0.8458496687170665]
GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets.
Prompt engineering played a crucial role in enhancing model performance.
Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering.
arXiv Detail & Related papers (2024-09-24T02:58:52Z) - A Critical Evaluation of AI Feedback for Aligning Large Language Models [60.42291111149438]
We show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines.
More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models.
arXiv Detail & Related papers (2024-02-19T18:53:54Z) - Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Large language models for aspect-based sentiment analysis [0.0]
We assess the performance of GPT-4 and GPT-3.5 in zero-shot, few-shot, and fine-tuned settings.
Fine-tuned GPT-3.5 achieves a state-of-the-art F1 score of 83.8 on the joint aspect term extraction and polarity classification task.
arXiv Detail & Related papers (2023-10-27T10:03:21Z) - Generalizable Error Modeling for Human Data Annotation: Evidence From an Industry-Scale Search Data Annotation Program [0.0]
This paper presents a predictive error model trained to detect potential errors in search relevance annotation tasks.
We show that errors can be predicted with moderate model performance (AUC=0.65-0.75) and that model performance generalizes well across applications.
We demonstrate the usefulness of the model in the context of auditing, where prioritizing tasks with high predicted error probabilities considerably increases the amount of corrected annotation errors.
arXiv Detail & Related papers (2023-10-08T21:21:19Z) - Scaling Relationship on Learning Mathematical Reasoning with Large
Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K, which significantly outperforms the supervised fine-tuning (SFT) accuracy of 35.9%.
arXiv Detail & Related papers (2023-08-03T15:34:01Z) - RLBoost: Boosting Supervised Models using Deep Reinforcement Learning [0.0]
We present RLBoost, an algorithm that uses deep reinforcement learning strategies to evaluate a particular dataset and obtain a model capable of estimating the quality of any new data.
Our results show that this model obtains better and more stable results than other state-of-the-art algorithms such as LOO, DataShapley, or DVRL.
arXiv Detail & Related papers (2023-05-23T14:38:33Z) - Recommending Root-Cause and Mitigation Steps for Cloud Incidents using
Large Language Models [18.46643617658214]
On-call engineers require a significant amount of domain knowledge and manual effort for root causing and mitigating production incidents.
Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x.
We conduct the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents.
arXiv Detail & Related papers (2023-01-10T05:41:40Z)