A Survey of AIOps for Failure Management in the Era of Large Language Models
- URL: http://arxiv.org/abs/2406.11213v4
- Date: Mon, 24 Jun 2024 02:39:17 GMT
- Title: A Survey of AIOps for Failure Management in the Era of Large Language Models
- Authors: Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, Zhonghai Wu, Xuming Hu, Philip S. Yu, Ying Li,
- Abstract summary: This paper presents a comprehensive survey of AIOps technology for failure management in the LLM era.
It includes a detailed definition of AIOps tasks for failure management, the data sources for AIOps, and the LLM-based approaches adopted for AIOps.
- Score: 60.59720351854515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As software systems grow increasingly intricate, Artificial Intelligence for IT Operations (AIOps) methods have been widely used in software system failure management to ensure the high availability and reliability of large-scale distributed software systems. However, these methods still face several challenges, such as lack of cross-platform generality and cross-task flexibility. Fortunately, recent advancements in large language models (LLMs) can significantly address these challenges, and many approaches have already been proposed to explore this field. However, there is currently no comprehensive survey that discusses the differences between LLM-based AIOps and traditional AIOps methods. Therefore, this paper presents a comprehensive survey of AIOps technology for failure management in the LLM era. It includes a detailed definition of AIOps tasks for failure management, the data sources for AIOps, and the LLM-based approaches adopted for AIOps. Additionally, this survey explores the AIOps subtasks, the specific LLM-based approaches suitable for different AIOps subtasks, and the challenges and future directions of the domain, aiming to further its development and application.
Related papers
- EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty.
We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications.
Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z) - ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [39.606908488885125]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs)
It features a controllable and diverse set of embodied tasks varying in different levels of difficulties and complexities.
Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z) - Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift.
We devise a series of experiments to empirically explain the performance gap.
arXiv Detail & Related papers (2024-09-27T05:06:43Z) - Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
We propose a LLM-based Generative IoT (GIoT) system deployed in the local network setting in this study.
arXiv Detail & Related papers (2024-06-14T19:24:00Z) - AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review [0.29998889086656577]
This study proposes an AIOps terminology and taxonomy, establishing a structured incident management procedure and providing guidelines for constructing an AIOps framework.
The goal is to provide a comprehensive review of technical and research aspects in AIOps for incident management, aiming to structure knowledge, identify gaps, and establish a foundation for future developments in the field.
arXiv Detail & Related papers (2024-04-01T17:32:22Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - Solution-oriented Agent-based Models Generation with Verifier-assisted
Iterative In-context Learning [10.67134969207797]
Agent-based models (ABMs) stand as an essential paradigm for proposing and validating hypothetical solutions or policies.
Large language models (LLMs) encapsulating cross-domain knowledge and programming proficiency could potentially alleviate the difficulty of this process.
We present SAGE, a general solution-oriented ABM generation framework designed for automatic modeling and generating solutions for targeted problems.
arXiv Detail & Related papers (2024-02-04T07:59:06Z) - LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs)
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
arXiv Detail & Related papers (2023-11-13T15:08:59Z) - A Systematic Mapping Study in AIOps [0.6977626480948161]
We conduct an in-depth mapping study to collect and organize the numerous scattered contributions to AIOps.
We create an AIOps taxonomy to build a foundation for future contributions.
Our results show a recent and growing interest towards AIOps, specifically to those contributions treating failure-related tasks.
arXiv Detail & Related papers (2020-12-15T09:05:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.