LogPilot: Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems
- URL: http://arxiv.org/abs/2509.25874v1
- Date: Tue, 30 Sep 2025 07:11:28 GMT
- Title: LogPilot: Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems
- Authors: Zhihan Jiang, Jinyang Liu, Yichen Li, Haiyu Huang, Xiao He, Tieying Zhang, Jianjun Chen, Yi Li, Rui Shi, Michael R. Lyu,
- Abstract summary: LogPilot is an intent-aware framework powered by Large Language Models (LLMs) for automated log-based alert diagnosis.<n>It reconstructs each request's execution into atemporal log chain, clusters similar chains to identify recurring execution patterns, and provides representative samples to the LLMs for diagnosis.<n> evaluated on real-world alerts from Volcano Engine Cloud, LogPilot improves the usefulness of root cause summarization by 50.34% and exact localization accuracy by 54.79% over state-of-the-art methods.
- Score: 41.55191803277989
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective alert diagnosis is essential for ensuring the reliability of large-scale online service systems. However, on-call engineers are often burdened with manually inspecting massive volumes of logs to identify root causes. While various automated tools have been proposed, they struggle in practice due to alert-agnostic log scoping and the inability to organize complex data effectively for reasoning. To overcome these limitations, we introduce LogPilot, an intent-aware and scalable framework powered by Large Language Models (LLMs) for automated log-based alert diagnosis. LogPilot introduces an intent-aware approach, interpreting the logic in alert definitions (e.g., PromQL) to precisely identify causally related logs and requests. To achieve scalability, it reconstructs each request's execution into a spatiotemporal log chain, clusters similar chains to identify recurring execution patterns, and provides representative samples to the LLMs for diagnosis. This clustering-based approach ensures the input is both rich in diagnostic detail and compact enough to fit within the LLM's context window. Evaluated on real-world alerts from Volcano Engine Cloud, LogPilot improves the usefulness of root cause summarization by 50.34% and exact localization accuracy by 54.79% over state-of-the-art methods. With a diagnosis time under one minute and a cost of only $0.074 per alert, LogPilot has been successfully deployed in production, offering an automated and practical solution for service alert diagnosis.
Related papers
- Cross-System Software Log-based Anomaly Detection Using Meta-Learning [17.39262430769509]
AIOps tools have been developed to automate the process of log-based anomaly detection for software systems.<n>Three practical challenges are widely recognized in this field: high data labeling costs, evolving logs in dynamic systems, and adaptability across different systems.<n>We propose CroSysLog, an AIOps tool for log-event level anomaly detection, specifically designed in response to these challenges.
arXiv Detail & Related papers (2024-12-19T22:55:45Z) - Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis [29.800380941293277]
Engineers prioritize two categories of log information for diagnosis: fault-indicating descriptions and fault-indicating parameters.
We propose an approach to automatically extract faultindicating information from logs for fault diagnosis, named LoFI.
LoFI outperforms all baseline methods by a significant margin, achieving an absolute improvement of 25.837.9 in F1 over the best baseline method, ChatGPT.
arXiv Detail & Related papers (2024-09-20T15:00:47Z) - LogRCA: Log-based Root Cause Analysis for Distributed Services [4.049637286678329]
We propose LogRCA, a novel method for identifying a minimal set of log lines that together describe a root cause.
LogRCA uses a semi-supervised learning approach to deal with rare and unknown errors and is designed to handle noisy data.
We evaluated our approach on a large-scale production log data set of 44.3 million log lines, which contains 80 failures, whose root causes were labeled by experts.
arXiv Detail & Related papers (2024-05-22T12:50:56Z) - LogFormer: A Pre-train and Tuning Pipeline for Log Anomaly Detection [73.69399219776315]
We propose a unified Transformer-based framework for Log anomaly detection (LogFormer) to improve the generalization ability across different domains.
Specifically, our model is first pre-trained on the source domain to obtain shared semantic knowledge of log data.
Then, we transfer such knowledge to the target domain via shared parameters.
arXiv Detail & Related papers (2024-01-09T12:55:21Z) - Data Drift Monitoring for Log Anomaly Detection Pipelines [2.941832525496684]
We introduce a Bayes Factor-based drift detection method that identifies when intervention, retraining, and updating of the LAD model are required with human involvement.
We illustrate our method using sequences of log activity, both from unaltered data, and simulated activity with controlled levels of anomaly contamination.
arXiv Detail & Related papers (2023-10-17T09:10:40Z) - PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning [58.85063149619348]
We propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows.
Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets.
arXiv Detail & Related papers (2023-01-25T16:34:43Z) - Leveraging Log Instructions in Log-based Anomaly Detection [0.5949779668853554]
We propose a method for reliable and practical anomaly detection from system logs.
It overcomes the common disadvantage of related works by building an anomaly detection model with log instructions from the source code of 1000+ GitHub projects.
The proposed method, named ADLILog, combines the log instructions and the data from the system of interest (target system) to learn a deep neural network model.
arXiv Detail & Related papers (2022-07-07T10:22:10Z) - LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak
Supervision [63.08516384181491]
We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts.
Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect.
Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.
arXiv Detail & Related papers (2021-11-02T15:16:08Z) - A2Log: Attentive Augmented Log Anomaly Detection [53.06341151551106]
Anomaly detection becomes increasingly important for the dependability and serviceability of IT services.
Existing unsupervised methods need anomaly examples to obtain a suitable decision boundary.
We develop A2Log, which is an unsupervised anomaly detection method consisting of two steps: Anomaly scoring and anomaly decision.
arXiv Detail & Related papers (2021-09-20T13:40:21Z) - Robust and Transferable Anomaly Detection in Log Data using Pre-Trained
Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users.
We propose a framework for anomaly detection in log data, as a major troubleshooting source of system information.
arXiv Detail & Related papers (2021-02-23T09:17:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.