AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems
- URL: http://arxiv.org/abs/2601.14912v1
- Date: Wed, 21 Jan 2026 11:51:59 GMT
- Title: AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems
- Authors: Guangba Yu, Genting Mai, Rui Wang, Ruipeng Li, Pengfei Chen, Long Pan, Ruijie Xu,
- Abstract summary: AlertGuardian is a framework collaborating large language models (LLMs) and lightweight graph models to optimize the alert life-cycle through three phases.<n>It significantly mitigates alert fatigue (94.8% alert reduction ratios) and accelerates fault diagnosis (90.5% diagnosis accuracy)<n>It also improves 1,174 alert rules, with 375 accepted by SREs (32% acceptance rate)
- Score: 9.607190519952466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Alerts are critical for detecting anomalies in large-scale cloud systems, ensuring reliability and user experience. However, current systems generate overwhelming volumes of alerts, degrading operational efficiency due to ineffective alert life-cycle management. This paper details the efforts of Company-X to optimize alert life-cycle management, addressing alert fatigue in cloud systems. We propose AlertGuardian, a framework collaborating large language models (LLMs) and lightweight graph models to optimize the alert life-cycle through three phases: Alert Denoise uses graph learning model with virtual noise to filter noise, Alert Summary employs Retrieval Augmented Generation (RAG) with LLMs to create actionable summary, and Alert Rule Refinement leverages multi-agent iterative feedbacks to improve alert rule quality. Evaluated on four real-world datasets from Company-X's services, AlertGuardian significantly mitigates alert fatigue (94.8\% alert reduction ratios) and accelerates fault diagnosis (90.5\% diagnosis accuracy). Moreover, AlertGuardian improves 1,174 alert rules, with 375 accepted by SREs (32% acceptance rate). Finally, we share success stories and lessons learned about alert life-cycle management after the deployment of AlertGuardian in Company-X.
Related papers
- AlertBERT: A noise-robust alert grouping framework for simultaneous cyber attacks [3.0540687763044123]
High numbers of security alerts issued by intrusion detection systems lead to alert fatigue among analysts.<n>Time-based alert grouping solutions are unsuitable for large scale computer networks characterised by high levels of false positive alerts and simultaneously occurring attacks.<n>We propose AlertBERT, a self-language framework designed to group alerts from isolated or concurrent attacks in noisy environments.
arXiv Detail & Related papers (2026-02-06T09:39:47Z) - The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.<n>We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z) - A Graph-Based Approach to Alert Contextualisation in Security Operations Centres [0.058633603884542605]
This paper proposes a graph-based approach to enhance alert contextualisation in a SOC by aggregating alerts into graph-based alert groups.<n>By grouping related alerts, we enable analysis at a higher abstraction level, capturing attack steps more effectively than individual alerts.<n>To show that our format is well suited for downstream machine learning methods, we employ Graph Matching Networks (GMNs) to correlate incoming alert groups with historical incidents.
arXiv Detail & Related papers (2025-09-16T10:20:39Z) - WebGuard: Building a Generalizable Guardrail for Web Agents [59.31116061613742]
WebGuard is the first dataset designed to support the assessment of web agent action risks.<n>It contains 4,939 human-annotated actions from 193 websites across 22 diverse domains.
arXiv Detail & Related papers (2025-07-18T18:06:27Z) - Automated Alert Classification and Triage (AACT): An Intelligent System for the Prioritisation of Cybersecurity Alerts [0.0]
AACT learns from analysts' triage actions on cybersecurity alerts.<n>It accurately predicts triage decisions in real time.<n>This reduces the SOC queue allowing analysts to focus on the most severe, relevant or ambiguous threats.
arXiv Detail & Related papers (2025-05-14T23:02:32Z) - Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection.<n>To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities.<n>Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z) - WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety.<n>WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
arXiv Detail & Related papers (2024-06-26T16:58:20Z) - Carbon Filter: Real-time Alert Triage Using Large Scale Clustering and Fast Search [6.830322979559498]
"Alert fatigue" is one of the biggest challenges faced by the Security Operations Center (SOC) today.
We present Carbon Filter, a statistical learning based system that dramatically reduces the number of alerts analysts need to manually review.
arXiv Detail & Related papers (2024-05-07T22:06:24Z) - Augment and Criticize: Exploring Informative Samples for Semi-Supervised
Monocular 3D Object Detection [64.65563422852568]
We improve the challenging monocular 3D object detection problem with a general semi-supervised framework.
We introduce a novel, simple, yet effective Augment and Criticize' framework that explores abundant informative samples from unlabeled data.
The two new detectors, dubbed 3DSeMo_DLE and 3DSeMo_FLEX, achieve state-of-the-art results with remarkable improvements for over 3.5% AP_3D/BEV (Easy) on KITTI.
arXiv Detail & Related papers (2023-03-20T16:28:15Z) - Sample-Efficient Safety Assurances using Conformal Prediction [57.92013073974406]
Early warning systems can provide alerts when an unsafe situation is imminent.
To reliably improve safety, these warning systems should have a provable false negative rate.
We present a framework that combines a statistical inference technique known as conformal prediction with a simulator of robot/environment dynamics.
arXiv Detail & Related papers (2021-09-28T23:00:30Z) - Moving Metric Detection and Alerting System at eBay [4.778341933013294]
At eBay, there are thousands of product health metrics for different domain teams to monitor.
We built a two-phase alerting system to notify users with actionable alerts based on anomaly detection and alert retrieval.
arXiv Detail & Related papers (2020-04-06T00:28:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.