Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
- URL: http://arxiv.org/abs/2403.06485v1
- Date: Mon, 11 Mar 2024 07:48:35 GMT
- Title: Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
- Authors: Jinxi Kuang, Jinyang Liu, Junjie Huang, Renyi Zhong, Jiazhen Gu, Lan
Yu, Rui Tan, Zengyin Yang, Michael R. Lyu
- Abstract summary: COLA is a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation.
We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform.
Results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency.
- Score: 28.71225642605041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the scale and complexity of cloud systems, a system failure would
trigger an "alert storm", i.e., massive correlated alerts. Although these
alerts can be traced back to a few root causes, their overwhelming number makes
manual handling infeasible. Alert aggregation is thus critical to help
engineers concentrate on the root cause and facilitate failure resolution.
Existing methods typically utilize semantic similarity-based methods or
statistical methods to aggregate alerts. However, semantic similarity-based
methods overlook the causal rationale of alerts, while statistical methods can
hardly handle infrequent alerts.
To tackle these limitations, we introduce external knowledge, i.e., the
Standard Operating Procedures (SOPs) of alerts, as a supplement. We propose
COLA, a novel hybrid approach based on correlation mining and LLM (Large
Language Model) reasoning for online alert aggregation. The correlation mining
module effectively captures the temporal and spatial relations between alerts,
measuring their correlations in an efficient manner. Subsequently, only
uncertain pairs with low confidence are forwarded to the LLM reasoning module
for detailed analysis. This hybrid design harnesses both statistical evidence
for frequent alerts and the reasoning capabilities of computationally intensive
LLMs, ensuring the overall efficiency of COLA in handling large volumes of
alerts in practical scenarios. We evaluate COLA on three datasets collected
from the production environment of a large-scale cloud platform. The
experimental results show COLA achieves F1-scores from 0.901 to 0.930,
outperforming state-of-the-art methods and achieving comparable efficiency. We
also share our experience in deploying COLA in our real-world cloud system,
Cloud X.
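The hybrid routing described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the `Alert` fields, the thresholds, and the co-occurrence-based score are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    message: str

def correlation_score(a: Alert, b: Alert, co_occurrence: dict) -> float:
    # Statistical confidence (mined offline from historical alert data)
    # that two alerts belong to the same incident; illustrative only.
    return co_occurrence.get((a.service, b.service), 0.0)

def should_aggregate(a: Alert, b: Alert, co_occurrence: dict,
                     llm_judge, hi: float = 0.8, lo: float = 0.2) -> bool:
    score = correlation_score(a, b, co_occurrence)
    if score >= hi:   # statistically confident: same root cause
        return True
    if score <= lo:   # statistically confident: unrelated
        return False
    # Uncertain pair: defer to the (expensive) LLM reasoning module,
    # which in COLA consults SOP knowledge about the two alerts.
    return llm_judge(a, b)
```

Only the mid-confidence band pays the LLM cost, which is how the hybrid design stays efficient on the frequent alert pairs that dominate an alert storm.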
Related papers
- Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.
We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.
We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z)
- LoRA-Ensemble: Efficient Uncertainty Modelling for Self-attention Networks [52.46420522934253]
We introduce LoRA-Ensemble, a parameter-efficient deep ensemble method for self-attention networks.
By employing a single pre-trained self-attention network with weights shared across all members, we train member-specific low-rank matrices for the attention projections.
Our method exhibits superior calibration compared to explicit ensembles and achieves similar or better accuracy across various prediction tasks and datasets.
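The member-specific low-rank parameterization can be illustrated with a toy sketch (our own, not the paper's code; the dimensions and zero initialization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, members = 8, 2, 4          # hidden size, LoRA rank, ensemble size

W = rng.normal(size=(d, d))      # shared pre-trained projection (frozen)
A = [rng.normal(size=(d, r)) for _ in range(members)]
B = [np.zeros((r, d)) for _ in range(members)]  # zero-init: members start identical

def member_forward(x: np.ndarray, i: int) -> np.ndarray:
    # Member i applies the shared weight plus its own low-rank update.
    return x @ (W + A[i] @ B[i])

x = rng.normal(size=(1, d))
outs = [member_forward(x, i) for i in range(members)]
```

Each member adds only 2·d·r trainable parameters instead of a full d·d copy of the weight matrix, which is where the parameter efficiency comes from.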
arXiv Detail & Related papers (2024-05-23T11:10:32Z)
- FedRDF: A Robust and Dynamic Aggregation Function against Poisoning Attacks in Federated Learning [0.0]
Federated Learning (FL) represents a promising approach to addressing the privacy concerns typically associated with centralized Machine Learning (ML) deployments.
Despite its well-known advantages, FL is vulnerable to security attacks such as Byzantine behaviors and poisoning attacks.
Our proposed approach was tested against various model poisoning attacks, demonstrating superior performance over state-of-the-art aggregation methods.
arXiv Detail & Related papers (2024-02-15T16:42:04Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- Privacy-Preserving Distributed Learning for Residential Short-Term Load Forecasting [11.185176107646956]
Power system load data can inadvertently reveal the daily routines of residential users, posing a risk to their property security.
We introduce a Markovian Switching-based distributed training framework, the convergence of which is substantiated through rigorous theoretical analysis.
Case studies employing real-world power system load data validate the efficacy of our proposed algorithm.
arXiv Detail & Related papers (2024-02-02T16:39:08Z)
- FreqFed: A Frequency Analysis-Based Approach for Mitigating Poisoning Attacks in Federated Learning [98.43475653490219]
Federated learning (FL) is susceptible to poisoning attacks.
FreqFed is a novel aggregation mechanism that transforms the model updates into the frequency domain.
We demonstrate that FreqFed can mitigate poisoning attacks effectively with a negligible impact on the utility of the aggregated model.
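A heavily simplified sketch of the frequency-domain idea (the actual FreqFed inspects low-frequency DCT components and clusters them; the FFT features and median-distance filter below are our own stand-ins):

```python
import numpy as np

def robust_aggregate(updates: list, k: int = 4) -> np.ndarray:
    # Represent each client update by its k lowest-frequency FFT magnitudes;
    # poisoned updates tend to stand out in this compact representation.
    feats = np.stack([np.abs(np.fft.rfft(u))[:k] for u in updates])
    center = np.median(feats, axis=0)
    dist = np.linalg.norm(feats - center, axis=1)
    # Keep updates whose frequency signature is close to the majority.
    keep = dist <= np.median(dist) * 2
    return np.mean(np.array(updates)[keep], axis=0)
```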
arXiv Detail & Related papers (2023-12-07T16:56:24Z)
- A Hierarchical Security Events Correlation Model for Real-time Cyber Threat Detection and Response [0.0]
We develop a novel hierarchical event correlation model that promises to reduce the number of alerts issued by an Intrusion Detection System.
The proposed model takes the best of features from similarity and graph-based correlation techniques to deliver an ensemble capability not possible by either approach separately.
The model is implemented as a proof of concept with experiments run on the DARPA 99 Intrusion detection set.
arXiv Detail & Related papers (2023-12-02T20:07:40Z)
- Over-the-Air Federated Learning and Optimization [52.5188988624998]
We focus on federated learning (FL) via over-the-air computation (AirComp).
We describe the convergence of AirComp-based FedAvg (AirFedAvg) algorithms under both convex and non-convex settings.
For different types of local updates that can be transmitted by edge devices (i.e., model, gradient, model difference), we reveal that transmitting in AirFedAvg may cause an aggregation error.
In addition, we consider more practical signal processing schemes to improve the communication efficiency and extend the convergence analysis to different forms of model aggregation error caused by these signal processing schemes.
arXiv Detail & Related papers (2023-10-16T05:49:28Z)
- PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis [17.362895895214344]
Large language models (LLMs) are used to help humans identify the root causes of cloud incidents.
We propose to perform confidence estimation for the predictions to help on-call engineers make decisions on whether to adopt the model prediction.
We show that our method produces calibrated confidence estimates for predicted root causes and validates the usefulness of the retrieved historical data and the prompting strategy.
arXiv Detail & Related papers (2023-09-11T21:24:00Z)
- Inter-Domain Fusion for Enhanced Intrusion Detection in Power Systems: An Evidence Theoretic and Meta-Heuristic Approach [0.0]
False alerts due to compromised IDS in ICS networks can lead to severe economic and operational damage.
This work presents an approach for reducing false alerts in CPS power systems by dealing with uncertainty without prior distribution of alerts.
arXiv Detail & Related papers (2021-11-20T00:05:39Z)
- DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.