DiagNet: towards a generic, Internet-scale root cause analysis solution
- URL: http://arxiv.org/abs/2004.03343v1
- Date: Tue, 7 Apr 2020 13:21:32 GMT
- Title: DiagNet: towards a generic, Internet-scale root cause analysis solution
- Authors: Loïck Bonniot (WIDE), Christoph Neumann, François Taïani (WIDE)
- Abstract summary: We show how different machine learning techniques can be used for Internet-scale root cause analysis.
Our solution, DiagNet, adapts concepts from image processing research to handle network and system metrics.
We demonstrate promising root cause analysis capabilities, with a recall of 73.9%, including for causes introduced only at inference time.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diagnosing problems in Internet-scale services remains particularly difficult
and costly for both content providers and ISPs. Because the Internet is
decentralized, the cause of such problems might lie anywhere between an
end-user's device and the service datacenters. Further, the set of possible
problems and causes is not known in advance, making it impossible in practice
to train a classifier with all combinations of problems, causes and locations.
In this paper, we explore how different machine learning techniques can be used
for Internet-scale root cause analysis using measurements taken from end-user
devices. We show how to build generic models that (i) are agnostic to the
underlying network topology, (ii) do not require defining the full set of
possible causes during training, and (iii) can be quickly adapted to diagnose
new services. Our solution, DiagNet, adapts concepts from image processing
research to handle network and system metrics. We evaluate DiagNet with a
multi-cloud deployment of online services with injected faults and emulated
clients with automated browsers. We demonstrate promising root cause analysis
capabilities, with a recall of 73.9%, including for causes introduced only at
inference time.
Related papers
- Automated Root Cause Analysis System for Complex Data Products [1.7458548956314806]
We present ARCAS (Automated Root Cause Analysis System), a diagnostic platform built for fast diagnostic implementation and a low learning curve.
ARCAS is composed of a constellation of automated troubleshooting guides (Auto-TSGs) that can execute in parallel to detect issues using product telemetry and apply mitigation in near-real-time.
arXiv Detail & Related papers (2024-12-19T20:10:54Z)
- Don't Treat the Symptom, Find the Cause! Efficient Artificial-Intelligence Methods for (Interactive) Debugging [0.0]
In the modern world, we are permanently using, leveraging, interacting with, and relying upon systems of ever higher sophistication.
In this thesis, we will give an introduction to the topic of model-based diagnosis, point out the major challenges in the field, and discuss a selection of approaches from our research addressing these issues.
arXiv Detail & Related papers (2023-06-22T12:44:49Z)
- Enabling Inter-organizational Analytics in Business Networks Through Meta Machine Learning [0.0]
Fear of disclosing sensitive information as well as the sheer volume of the data that would need to be exchanged are key inhibitors for the creation of effective system-wide solutions.
We propose a meta machine learning method that deals with these obstacles to enable comprehensive analyses within a business network.
arXiv Detail & Related papers (2023-03-28T09:06:28Z)
- Learning to Detect Critical Nodes in Sparse Graphs via Feature Importance Awareness [53.351863569314794]
The critical node problem (CNP) aims to find a set of critical nodes from a network whose deletion maximally degrades the pairwise connectivity of the residual network.
This work proposes a feature importance-aware graph attention network for node representation.
It combines this representation with a dueling double deep Q-network to create an end-to-end algorithm that solves CNP for the first time.
arXiv Detail & Related papers (2021-12-03T14:23:05Z)
- A2Log: Attentive Augmented Log Anomaly Detection [53.06341151551106]
Anomaly detection becomes increasingly important for the dependability and serviceability of IT services.
Existing unsupervised methods need anomaly examples to obtain a suitable decision boundary.
We develop A2Log, an unsupervised anomaly detection method consisting of two steps: anomaly scoring and anomaly decision.
arXiv Detail & Related papers (2021-09-20T13:40:21Z)
- Analyzing Machine Learning Approaches for Online Malware Detection in Cloud [0.0]
We present online malware detection based on process level performance metrics and analyze the effectiveness of different machine learning models.
Our analysis concludes that neural network models can most accurately detect the impact that malware has on the process-level features of virtual machines in the cloud.
arXiv Detail & Related papers (2021-05-19T17:28:12Z)
- Machine Learning for Massive Industrial Internet of Things [69.52379407906017]
Industrial Internet of Things (IIoT) revolutionizes the future manufacturing facilities by integrating the Internet of Things technologies into industrial settings.
With the deployment of massive IIoT devices, it is difficult for the wireless network to support the ubiquitous connections with diverse quality-of-service (QoS) requirements.
We first summarize the requirements of the typical massive non-critical and critical IIoT use cases. We then identify unique characteristics of the massive IIoT scenario and the corresponding machine learning solutions, with their limitations and potential research directions.
arXiv Detail & Related papers (2021-03-10T20:10:53Z)
- Anytime Diagnosis for Reconfiguration [52.77024349608834]
We introduce and analyze FlexDiag, an anytime direct diagnosis approach.
We evaluate the algorithm with regard to performance and diagnosis quality using a configuration benchmark from the domain of feature models and an industrial configuration knowledge base from the automotive domain.
arXiv Detail & Related papers (2021-02-19T11:45:52Z)
- Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments.
It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z)
- Tomography Based Learning for Load Distribution through Opaque Networks [9.923523030849836]
A key task for over-the-top (OTT) service providers is sending traffic through the networks to minimize delays.
We consider this problem in a general setting where traffic sources can choose a set of ingresses through which their traffic enters a black box network.
Key technical challenges to solving this problem include the high dimensionality of the problem and handling constraints that are intrinsic to networks.
arXiv Detail & Related papers (2020-07-18T21:52:21Z)
- Neuromorphic AI Empowered Root Cause Analysis of Faults in Emerging Networks [3.710841042000923]
We propose an AI-based fault diagnosis solution that offers a key step towards a completely automated self-healing system.
We compare the performance of the proposed solution against state-of-the-art solutions in the literature.
Results show that the neuromorphic computing model achieves high classification accuracy compared to the other models.
arXiv Detail & Related papers (2020-05-04T13:26:56Z)
- Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC.
To address these open problems, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)