Related papers: United We Stand: Towards End-to-End Log-based Fault Diagnosis via Interactive Multi-Task Learning

URL: http://arxiv.org/abs/2509.24364v1
Date: Mon, 29 Sep 2025 07:03:23 GMT
Title: United We Stand: Towards End-to-End Log-based Fault Diagnosis via Interactive Multi-Task Learning
Authors: Minghua He, Chiming Duan, Pei Xiao, Tong Jia, Siyu Yu, Lingzhe Zhang, Weijie Hong, Jin Han, Yifan Wu, Ying Li, Gang Huang
Abstract summary: Chimera is a novel end-to-end log-based fault diagnosis method. It bridges the gap between anomaly detection and root cause localization. It has been successfully deployed in production, serving an industrial cloud platform.
Score: 21.286258482234338
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Log-based fault diagnosis is essential for maintaining software system availability. However, existing fault diagnosis methods are built in a task-independent manner, which fails to bridge the gap between anomaly detection and root cause localization in terms of data form and diagnostic objectives. This results in three major issues: 1) diagnostic bias accumulates in the system; 2) system deployment relies on expensive monitoring data; 3) the collaborative relationship between diagnostic tasks is overlooked. To address these problems, we propose a novel end-to-end log-based fault diagnosis method, Chimera, whose key idea is to achieve end-to-end fault diagnosis through bidirectional interaction and knowledge transfer between anomaly detection and root cause localization. Chimera is based on interactive multi-task learning, with carefully designed interaction strategies between anomaly detection and root cause localization at the data, feature, and diagnostic result levels, so that both sub-tasks are performed interactively within a unified end-to-end framework. Evaluation on two public datasets and one industrial dataset shows that Chimera outperforms existing methods in both anomaly detection and root cause localization, achieving improvements of 2.92% to 5.00% and 19.01% to 37.09%, respectively. It has been successfully deployed in production, serving an industrial cloud platform.

Related papers

Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z)
MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents [12.160412894251406]
MicroRCA-Agent is an innovative solution for microservice root cause analysis based on large language model agents. The proposed solution demonstrates superior performance in complex microservice fault scenarios, achieving a final score of 50.71.
arXiv Detail & Related papers (2025-09-19T05:57:03Z)
How Execution Features Relate to Failures: An Empirical Study and Diagnosis Approach [11.857060911501016]
Fault localization aims to identify code regions likely responsible for failures. Traditional techniques primarily correlate statement execution with failures. We analyzed 17 execution features and assessed their correlation with failure outcomes.
arXiv Detail & Related papers (2025-02-25T22:00:05Z)
FaultExplainer: Leveraging Large Language Models for Interpretable Fault Detection and Diagnosis [7.161558367924948]
This paper presents FaultExplainer, an interactive tool designed to improve fault detection, diagnosis, and explanation in the Tennessee Eastman Process (TEP). FaultExplainer integrates real-time sensor data visualization, Principal Component Analysis (PCA)-based fault detection, and identification of top contributing variables within an interactive user interface powered by large language models (LLMs). We evaluate the LLMs' reasoning capabilities in two scenarios: one where historical root causes are provided, and one where they are not, to mimic the challenge of previously unseen faults.
arXiv Detail & Related papers (2024-12-19T03:35:06Z)
Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
Generalized Out-of-distribution Fault Diagnosis (GOOFD) via Internal Contrastive Learning [8.583116999933731]
We propose a Generalized Out-of-distribution Fault Diagnosis framework to integrate diagnosis subtasks. A unified fault diagnosis method based on internal contrastive learning and Mahalanobis distance is put forward to underpin the proposed framework. Our proposed method can be applied to multiple fault diagnosis tasks and achieves better performance than the existing single-task methods.
arXiv Detail & Related papers (2023-06-27T07:50:25Z)
Interactive System-wise Anomaly Detection [66.3766756452743]
Anomaly detection plays a fundamental role in various applications. It is challenging for existing methods to handle the scenarios where the instances are systems whose characteristics are not readily observed as data. We develop an end-to-end approach which includes an encoder-decoder module that learns system embeddings.
arXiv Detail & Related papers (2023-04-21T02:20:24Z)
PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning [58.85063149619348]
We propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows. Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets.
arXiv Detail & Related papers (2023-01-25T16:34:43Z)
Causality-Based Multivariate Time Series Anomaly Detection [63.799474860969156]
We formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism to generate the multivariate data. We then propose a causality-based anomaly detection approach, which first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism. We evaluate our approach with both simulated and public datasets as well as a case study on real-world AIOps applications.
arXiv Detail & Related papers (2022-06-30T06:00:13Z)
A2Log: Attentive Augmented Log Anomaly Detection [53.06341151551106]
Anomaly detection is becoming increasingly important for the dependability and serviceability of IT services. Existing unsupervised methods need anomaly examples to obtain a suitable decision boundary. We develop A2Log, which is an unsupervised anomaly detection method consisting of two steps: anomaly scoring and anomaly decision.
arXiv Detail & Related papers (2021-09-20T13:40:21Z)
An Explainable Artificial Intelligence Approach for Unsupervised Fault Detection and Diagnosis in Rotating Machinery [2.055054374525828]
This paper proposes a new approach for fault detection and diagnosis in rotating machinery. The methodology consists of three parts: feature extraction, fault detection and fault diagnosis. The effectiveness of the proposed approach is shown on three datasets containing different mechanical faults.
arXiv Detail & Related papers (2021-02-23T18:28:18Z)
Inheritance-guided Hierarchical Assignment for Clinical Automatic Diagnosis [50.15205065710629]
Clinical diagnosis, which aims to assign diagnosis codes for a patient based on the clinical note, plays an essential role in clinical decision-making. We propose a novel framework to combine the inheritance-guided hierarchical assignment and co-occurrence graph propagation for clinical automatic diagnosis.
arXiv Detail & Related papers (2021-01-27T13:16:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.