Walk the Talk: Is Your Log-based Software Reliability Maintenance System Really Reliable?
- URL: http://arxiv.org/abs/2509.24352v1
- Date: Mon, 29 Sep 2025 06:52:40 GMT
- Title: Walk the Talk: Is Your Log-based Software Reliability Maintenance System Really Reliable?
- Authors: Minghua He, Tong Jia, Chiming Duan, Pei Xiao, Lingzhe Zhang, Kangjin Wang, Yifan Wu, Ying Li, Gang Huang
- Abstract summary: This paper defines a trustworthiness metric, diagnostic faithfulness, for models to gain service providers' trust. We propose FaithLog, a faithful log-based anomaly detection system, which achieves faithfulness through a carefully designed causality-guided attention mechanism and adversarial consistency learning.
- Score: 18.587739647424716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Log-based software reliability maintenance systems are crucial for sustaining stable customer experience. However, existing deep learning-based methods represent a black box for service providers, making it impossible for providers to understand how these methods detect anomalies, thereby hindering trust and deployment in real production environments. To address this issue, this paper defines a trustworthiness metric, diagnostic faithfulness, for models to gain service providers' trust, based on surveys of SREs at a major cloud provider. We design two evaluation tasks: attention-based root cause localization and event perturbation. Empirical studies demonstrate that existing methods perform poorly in diagnostic faithfulness. Consequently, we propose FaithLog, a faithful log-based anomaly detection system, which achieves faithfulness through a carefully designed causality-guided attention mechanism and adversarial consistency learning. Evaluation results on two public datasets and one industrial dataset demonstrate that the proposed method achieves state-of-the-art performance in diagnostic faithfulness.
Related papers
- Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation [97.36081721024728]
We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation. We present MedConf, an evidence-grounded linguistic self-assessment framework.
arXiv Detail & Related papers (2026-01-22T04:51:39Z)
- Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs [50.075587392477935]
We conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack.
arXiv Detail & Related papers (2026-01-20T06:42:56Z)
- Reliable and Reproducible Demographic Inference for Fairness in Face Analysis [63.46525489354455]
We propose a fully reproducible DAI pipeline that replaces conventional end-to-end training with a modular transfer learning approach. We audit this pipeline across three dimensions: accuracy, fairness, and a newly introduced notion of robustness, defined via intra-identity consistency. Our results show that the proposed method outperforms strong baselines, particularly on ethnicity, which is the more challenging attribute.
arXiv Detail & Related papers (2025-10-23T12:22:02Z)
- RationAnomaly: Log Anomaly Detection with Rationality via Chain-of-Thought and Reinforcement Learning [27.235259453535537]
RationAnomaly is a novel framework that enhances log anomaly detection by synergizing Chain-of-Thought fine-tuning with reinforcement learning. We have released the corresponding resources, including code and datasets.
arXiv Detail & Related papers (2025-09-18T07:35:58Z)
- Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA is complemented by a set of metrics designed to go beyond point-in-time performance. The fragility in SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z)
- Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
- Trustworthiness for an Ultra-Wideband Localization Service [2.4979362117484714]
This paper proposes a holistic trustworthiness assessment framework for ultra-wideband self-localization.
Our goal is to provide guidance for evaluating a system's trustworthiness based on objective evidence.
Our approach guarantees that the resulting trustworthiness indicators correspond to chosen real-world threats.
arXiv Detail & Related papers (2024-08-10T11:57:10Z)
- A Holistic Assessment of the Reliability of Machine Learning Systems [30.638615396429536]
This paper proposes a holistic assessment methodology for the reliability of machine learning (ML) systems.
Our framework evaluates five key properties: in-distribution accuracy, distribution-shift robustness, adversarial robustness, calibration, and out-of-distribution detection.
To provide insights into the performance of different algorithmic approaches, we identify and categorize state-of-the-art techniques.
arXiv Detail & Related papers (2023-07-20T05:00:13Z)
- Demonstrating Software Reliability using Possibly Correlated Tests: Insights from a Conservative Bayesian Approach [2.152298082788376]
We formalise informal notions of "doubting" that the executions are independent.
We develop techniques that reveal the extent to which independence assumptions can undermine conservatism in assessments.
arXiv Detail & Related papers (2022-08-16T20:27:47Z)
- Reliability Testing for Natural Language Processing Systems [14.393308846231083]
We argue for the need for reliability testing and contextualize it among existing work on improving accountability.
We show how adversarial attacks can be reframed for this goal, via a framework for developing reliability tests.
arXiv Detail & Related papers (2021-05-06T11:24:58Z)
- Adversarial Robustness under Long-Tailed Distribution [93.50792075460336]
Adversarial robustness has attracted extensive studies recently by revealing the vulnerability and intrinsic characteristics of deep networks.
In this work we investigate the adversarial vulnerability as well as defense under long-tailed distributions.
We propose a clean yet effective framework, RoBal, which consists of two dedicated modules, a scale-invariant classifier and data re-balancing.
arXiv Detail & Related papers (2021-04-06T17:53:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.