Related papers: CSnake: Detecting Self-Sustaining Cascading Failure via Causal Stitching of Fault Propagations

URL: http://arxiv.org/abs/2509.26529v2
Date: Sat, 25 Oct 2025 20:07:01 GMT
Title: CSnake: Detecting Self-Sustaining Cascading Failure via Causal Stitching of Fault Propagations
Authors: Shangshu Qian, Lin Tan, Yongle Zhang
Abstract summary: This paper presents CSnake, a fault injection framework to expose self-sustaining cascading failures in distributed systems. CSnake uses the novel idea of causal stitching, which causally links multiple single-fault injections in different tests to simulate complex fault propagation chains. CSnake detected 15 bugs that cause self-sustaining cascading failures in five systems, five of which have been confirmed with two fixed.
Score: 7.708183748221455
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recent studies have revealed that self-sustaining cascading failures in distributed systems frequently lead to widespread outages, which are challenging to contain and recover from. Existing failure detection techniques struggle to expose such failures prior to deployment, as they typically require a complex combination of specific conditions to be triggered. This challenge stems from the inherent nature of cascading failures, as they typically involve a sequence of fault propagations, each activated by distinct conditions. This paper presents CSnake, a fault injection framework to expose self-sustaining cascading failures in distributed systems. CSnake uses the novel idea of causal stitching, which causally links multiple single-fault injections in different tests to simulate complex fault propagation chains. To identify these chains, CSnake designs a counterfactual causality analysis of fault propagations - fault causality analysis (FCA): FCA compares the execution trace of a fault injection run with its corresponding profile run (i.e., same test w/o the injection) and identifies any additional faults triggered, which are considered to have a causal relationship with the injected fault. To address the large search space of fault and workload combinations, CSnake employs a three-phase allocation protocol of test budget that prioritizes faults with unique and diverse causal consequences, increasing the likelihood of uncovering conditional fault propagations. Furthermore, to avoid incorrectly connecting fault propagations from workloads with incompatible conditions, CSnake performs a local compatibility check that approximately checks the compatibility of the path constraints associated with connected fault propagations with low overhead. CSnake detected 15 bugs that cause self-sustaining cascading failures in five systems, five of which have been confirmed with two fixed.
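The core idea in the abstract (FCA as a trace comparison, then stitching single-fault results across tests) can be illustrated with a minimal Python sketch. It is not CSnake's implementation: the function names, the fault names, and the representation of a trace as a list of fault identifiers are all hypothetical, and treating a self-sustaining failure as a cycle in the stitched graph is a modeling assumption made here for illustration. The three-phase budget allocation and the local compatibility check are omitted.

```python
from collections import defaultdict

def fault_causality(injected, injection_trace, profile_trace):
    """Counterfactual comparison: faults seen only in the injection run
    are treated as caused by the injected fault (hypothetical helper)."""
    extra = set(injection_trace) - set(profile_trace) - {injected}
    return {(injected, fault) for fault in extra}

def stitch(edge_sets):
    """Merge causal edges from independent single-fault tests into one graph."""
    graph = defaultdict(set)
    for edges in edge_sets:
        for src, dst in edges:
            graph[src].add(dst)
    return graph

def find_cycle(graph):
    """Depth-first search; return one cycle as a list of faults, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in sorted(graph.get(node, ())):
            if color[nxt] == GRAY:
                # Back edge: the cycle is the stack suffix starting at nxt.
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color[node] == WHITE:
            found = dfs(node)
            if found:
                return found
    return None

# Three separate tests, each injecting one (made-up) fault.
tests = [
    ("rpc_timeout", ["rpc_timeout", "retry_storm"], []),
    ("retry_storm", ["retry_storm", "queue_overflow"], []),
    ("queue_overflow", ["queue_overflow", "rpc_timeout"], []),
]
graph = stitch(fault_causality(f, inj, prof) for f, inj, prof in tests)
cycle = find_cycle(graph)
print(cycle)
```

No single test in this example shows the full loop; it only appears once the three single-fault results are stitched together, which is the situation the abstract describes.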

Related papers

ProbeLLM: Automating Principled Diagnosis of LLM Failures [89.44131968886184]
We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence.
arXiv Detail & Related papers (2026-02-13T14:33:13Z)
CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse [1.4608214000864057]
CausalT5K is a diagnostic benchmark of over 5,000 cases across 10 domains. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail.
arXiv Detail & Related papers (2026-02-09T17:36:56Z)
Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z)
ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs [21.409155842171497]
Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs). Errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. We introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method to address this specific vulnerability.
arXiv Detail & Related papers (2025-08-07T11:26:40Z)
TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts. Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z)
DeCaFlow: A Deconfounding Causal Generative Model [58.411886466157185]
We introduce DeCaFlow, a deconfounding causal generative model. We extend previous results on causal estimation under hidden confounding to show that a single instance of DeCaFlow provides correct estimates for all causal queries identifiable with do-calculus. Our empirical results on diverse settings show that DeCaFlow outperforms existing approaches, while demonstrating its out-of-the-box applicability to any given causal graph.
arXiv Detail & Related papers (2025-03-19T11:14:16Z)
Reshaping the Online Data Buffering and Organizing Mechanism for Continual Test-Time Adaptation [49.53202761595912]
Continual Test-Time Adaptation involves adapting a pre-trained source model to continually changing unsupervised target domains. We analyze the challenges of this task: online environment, unsupervised nature, and the risks of error accumulation and catastrophic forgetting. We propose an uncertainty-aware buffering approach to identify and aggregate significant samples with high certainty from the unsupervised, single-pass data stream.
arXiv Detail & Related papers (2024-07-12T15:48:40Z)
FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems [35.310727641258715]
We propose an automated approach, named FaultProfIT, for Fault pattern Profiling of Incident Tickets. It leverages hierarchy-guided contrastive learning to train a hierarchy-aware incident encoder and predicts fault patterns with enhanced incident representations. To date, FaultProfIT has analyzed 10,000+ incidents from 30+ cloud services, successfully revealing several fault trends that have informed system improvements.
arXiv Detail & Related papers (2024-02-27T15:14:19Z)
Concatenating quantum error-correcting codes with decoherence-free subspaces and vice versa [0.0]
Quantum error-correcting codes (QECCs) and decoherence-free subspace (DFS) codes provide active and passive means to address certain types of errors. The concatenation of a QECC and a DFS code results in a degenerate code that splits into actively and passively correcting parts. We show that for sufficiently strongly correlated errors, the concatenation with the DFS as the inner code provides better entanglement fidelity.
arXiv Detail & Related papers (2023-12-13T17:48:12Z)
Causal Disentanglement Hidden Markov Model for Fault Diagnosis [55.90917958154425]
We propose a Causal Disentanglement Hidden Markov model (CDHM) to learn the causality in the bearing fault mechanism. Specifically, we make full use of the time-series data and progressively disentangle the vibration signal into fault-relevant and fault-irrelevant factors. To expand the scope of the application, we adopt unsupervised domain adaptation to transfer the learned disentangled representations to other working environments.
arXiv Detail & Related papers (2023-08-06T05:58:45Z)
SCCAM: Supervised Contrastive Convolutional Attention Mechanism for Ante-hoc Interpretable Fault Diagnosis with Limited Fault Samples [9.648963514691046]
We propose a supervised contrastive convolutional attention mechanism (SCCAM) with ante-hoc interpretability to learn from limited fault samples. Three common fault diagnosis scenarios are covered, including a balanced scenario for additional verification and two scenarios with limited fault samples. The proposed SCCAM method can achieve better performance compared with the state-of-the-art methods on fault classification and root cause analysis.
arXiv Detail & Related papers (2023-02-03T08:43:55Z)
Causality-Based Multivariate Time Series Anomaly Detection [63.799474860969156]
We formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism to generate the multivariate data. We then propose a causality-based anomaly detection approach, which first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism. We evaluate our approach with both simulated and public datasets as well as a case study on real-world AIOps applications.
arXiv Detail & Related papers (2022-06-30T06:00:13Z)
Fast and Accurate Error Simulation for CNNs against Soft Errors [64.54260986994163]
We present a framework for the reliability analysis of Convolutional Neural Networks (CNNs) via an error simulation engine. These error models are defined based on the corruption patterns of the output of the CNN operators induced by faults. We show that our methodology achieves about 99% accuracy of the fault effects w.r.t. SASSIFI, and a speedup ranging from 44x up to 63x w.r.t. FI, that only implements a limited set of error models.
arXiv Detail & Related papers (2022-06-04T19:45:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.