Related papers: Formal Analysis of Metastable Failures in Software Systems

URL: http://arxiv.org/abs/2510.03551v2
Date: Tue, 14 Oct 2025 07:37:45 GMT
Title: Formal Analysis of Metastable Failures in Software Systems
Authors: Peter Alvaro, Rebecca Isaacs, Rupak Majumdar, Kiran-Kumar Muniswamy-Reddy, Mahmoud Salamati, Sadegh Soudjani
Abstract summary: We provide the mathematical foundations of metastability in request-response server systems. We show how to construct continuous-time Markov chains (CTMCs) that approximate the semantics of the programs. We show that our qualitative visual analysis captures and predicts many instances of metastability that were observed in the field in a matter of milliseconds.
Score: 5.436969030534807
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Many large-scale software systems demonstrate metastable failures. In this class of failures, a stressor such as a temporary spike in workload causes the system performance to drop and, subsequently, the system performance continues to remain low even when the stressor is removed. These failures have been reported by many large corporations and considered to be a rare but catastrophic source of availability outages in cloud systems. In this paper, we provide the mathematical foundations of metastability in request-response server systems. We model such systems using a domain-specific language. We show how to construct continuous-time Markov chains (CTMCs) that approximate the semantics of the programs through modeling and data-driven calibration. We use the structure of the CTMC models to provide a visualization of the qualitative behavior of the model. The visualization is a surprisingly effective way to identify system parameterizations that cause a system to show metastable behaviors. We complement the qualitative analysis with quantitative predictions. We provide a formal notion of metastable behaviors based on escape probabilities, and show that metastable behaviors are related to the eigenvalue structure of the CTMC. Our characterization leads to algorithmic tools to predict recovery times in metastable models of server systems. We have implemented our technique in a tool for the modeling and analysis of server systems. Through models inspired by failures in real request-response systems, we show that our qualitative visual analysis captures and predicts many instances of metastability that were observed in the field in a matter of milliseconds. Our algorithms confirm that recovery times surge as the system parameters approach metastable modes in the dynamics.
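Illustrative sketch (not the authors' tool or DSL): the abstract ties recovery times to the structure of a CTMC. The toy below assumes a birth-death CTMC over queue lengths 0..capacity, with a hypothetical load-dependent retry term, and computes the expected time to drain from a full queue to an empty one. All names and parameters (arrival_rate, service_rate, retry_rate, capacity) are assumptions made for illustration.

```python
def expected_recovery_time(arrival_rate, service_rate, retry_rate, capacity):
    """Expected time for a toy birth-death CTMC to drain from `capacity` to 0.

    Assumed model (hypothetical, for illustration only):
      - state = number of queued requests, 0..capacity
      - birth rate at level i < capacity: arrival_rate + retry_rate * i
      - birth rate at level capacity: 0 (new requests are dropped)
      - death rate at every level i >= 1: service_rate
    """
    if service_rate <= 0 or capacity < 1:
        raise ValueError("service_rate must be positive and capacity >= 1")

    # step_down holds T_{i+1}, the expected time to move from level i+1 to i.
    # First-step analysis gives T_i = 1/d_i + (b_i / d_i) * T_{i+1}.
    step_down = 0.0
    total = 0.0
    for level in range(capacity, 0, -1):
        birth = 0.0 if level == capacity else arrival_rate + retry_rate * level
        step_down = 1.0 / service_rate + (birth / service_rate) * step_down
        total += step_down  # time from capacity to 0 is the sum of the T_i
    return total


if __name__ == "__main__":
    # Sweep the arrival rate toward the service rate to see how the expected
    # recovery time responds in this toy model.
    for lam in (0.5, 0.8, 0.95):
        t = expected_recovery_time(lam, 1.0, 0.01, 50)
        print(f"arrival_rate={lam}: expected recovery time = {t:.1f}")
```

In this toy, the expected recovery time grows as the arrival and retry rates approach the service rate, which loosely mirrors the abstract's remark that recovery times surge near metastable modes. It does not reproduce the paper's escape-probability or eigenvalue analysis.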

Related papers

DynaSTy: A Framework for SpatioTemporal Node Attribute Prediction in Dynamic Graphs [0.3991718754182582]
Accurate forecasting of node-level attributes on dynamic graphs is critical for applications ranging from financial trust networks to biological networks. In this work we propose an end-to-end dynamic edge-biased edge-temporal model that ingests a multi-dimensional time series of adjacency matrices. Our method consistently outperforms strong baselines on Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
arXiv Detail & Related papers (2026-01-08T21:32:20Z)
From Physics to Machine Learning and Back: Part II - Learning and Observational Bias in PHM [52.64097278841485]
Review examines how incorporating learning and observational biases through physics-informed modeling and data strategies can guide models toward physically consistent and reliable predictions. Fast adaptation methods including meta-learning and few-shot learning are reviewed alongside domain generalization techniques.
arXiv Detail & Related papers (2025-09-25T14:15:43Z)
Process mining-driven modeling and simulation to enhance fault diagnosis in cyber-physical systems [5.065341495341096]
Fault diagnosis in Cyber-Physical Systems (CPSs) is essential for ensuring system dependability and operational efficiency. We present a novel unsupervised fault diagnosis methodology that integrates collective anomaly detection in time series, process mining, and simulation. This enables the creation of comprehensive fault dictionaries that support predictive maintenance and the development of digital twins for industrial environments.
arXiv Detail & Related papers (2025-06-26T17:29:37Z)
Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
ATOM: A Framework of Detecting Query-Based Model Extraction Attacks for Graph Neural Networks [18.488168353080464]
Graph Neural Networks (GNNs) have gained traction in Graph-based Machine Learning as a Service (GML) platforms, yet they remain vulnerable to graph-based model extraction attacks (MEAs). We propose ATOM, a novel real-time MEA detection framework tailored for GNNs. ATOM integrates sequential modeling and reinforcement learning to dynamically detect evolving attack patterns, while leveraging $k$-core embedding to capture the structural properties, enhancing detection precision.
arXiv Detail & Related papers (2025-03-20T20:25:32Z)
Anomaly Detection in Complex Dynamical Systems: A Systematic Framework Using Embedding Theory and Physics-Inspired Consistency [0.0]
Anomaly detection in complex dynamical systems is essential for ensuring reliability, safety, and efficiency in industrial and cyber-physical infrastructures. We propose a system-theoretic approach to anomaly detection, grounded in classical embedding theory and physics-inspired consistency principles. Our findings support the hypothesis that anomalies disrupt stable system dynamics, providing a robust signal for anomaly detection.
arXiv Detail & Related papers (2025-02-26T17:06:13Z)
Representing Timed Automata and Timing Anomalies of Cyber-Physical Production Systems in Knowledge Graphs [51.98400002538092]
This paper aims to improve model-based anomaly detection in CPPS by combining the learned timed automaton with a formal knowledge graph about the system. Both the model and the detected anomalies are described in the knowledge graph in order to allow operators an easier interpretation of the model and the detected anomalies.
arXiv Detail & Related papers (2023-08-25T15:25:57Z)
Robustness and Generalization Performance of Deep Learning Models on Cyber-Physical Systems: A Comparative Study [71.84852429039881]
Investigation focuses on the models' ability to handle a range of perturbations, such as sensor faults and noise. We test the generalization and transfer learning capabilities of these models by exposing them to out-of-distribution (OOD) samples.
arXiv Detail & Related papers (2023-06-13T12:43:59Z)
Analysis of Numerical Integration in RNN-Based Residuals for Fault Diagnosis of Dynamic Systems [0.6999740786886536]
The paper includes a case study of a heavy-duty truck's after-treatment system to highlight the potential of these techniques for improving fault diagnosis performance. Data-driven modeling and machine learning are widely used to model the behavior of dynamic systems.
arXiv Detail & Related papers (2023-05-08T12:48:18Z)
Leveraging the structure of dynamical systems for data-driven modeling [111.45324708884813]
We consider the impact of the training set and its structure on the quality of the long-term prediction. We show how an informed design of the training set, based on invariants of the system and the structure of the underlying attractor, significantly improves the resulting models.
arXiv Detail & Related papers (2021-12-15T20:09:20Z)
Using Data Assimilation to Train a Hybrid Forecast System that Combines Machine-Learning and Knowledge-Based Components [52.77024349608834]
We consider the problem of data-assisted forecasting of chaotic dynamical systems when the available data is noisy partial measurements. We show that by using partial measurements of the state of the dynamical system, we can train a machine learning model to improve predictions made by an imperfect knowledge-based model.
arXiv Detail & Related papers (2021-02-15T19:56:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.