BayesFLo: Bayesian fault localization of complex software systems
- URL: http://arxiv.org/abs/2403.08079v2
- Date: Fri, 07 Mar 2025 18:13:10 GMT
- Title: BayesFLo: Bayesian fault localization of complex software systems
- Authors: Yi Ji, Simon Mak, Ryan Lekivetz, Joseph Morgan,
- Abstract summary: Key step in software testing is fault localization, which uses test data to pinpoint failure-inducing combinations.<n>Existing fault localization methods have two key limitations.<n>They do not incorporate domain and/or structural knowledge from test engineers.
- Score: 3.8607945277671054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software testing is essential for the reliable development of complex software systems. A key step in software testing is fault localization, which uses test data to pinpoint failure-inducing combinations for further diagnosis. Existing fault localization methods have two key limitations: they (i) do not incorporate domain and/or structural knowledge from test engineers, and (ii) do not provide a probabilistic assessment of risk for potential root causes. Such methods can thus fail to confidently whittle down the combinatorial number of potential root causes in complex systems, resulting in prohibitively high testing costs. To address this, we propose a novel Bayesian fault localization framework called BayesFLo, which leverages a flexible Bayesian model for identifying potential root causes with probabilistic uncertainty. Using a carefully-specified prior on root cause probabilities, BayesFLo permits the integration of domain and structural knowledge via the principles of combination hierarchy and heredity, which capture the expected structure of failure-inducing combinations. We then develop new algorithms for efficient computation of posterior root cause probabilities, leveraging recent tools from integer programming and graph representations. Finally, we demonstrate the effectiveness of BayesFLo over existing methods in two fault localization case studies on the Traffic Alert and Collision Avoidance System and the JMP Easy DOE platform.
Related papers
- AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework [4.782965804438204]
Large Language Models (LLMs) demonstrate potentials for automating scientific code generation but face challenges in reliability, error propagation, and evaluation.<n>We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP)<n>Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback.
arXiv Detail & Related papers (2026-03-03T18:25:00Z) - FastFI: Enhancing API Call-Site Robustness in Microservice-Based Systems with Fault Injection [8.84679612879872]
Fault injection is a key technique for assessing software reliability, enabling proactive detection of system defects.<n>FastFI is a fault-injection-guided framework to enhance the robustness of API call sites in microservice-based systems.<n>FastFI reduces end-to-end fault-injection time by an average of 76.12% compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-01-21T09:24:54Z) - Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs [50.075587392477935]
We conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems.<n>Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack.
arXiv Detail & Related papers (2026-01-20T06:42:56Z) - Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems [54.916243942641444]
Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications.<n>We study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline.
arXiv Detail & Related papers (2025-12-23T03:10:09Z) - TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts.
Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z) - Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation [58.799397354312596]
Large language models (LLMs) have demonstrated remarkable capabilities in various domains, particularly in system 1 tasks.
Recent research on System2-to-System1 methods surge, exploring the System 2 reasoning knowledge via inference-time computation.
In this paper, we focus on code generation, which is a representative System 2 task, and identify two primary challenges.
arXiv Detail & Related papers (2025-02-18T03:20:50Z) - Can Search-Based Testing with Pareto Optimization Effectively Cover Failure-Revealing Test Inputs? [2.038863628148453]
We argue that search-based software testing (SBST) is inadequate for covering failure-inducing areas within a search domain.
We measure the coverage of failure-revealing test inputs in the input space using a metric that we refer to as the Coverage Inverted Distance quality indicator.
arXiv Detail & Related papers (2024-10-15T16:44:40Z) - Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z) - Supporting Early-Safety Analysis of IoT Systems by Exploiting Testing
Techniques [9.095386349136717]
FailureLogic Analysis FLA is a technique that helps predict potential failure scenarios.
manually specifying FLA rules can be arduous and errorprone leading to incomplete or inaccurate specifications.
We propose adopting testing methodologies to improve the completeness and correctness of these rules.
arXiv Detail & Related papers (2023-09-06T13:32:39Z) - Bayesian Safety Validation for Failure Probability Estimation of Black-Box Systems [34.61865848439637]
Estimating the probability of failure is an important step in the certification of safety-critical systems.
This work frames the problem of black-box safety validation as a Bayesian optimization problem.
The algorithm is designed to search for failures, compute the most-likely failure, and estimate the failure probability over an operating domain.
arXiv Detail & Related papers (2023-05-03T22:22:48Z) - A Learning-Based Optimal Uncertainty Quantification Method and Its
Application to Ballistic Impact Problems [1.713291434132985]
This paper concerns the optimal (supremum and infimum) uncertainty bounds for systems where the input (or prior) measure is only partially/imperfectly known.
We demonstrate the learning based framework on the uncertainty optimization problem.
We show that the approach can be used to construct maps for the performance certificate and safety in engineering practice.
arXiv Detail & Related papers (2022-12-28T14:30:53Z) - Bayesian sequential design of computer experiments for quantile set inversion [0.0]
We consider an unknown multivariate function representing a system-such as a complex numerical simulator.
Our objective is to estimate the set of deterministic inputs leading to outputs whose probability is less than a given threshold.
arXiv Detail & Related papers (2022-11-02T10:14:05Z) - Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions [60.26921219698514]
We introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers.
We then present the pointwise feasibility conditions of the resulting safety controller.
We use these conditions to devise an event-triggered online data collection strategy.
arXiv Detail & Related papers (2022-08-23T05:02:09Z) - Large-Scale Sequential Learning for Recommender and Engineering Systems [91.3755431537592]
In this thesis, we focus on the design of an automatic algorithms that provide personalized ranking by adapting to the current conditions.
For the former, we propose novel algorithm called SAROS that take into account both kinds of feedback for learning over the sequence of interactions.
The proposed idea of taking into account the neighbour lines shows statistically significant results in comparison with the initial approach for faults detection in power grid.
arXiv Detail & Related papers (2022-05-13T21:09:41Z) - Rare event estimation using stochastic spectral embedding [0.0]
Estimating the probability of rare failure events is an essential step in the reliability assessment of engineering systems.
We propose a set of modifications that tailor the algorithm to efficiently solve rare event estimation problems.
arXiv Detail & Related papers (2021-06-09T16:10:33Z) - BayesIMP: Uncertainty Quantification for Causal Data Fusion [52.184885680729224]
We study the causal data fusion problem, where datasets pertaining to multiple causal graphs are combined to estimate the average treatment effect of a target variable.
We introduce a framework which combines ideas from probabilistic integration and kernel mean embeddings to represent interventional distributions in the reproducing kernel Hilbert space.
arXiv Detail & Related papers (2021-06-07T10:14:18Z) - Probabilistic robust linear quadratic regulators with Gaussian processes [73.0364959221845]
Probabilistic models such as Gaussian processes (GPs) are powerful tools to learn unknown dynamical systems from data for subsequent use in control design.
We present a novel controller synthesis for linearized GP dynamics that yields robust controllers with respect to a probabilistic stability margin.
arXiv Detail & Related papers (2021-05-17T08:36:18Z) - Handling Epistemic and Aleatory Uncertainties in Probabilistic Circuits [18.740781076082044]
We propose an approach to overcome the independence assumption behind most of the approaches dealing with a large class of probabilistic reasoning.
We provide an algorithm for Bayesian learning from sparse, albeit complete, observations.
Each leaf of such circuits is labelled with a beta-distributed random variable that provides us with an elegant framework for representing uncertain probabilities.
arXiv Detail & Related papers (2021-02-22T10:03:15Z) - Improving Uncertainty Calibration via Prior Augmented Data [56.88185136509654]
Neural networks have proven successful at learning from complex data distributions by acting as universal function approximators.
They are often overconfident in their predictions, which leads to inaccurate and miscalibrated probabilistic predictions.
We propose a solution by seeking out regions of feature space where the model is unjustifiably overconfident, and conditionally raising the entropy of those predictions towards that of the prior distribution of the labels.
arXiv Detail & Related papers (2021-02-22T07:02:37Z) - Gaussian Process-based Min-norm Stabilizing Controller for
Control-Affine Systems with Uncertain Input Effects and Dynamics [90.81186513537777]
We propose a novel compound kernel that captures the control-affine nature of the problem.
We show that this resulting optimization problem is convex, and we call it Gaussian Process-based Control Lyapunov Function Second-Order Cone Program (GP-CLF-SOCP)
arXiv Detail & Related papers (2020-11-14T01:27:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.