Related papers: A Survey on Failure Analysis and Fault Injection in AI Systems

A Survey on Failure Analysis and Fault Injection in AI Systems

URL: http://arxiv.org/abs/2407.00125v1
Date: Fri, 28 Jun 2024 00:32:03 GMT
Title: A Survey on Failure Analysis and Fault Injection in AI Systems
Authors: Guangba Yu, Gou Tan, Haojia Huang, Zhenyu Zhang, Pengfei Chen, Roberto Natella, Zibin Zheng,
Abstract summary: The complexity of AI systems has exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. This study fills this gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures.
Score: 28.30817443151044
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. Despite the importance of these techniques, there lacks a comprehensive review of FA and FI methodologies in AI systems. This study fills this gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. We systematically analyze 160 papers and repositories to answer three research questions including (1) what are the prevalent failures in AI systems, (2) what types of faults can current FI tools simulate, (3) what gaps exist between the simulated faults and real-world failures. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures. Moreover, this survey contributes to the field by providing a framework for fault diagnosis, evaluating the state-of-the-art in FI, and identifying areas for improvement in FI techniques to enhance the resilience of AI systems.

Related papers

Anomaly Detection and Generation with Diffusion Models: A Survey [51.61574868316922]
Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing.<n>Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest.<n>This survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
arXiv Detail & Related papers (2025-06-11T03:29:18Z)
FAIR: Facilitating Artificial Intelligence Resilience in Manufacturing Industrial Internet [4.649530200405117]
We propose a novel framework for investigating the resilience of AI performance over time. The proposed method can facilitate effective diagnosis and mitigation strategies to recover AI performance. The merits of the proposed method are elaborated using an MII testbed of connected Aerosol Jet Printing (AJP) machines, fog nodes, and Cloud with inference tasks via AI pipelines.
arXiv Detail & Related papers (2025-03-03T01:17:22Z)
Computational Safety for Generative AI: A Signal Processing Perspective [65.268245109828]
computational safety is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI. We show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts. We discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.
arXiv Detail & Related papers (2025-02-18T02:26:50Z)
Bringing Order Amidst Chaos: On the Role of Artificial Intelligence in Secure Software Engineering [0.0]
The ever-evolving technological landscape offers both opportunities and threats, creating a dynamic space where chaos and order compete. Secure software engineering (SSE) must continuously address vulnerabilities that endanger software systems. This thesis seeks to bring order to the chaos in SSE by addressing domain-specific differences that impact AI accuracy.
arXiv Detail & Related papers (2025-01-09T11:38:58Z)
AI-Compass: A Comprehensive and Effective Multi-module Testing Tool for AI Systems [26.605694684145313]
In this study, we design and implement a testing tool, tool, to comprehensively and effectively evaluate AI systems. The tool extensively assesses adversarial robustness, model interpretability, and performs neuron analysis. Our research sheds light on a general solution for AI systems testing landscape.
arXiv Detail & Related papers (2024-11-09T11:15:17Z)
From Silos to Systems: Process-Oriented Hazard Analysis for AI Systems [2.226040060318401]
We translate System Theoretic Process Analysis (STPA) for analyzing AI operation and development processes. We focus on systems that rely on machine learning algorithms and conductedA on three case studies. We find that key concepts and steps of conducting anA readily apply, albeit with a few adaptations tailored for AI systems.
arXiv Detail & Related papers (2024-10-29T20:43:18Z)
Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA [43.116608441891096]
Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning. State-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval.
arXiv Detail & Related papers (2024-10-09T03:53:26Z)
EAIRiskBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [47.69642609574771]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction. Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results. However, the deployment of these agents in physical environments presents significant safety challenges. This study introduces EAIRiskBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z)
Testing autonomous vehicles and AI: perspectives and challenges from cybersecurity, transparency, robustness and fairness [53.91018508439669]
The study explores the complexities of integrating Artificial Intelligence into Autonomous Vehicles (AVs) It examines the challenges introduced by AI components and the impact on testing procedures. The paper identifies significant challenges and suggests future directions for research and development of AI in AV technology.
arXiv Detail & Related papers (2024-02-21T08:29:42Z)
Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive analysis of the characterization of adversarial inputs, through the lens of formal verification. We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations. Our analysis empirically demonstrates how adversarial inputs can affect the safety of a given DRL system with respect to such perturbations.
arXiv Detail & Related papers (2024-02-07T21:58:40Z)
Progressing from Anomaly Detection to Automated Log Labeling and Pioneering Root Cause Analysis [53.24804865821692]
This study introduces a taxonomy for log anomalies and explores automated data labeling to mitigate labeling challenges. The study envisions a future where root cause analysis follows anomaly detection, unraveling the underlying triggers of anomalies.
arXiv Detail & Related papers (2023-12-22T15:04:20Z)
Interpretable and Robust AI in EEG Systems: A Survey [13.911001648611832]
We propose a taxonomy of interpretability by characterizing it into three types: backpropagation, perturbation, and inherently interpretable methods. We classify the robustness mechanisms into four classes: noise and artifacts, human variability, data acquisition instability, and adversarial attacks.
arXiv Detail & Related papers (2023-04-21T05:51:39Z)
An Exploratory Study of AI System Risk Assessment from the Lens of Data Distribution and Uncertainty [4.99372598361924]
Deep learning (DL) has become a driving force and has been widely adopted in many domains and applications. This paper initiates an early exploratory study of AI system risk assessment from both the data distribution and uncertainty angles.
arXiv Detail & Related papers (2022-12-13T03:34:25Z)
Statistical Perspectives on Reliability of Artificial Intelligence Systems [6.284088451820049]
We provide statistical perspectives on the reliability of AI systems. We introduce a so-called SMART statistical framework for AI reliability research. We discuss recent developments in modeling and analysis of AI reliability.
arXiv Detail & Related papers (2021-11-09T20:00:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.