FAIL: Analyzing Software Failures from the News Using LLMs
- URL: http://arxiv.org/abs/2406.08221v2
- Date: Wed, 18 Sep 2024 00:30:42 GMT
- Title: FAIL: Analyzing Software Failures from the News Using LLMs
- Authors: Dharun Anandayuvaraj, Matthew Campbell, Arav Tewari, James C. Davis
- Abstract summary: We propose the Failure Analysis Investigation with LLMs (FAIL) system to fill this gap.
FAIL collects, analyzes, and summarizes software failures as reported in the news.
FAIL identified and analyzed 2,457 distinct failures reported across 4,184 articles.
- Score: 2.7325338323814328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software failures inform engineering work, standards, and regulations. For example, the Log4J vulnerability brought government and industry attention to evaluating and securing software supply chains. Accessing private engineering records is difficult, so failure analyses tend to use information reported by the news media. However, prior works in this direction have relied on manual analysis, which has limited the scale of their analyses. The community lacks automated support to enable such analyses to consider a wide range of news sources and incidents. In this paper, we propose the Failure Analysis Investigation with LLMs (FAIL) system to fill this gap. FAIL collects, analyzes, and summarizes software failures as reported in the news. FAIL groups articles that describe the same incident, then analyzes incidents using existing taxonomies for postmortems, faults, and system characteristics. To tune and evaluate FAIL, we followed the methods of prior works by manually analyzing 31 software failures. FAIL achieved an F1 score of 90% for collecting news about software failures, a V-measure of 0.98 for merging articles reporting on the same incident, and extracted 90% of the facts about failures. We then applied FAIL to a total of 137,427 news articles from 11 providers published between 2010 and 2022. FAIL identified and analyzed 2,457 distinct failures reported across 4,184 articles. Our findings include: (1) the current generation of large language models is capable of identifying news articles that describe failures and analyzing them according to structured taxonomies; (2) similar failures recur frequently both within and across organizations; and (3) the severity of the consequences of software failures has increased over the past decade. The full FAIL database is available so that researchers, engineers, and policymakers can learn from a diversity of software failures.
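The abstract outlines a three-stage pipeline: filter news articles for software failures with an LLM, merge articles that report the same incident, and analyze each incident against failure taxonomies. A minimal sketch of such a pipeline is shown below; the model names, prompts, clustering threshold, and helper functions are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a FAIL-style pipeline (not the authors' code):
# 1) use an LLM to decide whether an article reports a software failure,
# 2) merge articles describing the same incident via embeddings + clustering,
# 3) extract taxonomy fields (cause, fault, consequences) with a structured prompt.
import numpy as np
from openai import OpenAI
from sklearn.cluster import AgglomerativeClustering

client = OpenAI()  # assumes OPENAI_API_KEY is set; model names are placeholders

def reports_software_failure(article_text: str) -> bool:
    """Step 1: binary LLM filter for failure-reporting articles."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Does this news article describe a software failure? "
                              "Answer YES or NO.\n\n" + article_text[:6000]}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def merge_into_incidents(articles: list[str]) -> list[int]:
    """Step 2: cluster articles about the same incident via embedding similarity."""
    emb = client.embeddings.create(model="text-embedding-3-small", input=articles)
    X = np.array([e.embedding for e in emb.data])
    clusterer = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.35,  # threshold would need tuning
        metric="cosine", linkage="average")
    return clusterer.fit_predict(X).tolist()

def analyze_incident(incident_articles: list[str]) -> str:
    """Step 3: extract taxonomy fields (postmortem, fault, system) as JSON."""
    prompt = ("Summarize this software failure and fill in: cause, fault type, "
              "system characteristics, consequences. Respond as JSON.\n\n"
              + "\n---\n".join(a[:3000] for a in incident_articles))
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```

The reported metrics would correspond to evaluating these stages against a manually labeled sample: F1 for the failure-article filter, V-measure for the incident clustering, and fact-level recall for the taxonomy extraction.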
Related papers
- Exploring the extent of similarities in software failures across industries using LLMs [0.0]
This research utilizes the Failure Analysis Investigation with LLMs (FAIL) model to extract industry-specific information.
In previous work, news articles were collected from reputable sources and categorized by incident in a database.
This research extends these methods by categorizing articles into specific domains and types of software failures.
arXiv Detail & Related papers (2024-08-07T03:48:07Z)
- Learning Traffic Crashes as Language: Datasets, Benchmarks, and What-if Causal Analyses [76.59021017301127]
We propose a large-scale traffic crash language dataset, named CrashEvent, summarizing 19,340 real-world crash reports.
We formulate crash event feature learning as a novel text reasoning problem and fine-tune various large language models (LLMs) to predict detailed accident outcomes.
Our experimental results show that our LLM-based approach not only predicts the severity of accidents but also classifies different types of accidents and predicts injury outcomes.
arXiv Detail & Related papers (2024-06-16T03:10:16Z)
- ESRO: Experience Assisted Service Reliability against Outages [2.647000585570866]
We build a diagnostic service called ESRO that recommends root causes and remediation for failures.
We evaluate our model on several cloud service outages of a large enterprise over the course of 2 years.
arXiv Detail & Related papers (2023-09-13T18:04:52Z)
- An Empirical Study on Using Large Language Models to Analyze Software Supply Chain Security Failures [2.176373527773389]
One way to prevent future breaches is by studying past failures.
Traditional methods of analyzing these failures require manually reading and summarizing reports about them.
Natural Language Processing techniques such as Large Language Models (LLMs) could be leveraged to assist the analysis of failures.
arXiv Detail & Related papers (2023-08-09T15:35:14Z)
- EvLog: Identifying Anomalous Logs over Software Evolution [31.46106509190191]
We propose a novel unsupervised approach named Evolving Log extractor (EvLog) to process logs without parsing.
EvLog implements an anomaly discriminator with an attention mechanism to identify anomalous logs and to avoid the issues caused by unstable log sequences.
EvLog has shown effectiveness in two real-world system evolution log datasets with an average F1 score of 0.955 and 0.847 in the intra-version setting and inter-version setting, respectively.
arXiv Detail & Related papers (2023-06-02T12:58:00Z)
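The EvLog summary above describes scoring unparsed log messages with an attention-based discriminator. The sketch below illustrates the general shape of such a model in PyTorch; it is a simplified, hypothetical stand-in (EvLog itself is unsupervised and more elaborate), and the dimensions and two-class head are assumptions.

```python
# Minimal sketch (not the EvLog implementation): an attention-based
# discriminator that scores a log message as anomalous without parsing it
# into templates. Token embeddings stand in for EvLog's semantic extractor.
import torch
import torch.nn as nn

class AttentionLogDiscriminator(nn.Module):
    def __init__(self, vocab_size: int = 10000, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.attn = nn.Linear(dim, 1)          # scores each token
        self.classifier = nn.Linear(dim, 2)    # normal vs anomalous

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer-encoded raw log tokens
        x = self.embed(token_ids)                          # (B, L, D)
        mask = (token_ids != 0).unsqueeze(-1)              # ignore padding
        scores = self.attn(x).masked_fill(~mask, -1e9)     # (B, L, 1)
        weights = torch.softmax(scores, dim=1)             # attention over tokens
        pooled = (weights * x).sum(dim=1)                  # (B, D) message vector
        return self.classifier(pooled)                     # logits

# Usage: logits = AttentionLogDiscriminator()(torch.randint(1, 10000, (4, 32)))
```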
- Applying Machine Learning Analysis for Software Quality Test [0.0]
It is critical to understand what triggers maintenance and whether it can be predicted.
Numerous methods of assessing the complexity of created programs may produce useful prediction models.
In this paper, machine learning is applied to the available data to calculate cumulative software failure levels.
arXiv Detail & Related papers (2023-05-16T06:10:54Z)
- Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps [71.12026848664753]
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes.
In this work, we present ICA and the downstream Incident Search and Retrieval based RCA pipeline, built at Salesforce.
arXiv Detail & Related papers (2022-04-21T02:33:34Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- S3M: Siamese Stack (Trace) Similarity Measure [55.58269472099399]
We present S3M -- the first approach to computing stack trace similarity based on deep learning.
It is based on a biLSTM encoder and a fully-connected classifier to compute similarity.
Our experiments demonstrate the superiority of our approach over the state-of-the-art on both open-sourced data and a private JetBrains dataset.
arXiv Detail & Related papers (2021-03-18T21:10:41Z)
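The S3M summary above names the key components: a shared biLSTM encoder and a fully-connected head that scores a pair of stack traces. A rough sketch of that architecture follows; the vocabulary of frame identifiers, hidden sizes, and pair-feature construction are assumptions rather than the authors' configuration.

```python
# Rough sketch of an S3M-style Siamese model (assumptions throughout, not the
# authors' code): a shared biLSTM encodes two stack traces frame-by-frame and a
# small fully-connected head turns the pair of encodings into a similarity score.
import torch
import torch.nn as nn

class SiameseStackTraceSimilarity(nn.Module):
    def __init__(self, vocab_size: int = 5000, dim: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.encoder = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        # The head sees both encodings plus their difference/product features.
        self.head = nn.Sequential(
            nn.Linear(4 * 2 * hidden, 128), nn.ReLU(), nn.Linear(128, 1))

    def encode(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n_frames) integer ids of stack frames
        _, (h, _) = self.encoder(self.embed(frames))
        return torch.cat([h[0], h[1]], dim=-1)   # concat both LSTM directions

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        ea, eb = self.encode(a), self.encode(b)
        feats = torch.cat([ea, eb, torch.abs(ea - eb), ea * eb], dim=-1)
        return torch.sigmoid(self.head(feats)).squeeze(-1)  # similarity in [0, 1]

# Usage: sim = SiameseStackTraceSimilarity()(torch.randint(1, 5000, (2, 20)),
#                                            torch.randint(1, 5000, (2, 20)))
```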
- Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users.
We propose a framework for anomaly detection in log data, a major source of system information for troubleshooting.
arXiv Detail & Related papers (2021-02-23T09:17:05Z)
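The entry above proposes using pre-trained language models for log anomaly detection. One simple way to illustrate the idea is to embed log lines with an off-the-shelf pre-trained encoder and flag lines that drift far from the profile of normal logs; the sketch below does that and is not the paper's framework. The model name and percentile threshold are assumptions.

```python
# Illustrative sketch only: embed log lines with a pre-trained language model
# and flag lines far from the centroid of logs seen during normal operation.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder pre-trained encoder

def fit_normal_profile(normal_logs: list[str]) -> tuple[np.ndarray, float]:
    emb = model.encode(normal_logs, normalize_embeddings=True)
    centroid = emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    dists = 1.0 - emb @ centroid                       # cosine distance to centroid
    return centroid, float(np.percentile(dists, 99))   # threshold from normal data

def score_logs(logs: list[str], centroid: np.ndarray, threshold: float) -> list[bool]:
    emb = model.encode(logs, normalize_embeddings=True)
    dists = 1.0 - emb @ centroid
    return [bool(d > threshold) for d in dists]        # True = flagged as anomalous
```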
- D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools.
We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)
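D2A labels static-analysis findings by differential analysis: comparing the issues an analyzer reports before and after a bug-fixing commit, and treating issues removed by the fix as likely true positives. The following is a conceptual sketch of that labeling rule; the Issue fields and example data are illustrative, not the D2A schema.

```python
# Conceptual sketch of differential-analysis labeling in the spirit of D2A:
# run a static analyzer on the versions before and after a bug-fixing commit;
# issues present before the fix but gone after it are labeled as likely real bugs.
from dataclasses import dataclass

@dataclass(frozen=True)
class Issue:
    checker: str      # e.g. "NULL_DEREFERENCE"
    file: str
    function: str

def label_issues(before_fix: set[Issue], after_fix: set[Issue]) -> dict[Issue, int]:
    """Return 1 (likely true positive) for issues the fix removed, else 0."""
    return {issue: (1 if issue not in after_fix else 0) for issue in before_fix}

# Usage with made-up example findings:
before = {Issue("NULL_DEREFERENCE", "net/http.c", "parse_header"),
          Issue("BUFFER_OVERRUN", "util/str.c", "copy_name")}
after = {Issue("BUFFER_OVERRUN", "util/str.c", "copy_name")}
print(label_issues(before, after))  # the removed NULL_DEREFERENCE is labeled 1
```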
This list is automatically generated from the titles and abstracts of the papers on this site.