Bugs in Machine Learning-based Systems: A Faultload Benchmark
- URL: http://arxiv.org/abs/2206.12311v1
- Date: Fri, 24 Jun 2022 14:20:34 GMT
- Title: Bugs in Machine Learning-based Systems: A Faultload Benchmark
- Authors: Mohammad Mehdi Morovati, Amin Nikanjam, Foutse Khomh, Zhen Ming (Jack)
Jiang
- Abstract summary: There is currently no standard benchmark of bugs with which to assess ML quality-assurance tools, compare them, and discuss their advantages and weaknesses.
In this study, we first investigate the reproducibility and verifiability of bugs in ML-based systems and identify the most important factors affecting each.
We provide a benchmark, named defect4ML, that satisfies all the criteria of a standard benchmark, i.e., relevance, reproducibility, fairness, verifiability, and usability.
- Score: 16.956588187947993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid adoption of Machine Learning (ML) across various domains has
drawn increasing attention to the quality of ML components. Consequently, a
growing number of techniques and tools aim to improve the quality of ML
components and integrate them safely into ML-based systems. Although most of
these tools rely on bugs and their lifecycle, there is no standard benchmark of
bugs with which to assess their performance, compare them, and discuss their
advantages and weaknesses. In this study, we first investigate the
reproducibility and verifiability of bugs in ML-based systems and identify the
most important factors affecting each. Then, we explore the challenges of
building a benchmark of bugs in ML-based software systems and provide a bug
benchmark, named defect4ML, that satisfies all the criteria of a standard
benchmark, i.e., relevance, reproducibility, fairness, verifiability, and
usability. This faultload benchmark contains 113 bugs reported by ML developers
on GitHub and Stack Overflow, involving two of the most popular ML frameworks:
TensorFlow and Keras. defect4ML also addresses important challenges in Software
Reliability Engineering of ML-based software systems, namely: 1) fast framework
evolution, by providing bugs for different framework versions, 2) code
portability, by delivering similar bugs in different ML frameworks, 3) bug
reproducibility, by providing fully reproducible bugs with complete information
about required dependencies and data, and 4) lack of detailed information on
bugs, by presenting links to the bugs' origins. defect4ML can help ML-based
systems practitioners and researchers assess their testing tools and techniques.
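To make the reproducibility criterion concrete, the sketch below shows the kind of minimal, deterministically failing Keras program that such a benchmark entry could package together with pinned dependencies and data. This is a hypothetical illustration only, not an actual defect4ML entry: the framework version pin and the shape-mismatch fault are assumptions made for the example.

```python
# Hypothetical sketch (not taken from defect4ML): a minimal, reproducible
# Keras bug bundled with pinned dependencies and deterministic data.
# Assumed requirements.txt for illustration: tensorflow==2.9.1
import numpy as np
from tensorflow import keras

# Model expects 10 input features.
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Deterministic data with 12 features -- the fault: the feature count
# does not match the declared input_shape.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=(32, 12))
y = rng.integers(0, 2, size=(32, 1))

# Raises a ValueError about the incompatible input shape (12 vs. 10);
# the fix is to align the data's feature dimension with input_shape.
model.fit(x, y, epochs=1, verbose=0)
```

Bundling the exact framework version, the random seed, and the failing input alongside a link to the original report is what makes a bug of this kind reproducible and verifiable across tool evaluations.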
Related papers
- Are Large Language Models Memorizing Bug Benchmarks? [6.640077652362016]
Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair.
A growing concern within the software engineering community is that benchmarks may not reliably reflect true LLM performance due to the risk of data leakage.
We systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks.
arXiv Detail & Related papers (2024-11-20T13:46:04Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Bug Characterization in Machine Learning-based Systems [15.521925194920893]
We investigate the characteristics of bugs in Machine Learning-based software systems and the difference between ML and non-ML bugs from the maintenance viewpoint.
Our analysis shows that nearly half of the real issues reported in ML-based systems are ML bugs, indicating that ML components are more error-prone than non-ML components.
arXiv Detail & Related papers (2023-07-26T21:21:02Z)
- An Empirical Study of Bugs in Quantum Machine Learning Frameworks [5.868747298750261]
We inspect 391 real-world bugs collected from 22 open-source repositories of nine popular QML frameworks.
28% of the bugs are quantum-specific, such as erroneous unitary matrix implementation.
We manually distilled a taxonomy of five symptoms and nine root causes of bugs in QML platforms.
arXiv Detail & Related papers (2023-06-10T07:26:34Z)
- Comparative analysis of real bugs in open-source Machine Learning projects -- A Registered Report [5.275804627373337]
We investigate whether there is a discrepancy in the distribution of resolution time between Machine Learning and non-ML issues.
We measure the resolution time and size of fix of ML and non-ML issues on a controlled sample and compare the distributions for each category of issue.
arXiv Detail & Related papers (2022-09-20T18:12:12Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
- Characterizing and Detecting Mismatch in Machine-Learning-Enabled Systems [1.4695979686066065]
Development and deployment of machine learning systems remain challenging.
In this paper, we report our findings and their implications for improving end-to-end ML-enabled system development.
arXiv Detail & Related papers (2021-03-25T19:40:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.