Bugs in Machine Learning-based Systems: A Faultload Benchmark
- URL: http://arxiv.org/abs/2206.12311v1
- Date: Fri, 24 Jun 2022 14:20:34 GMT
- Title: Bugs in Machine Learning-based Systems: A Faultload Benchmark
- Authors: Mohammad Mehdi Morovati, Amin Nikanjam, Foutse Khomh, Zhen Ming (Jack)
Jiang
- Abstract summary: There is currently no standard benchmark of bugs with which to assess ML quality-assurance tools, compare them, and discuss their advantages and weaknesses.
In this study, we first investigate the reproducibility and verifiability of bugs in ML-based systems and identify the most important factors affecting each.
We provide a benchmark, named defect4ML, that satisfies all the criteria of a standard benchmark, i.e., relevance, reproducibility, fairness, verifiability, and usability.
- Score: 16.956588187947993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid adoption of Machine Learning (ML) across various domains has
drawn increasing attention to the quality of ML components. Consequently, a
growing number of techniques and tools aim to improve the quality of ML
components and integrate them safely into ML-based systems. Although most of
these tools rely on bugs and their lifecycle, there is no standard benchmark of
bugs with which to assess their performance, compare them, and discuss their
advantages and weaknesses. In this study, we first investigate the
reproducibility and verifiability of bugs in ML-based systems and identify the
most important factors affecting each. Then, we explore the challenges of
building a benchmark of bugs in ML-based software systems and provide a bug
benchmark, named defect4ML, that satisfies all the criteria of a standard
benchmark, i.e., relevance, reproducibility, fairness, verifiability, and
usability. This faultload benchmark contains 113 bugs reported by ML developers
on GitHub and Stack Overflow, involving two of the most popular ML frameworks:
TensorFlow and Keras. defect4ML also addresses important challenges in Software
Reliability Engineering of ML-based software systems, namely: 1) fast framework
evolution, by providing bugs for different framework versions, 2) code
portability, by delivering similar bugs in different ML frameworks, 3) bug
reproducibility, by providing fully reproducible bugs with complete information
about required dependencies and data, and 4) lack of detailed information on
bugs, by presenting links to the bugs' origins. defect4ML can help ML-based
systems practitioners and researchers assess their testing tools and techniques.
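To make the reproducibility criterion concrete, the sketch below shows the kind of minimal, deterministically failing Keras program that such a benchmark entry could package together with pinned dependencies and data. This is a hypothetical illustration only, not an actual defect4ML entry: the framework version pin and the shape-mismatch fault are assumptions made for the example.

```python
# Hypothetical sketch (not taken from defect4ML): a minimal, reproducible
# Keras bug bundled with pinned dependencies and deterministic data.
# Assumed requirements.txt for illustration: tensorflow==2.9.1
import numpy as np
from tensorflow import keras

# Model expects 10 input features.
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Deterministic data with 12 features -- the fault: the feature count
# does not match the declared input_shape.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=(32, 12))
y = rng.integers(0, 2, size=(32, 1))

# Raises a ValueError about the incompatible input shape (12 vs. 10);
# the fix is to align the data's feature dimension with input_shape.
model.fit(x, y, epochs=1, verbose=0)
```

Bundling the exact framework version, the random seed, and the failing input alongside a link to the original report is what makes a bug of this kind reproducible and verifiable across tool evaluations.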
Related papers
- Are Large Language Models Memorizing Bug Benchmarks? [6.640077652362016]
Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair.
A growing concern within the software engineering community is that benchmarks may not reliably reflect true LLM performance due to the risk of data leakage.
We systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks.
arXiv Detail & Related papers (2024-11-20T13:46:04Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Bug Characterization in Machine Learning-based Systems [15.521925194920893]
We investigate the characteristics of bugs in Machine Learning-based software systems and the difference between ML and non-ML bugs from the maintenance viewpoint.
Our analysis shows that nearly half of the real issues reported in ML-based systems are ML bugs, indicating that ML components are more error-prone than non-ML components.
arXiv Detail & Related papers (2023-07-26T21:21:02Z)
- An Empirical Study of Bugs in Quantum Machine Learning Frameworks [5.868747298750261]
We inspect 391 real-world bugs collected from 22 open-source repositories of nine popular QML frameworks.
28% of the bugs are quantum-specific, such as erroneous unitary matrix implementation.
We manually distilled a taxonomy of five symptoms and nine root causes of bugs in QML platforms.
arXiv Detail & Related papers (2023-06-10T07:26:34Z)
- Comparative analysis of real bugs in open-source Machine Learning projects -- A Registered Report [5.275804627373337]
We investigate whether there is a discrepancy in the distribution of resolution time between Machine Learning and non-ML issues.
We measure the resolution time and size of fix of ML and non-ML issues on a controlled sample and compare the distributions for each category of issue.
arXiv Detail & Related papers (2022-09-20T18:12:12Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
- Characterizing and Detecting Mismatch in Machine-Learning-Enabled Systems [1.4695979686066065]
Development and deployment of machine learning systems remain challenging.
In this paper, we report our findings and their implications for improving end-to-end ML-enabled system development.
arXiv Detail & Related papers (2021-03-25T19:40:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.