Bug Characterization in Machine Learning-based Systems
- URL: http://arxiv.org/abs/2307.14512v1
- Date: Wed, 26 Jul 2023 21:21:02 GMT
- Title: Bug Characterization in Machine Learning-based Systems
- Authors: Mohammad Mehdi Morovati, Amin Nikanjam, Florian Tambon, Foutse Khomh,
Zhen Ming (Jack) Jiang
- Abstract summary: We investigate the characteristics of bugs in Machine Learning-based software systems and the difference between ML and non-ML bugs from the maintenance viewpoint.
Our analysis shows that nearly half of the real issues reported in ML-based systems are ML bugs, indicating that ML components are more error-prone than non-ML components.
- Score: 15.521925194920893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of Machine Learning (ML) applications in different
domains, especially in safety-critical areas, increases the need for reliable
ML components, i.e., software components that operate based on ML.
Understanding the bug characteristics and maintenance challenges of ML-based
systems can help the developers of these systems identify where to focus
maintenance and testing efforts, by giving insights into the most error-prone
components, the most common bugs, etc. In this paper, we investigate the
characteristics of bugs in ML-based software systems and the differences
between ML and non-ML bugs from the maintenance viewpoint. We extracted
447,948 GitHub repositories that used one of the three most popular ML
frameworks, i.e., TensorFlow, Keras, and PyTorch. After multiple filtering
steps, we selected the 300 repositories with the highest number of closed
issues and manually investigated them to exclude non-ML-based systems. We then
manually inspected 386 sampled issues reported in the identified ML-based
systems to determine whether they affect ML components. Our analysis shows
that nearly half of the real issues reported in ML-based systems are ML bugs,
indicating that ML components are more error-prone than non-ML components.
Next, we thoroughly examined 109 identified ML bugs to identify their root
causes and symptoms and to calculate their required fixing time. The results
also revealed that ML bugs have significantly different characteristics from
non-ML bugs in terms of bug-fixing complexity (number of commits, changed
files, and changed lines of code). Based on our results, fixing ML bugs is
more costly and ML components are more error-prone than non-ML bugs and
non-ML components, respectively. Hence, paying significant attention to the
reliability of ML components is crucial in ML-based systems.
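As a rough illustration of the repository mining and bug-fix complexity measurement described in the abstract, the sketch below shows how closed-issue counts and fix-complexity metrics (commits, changed files, changed lines) could be collected through the GitHub REST API. This is a minimal sketch, not the authors' tooling: the endpoint paths are standard GitHub API v3 routes, while the repository name, pull request number, and helper function names are illustrative assumptions.

```python
import requests

API = "https://api.github.com"
# In practice an authenticated token is needed to avoid strict rate limits.
HEADERS = {"Accept": "application/vnd.github+json"}


def closed_issue_count(owner: str, repo: str) -> int:
    """Count a repository's closed issues via the GitHub issue search endpoint."""
    query = f"repo:{owner}/{repo} type:issue state:closed"
    resp = requests.get(f"{API}/search/issues",
                        params={"q": query, "per_page": 1},
                        headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["total_count"]


def fix_complexity(owner: str, repo: str, pr_number: int) -> dict:
    """Approximate the bug-fixing complexity measures named in the abstract
    (commits, changed files, changed lines) from the pull request that fixed
    an issue."""
    resp = requests.get(f"{API}/repos/{owner}/{repo}/pulls/{pr_number}",
                        headers=HEADERS)
    resp.raise_for_status()
    pr = resp.json()
    return {
        "commits": pr["commits"],
        "changed_files": pr["changed_files"],
        "changed_lines": pr["additions"] + pr["deletions"],
    }


if __name__ == "__main__":
    # Hypothetical repository and pull request, used purely for illustration.
    print(closed_issue_count("example-org", "example-ml-project"))
    print(fix_complexity("example-org", "example-ml-project", 42))
```

Ranking candidate repositories by the returned closed-issue counts would correspond to the paper's filtering step of keeping the 300 repositories with the most closed issues; linking each issue to its fixing commits or pull request is a separate step not shown here.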
Related papers
- SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks.
We show that even the most prominent LLMs exhibit these error patterns in their outputs.
Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- When Code Smells Meet ML: On the Lifecycle of ML-specific Code Smells in ML-enabled Systems [13.718420553401662]
We aim to investigate the emergence and evolution of specific types of quality-related concerns known as ML-specific code smells.
More specifically, we present a plan to study ML-specific code smells by empirically analyzing their prevalence in real ML-enabled systems.
We will conduct an exploratory study, mining a large dataset of ML-enabled systems and analyzing over 400k commits across 337 projects.
arXiv Detail & Related papers (2024-03-13T07:43:45Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
- Vulnerability of Machine Learning Approaches Applied in IoT-based Smart Grid: A Review [51.31851488650698]
Machine learning (ML) is increasingly used in internet-of-things (IoT)-based smart grids.
Adversarial distortion injected into the power signal can greatly affect the system's normal control and operation.
It is therefore imperative to conduct vulnerability assessments of MLsgAPPs applied in the context of safety-critical power systems.
arXiv Detail & Related papers (2023-08-30T03:29:26Z)
- Understanding the Complexity and Its Impact on Testing in ML-Enabled Systems [8.630445165405606]
We study Rasa 3.0, an industrial dialogue system that has been widely adopted by various companies around the world.
Our goal is to characterize the complexity of such a large-scale ML-enabled system and to understand the impact of the complexity on testing.
Our study reveals practical implications for software engineering for ML-enabled systems.
arXiv Detail & Related papers (2023-01-10T08:13:24Z)
- Comparative analysis of real bugs in open-source Machine Learning projects -- A Registered Report [5.275804627373337]
We investigate whether there is a discrepancy in the distribution of resolution time between Machine Learning and non-ML issues.
We measure the resolution time and size of fix of ML and non-ML issues on a controlled sample and compare the distributions for each category of issue.
arXiv Detail & Related papers (2022-09-20T18:12:12Z)
- Bugs in Machine Learning-based Systems: A Faultload Benchmark [16.956588187947993]
There is no standard benchmark of bugs with which to assess the performance of testing and debugging approaches for ML-based systems, compare them, and discuss their advantages and weaknesses.
In this study, we first investigate the verifiability of the bugs in ML-based systems and show the most important factors in each one.
We provide a benchmark, namely defect4ML, that satisfies all criteria of a standard benchmark, i.e., relevance, fairness, verifiability, and usability.
arXiv Detail & Related papers (2022-06-24T14:20:34Z)
- Characterizing and Detecting Mismatch in Machine-Learning-Enabled Systems [1.4695979686066065]
The development and deployment of machine learning systems remain a challenge.
In this paper, we report our findings and their implications for improving end-to-end ML-enabled system development.
arXiv Detail & Related papers (2021-03-25T19:40:29Z)
- Understanding the Usability Challenges of Machine Learning In High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains.
In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions.
We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.