Comparative analysis of real bugs in open-source Machine Learning
projects -- A Registered Report
- URL: http://arxiv.org/abs/2209.09932v1
- Date: Tue, 20 Sep 2022 18:12:12 GMT
- Title: Comparative analysis of real bugs in open-source Machine Learning
projects -- A Registered Report
- Authors: Tuan Dung Lai, Anj Simmons, Scott Barnett, Jean-Guy Schneider, Rajesh
Vasa
- Abstract summary: We investigate whether there is a discrepancy in the distribution of resolution time between Machine Learning and non-ML issues.
We measure the resolution time and size of fix of ML and non-ML issues on a controlled sample and compare the distributions for each category of issue.
- Score: 5.275804627373337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Machine Learning (ML) systems rely on data to make predictions;
these systems have many added components compared to traditional software systems,
such as the data processing pipeline, the serving pipeline, and model training.
Existing research on software maintenance has studied the issue-reporting needs
and resolution process for different types of issues, such as performance and
security issues. However, ML systems have specific classes of faults, and
reporting ML issues requires domain-specific information. Because of the
different characteristics between ML and traditional Software Engineering
systems, we do not know to what extent the reporting needs are different, and
to what extent these differences impact the issue resolution process.
Objective: Our objective is to investigate whether there is a discrepancy in
the distribution of resolution time between ML and non-ML issues and whether
certain categories of ML issues require a longer time to resolve based on real
issue reports in open-source applied ML projects. We further investigate the
size of fix of ML issues and non-ML issues. Method: We extract issue reports,
pull requests, and code files in recent active applied ML projects from GitHub,
and use an automatic approach to filter ML and non-ML issues. We manually label
the issues using a known taxonomy of deep learning bugs. We measure the
resolution time and size of fix of ML and non-ML issues on a controlled sample
and compare the distributions for each category of issue.
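
A minimal sketch of the comparison step described in the Method above, not the paper's actual pipeline: it assumes issue records that already carry GitHub-style created_at/closed_at timestamps plus an is_ml flag from the manual labelling step, computes resolution time in days, and compares the ML and non-ML distributions with a non-parametric Mann-Whitney U test (the field names and the choice of test are assumptions for illustration).

```python
# Illustrative sketch only, not the paper's analysis code.
# Assumes issues have GitHub-style timestamps and a manually assigned is_ml label.
from datetime import datetime
from scipy.stats import mannwhitneyu

def resolution_days(issue):
    """Time from issue creation to closure, in days."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format
    opened = datetime.strptime(issue["created_at"], fmt)
    closed = datetime.strptime(issue["closed_at"], fmt)
    return (closed - opened).total_seconds() / 86400.0

def compare_resolution_times(issues):
    """Compare ML vs. non-ML resolution-time distributions (Mann-Whitney U)."""
    ml = [resolution_days(i) for i in issues if i["is_ml"]]
    non_ml = [resolution_days(i) for i in issues if not i["is_ml"]]
    stat, p_value = mannwhitneyu(ml, non_ml, alternative="two-sided")
    return stat, p_value

# Toy records, purely illustrative:
issues = [
    {"created_at": "2022-01-01T00:00:00Z", "closed_at": "2022-01-11T00:00:00Z", "is_ml": True},
    {"created_at": "2022-01-01T00:00:00Z", "closed_at": "2022-01-03T00:00:00Z", "is_ml": False},
    {"created_at": "2022-02-01T00:00:00Z", "closed_at": "2022-02-20T00:00:00Z", "is_ml": True},
    {"created_at": "2022-02-01T00:00:00Z", "closed_at": "2022-02-05T00:00:00Z", "is_ml": False},
]
print(compare_resolution_times(issues))
```

The size-of-fix comparison described in the abstract would follow the same pattern, substituting a fix-size measure (e.g. lines changed in the linked pull request) for resolution time.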
Related papers
- SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks.
We show that even the most prominent LLMs exhibit these error patterns in their outputs.
Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z) - Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making [85.24399869971236]
We aim to evaluate Large Language Models (LLMs) for embodied decision making.
Existing evaluations tend to rely solely on a final success rate.
We propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks.
arXiv Detail & Related papers (2024-10-09T17:59:00Z) - Verbalized Machine Learning: Revisiting Machine Learning with Language Models [63.10391314749408]
We introduce the framework of verbalized machine learning (VML)
VML constrains the parameter space to be human-interpretable natural language.
We empirically verify the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability.
arXiv Detail & Related papers (2024-06-06T17:59:56Z) - Understanding Information Storage and Transfer in Multi-modal Large Language Models [51.20840103605018]
We study how Multi-modal Large Language Models process information in a factual visual question answering task.
Key findings show that these MLLMs rely on self-attention blocks in much earlier layers for information storage.
We introduce MultEdit, a model-editing algorithm that can correct errors and insert new long-tailed information into MLLMs.
arXiv Detail & Related papers (2024-06-06T16:35:36Z) - When Code Smells Meet ML: On the Lifecycle of ML-specific Code Smells in
ML-enabled Systems [13.718420553401662]
We aim to investigate the emergence and evolution of specific types of quality-related concerns known as ML-specific code smells.
More specifically, we present a plan to study ML-specific code smells by empirically analyzing their prevalence in real ML-enabled systems.
We will conduct an exploratory study, mining a large dataset of ML-enabled systems and analyzing over 400k commits from 337 projects.
arXiv Detail & Related papers (2024-03-13T07:43:45Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Bug Characterization in Machine Learning-based Systems [15.521925194920893]
We investigate the characteristics of bugs in Machine Learning-based software systems and the difference between ML and non-ML bugs from the maintenance viewpoint.
Our analysis shows that nearly half of the real issues reported in ML-based systems are ML bugs, indicating that ML components are more error-prone than non-ML components.
arXiv Detail & Related papers (2023-07-26T21:21:02Z) - Bugs in Machine Learning-based Systems: A Faultload Benchmark [16.956588187947993]
There is no standard benchmark of bugs to assess their performance, compare them and discuss their advantages and weaknesses.
In this study, we firstly investigate the verifiability of the bugs in ML-based systems and show the most important factors in each one.
We provide a benchmark, named defect4ML, that satisfies all criteria of a standard benchmark, i.e., relevance, fairness, verifiability, and usability.
arXiv Detail & Related papers (2022-06-24T14:20:34Z) - Towards Perspective-Based Specification of Machine Learning-Enabled
Systems [1.3406258114080236]
This paper describes our work towards a perspective-based approach for specifying ML-enabled systems.
The approach involves analyzing a set of 45 ML concerns grouped into five perspectives: objectives, user experience, infrastructure, model, and data.
The main contribution of this paper is to provide two new artifacts that can be used to help specify ML-enabled systems.
arXiv Detail & Related papers (2022-06-20T13:09:23Z) - Understanding the Usability Challenges of Machine Learning In
High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains.
In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions.
We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z) - Vamsa: Automated Provenance Tracking in Data Science Scripts [17.53546311589593]
We introduce the ML provenance tracking problem.
We discuss the challenges in capturing such information in the context of Python.
We present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the users' code.
arXiv Detail & Related papers (2020-01-07T02:39:02Z)
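
The Vamsa entry above describes recovering provenance from data science scripts without modifying user code. As a rough illustration of that general idea only (not Vamsa's actual design or API), the sketch below statically walks a script's AST with Python's standard ast module to list imported libraries and the calls the script makes:

```python
# Illustrative only: this is NOT Vamsa's algorithm. A generic sketch of how a
# static-analysis tool might scan a data science script, without running it,
# to recover coarse provenance such as imported libraries and API calls.
import ast

SCRIPT = """
import pandas as pd
from sklearn.linear_model import LogisticRegression
df = pd.read_csv("train.csv")
model = LogisticRegression()
model.fit(df[["x1", "x2"]], df["label"])
"""

class ProvenanceVisitor(ast.NodeVisitor):
    def __init__(self):
        self.imports = []   # imported modules / names
        self.calls = []     # function and method names that are called

    def visit_Import(self, node):
        self.imports.extend(alias.name for alias in node.names)
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        self.imports.extend(f"{node.module}.{alias.name}" for alias in node.names)
        self.generic_visit(node)

    def visit_Call(self, node):
        if isinstance(node.func, ast.Attribute):
            self.calls.append(node.func.attr)
        elif isinstance(node.func, ast.Name):
            self.calls.append(node.func.id)
        self.generic_visit(node)

visitor = ProvenanceVisitor()
visitor.visit(ast.parse(SCRIPT))
print("imports:", visitor.imports)  # ['pandas', 'sklearn.linear_model.LogisticRegression']
print("calls:", visitor.calls)      # ['read_csv', 'LogisticRegression', 'fit']
```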
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.