What Causes Exceptions in Machine Learning Applications? Mining Machine
Learning-Related Stack Traces on Stack Overflow
- URL: http://arxiv.org/abs/2304.12857v1
- Date: Tue, 25 Apr 2023 14:29:07 GMT
- Title: What Causes Exceptions in Machine Learning Applications? Mining Machine
Learning-Related Stack Traces on Stack Overflow
- Authors: Amin Ghadesi, and Maxime Lamothe, and Heng Li
- Abstract summary: We study 11,449 stack traces related to seven popular Python ML libraries on Stack Overflow.
ML questions that contain stack traces gain more popularity than questions without stack traces.
Patterns related to subprocess invocation, external module execution, and remote API call are among the least likely to get accepted answers.
- Score: 6.09414932258309
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Machine learning (ML), including deep learning, has recently gained
tremendous popularity in a wide range of applications. However, like
traditional software, ML applications are not immune to the bugs that result
from programming errors. Explicit programming errors usually manifest through
error messages and stack traces. These stack traces describe the chain of
function calls that lead to an anomalous situation, or exception. Indeed, these
exceptions may cross the entire software stack (including applications and
libraries). Thus, studying the patterns in stack traces can help practitioners
and researchers understand the causes of exceptions in ML applications and the
challenges faced by ML developers. To that end, we mine Stack Overflow (SO) and
study 11,449 stack traces related to seven popular Python ML libraries. First,
we observe that ML questions that contain stack traces gain more popularity
than questions without stack traces; however, they are less likely to get
accepted answers. Second, we observe that recurrent patterns exist in ML stack
traces, even across different ML libraries, with a small portion of patterns
covering many stack traces. Third, we derive five high-level categories and 25
low-level types from the stack trace patterns: most patterns are related to
Python basic syntax, model training, parallelization, data transformation, and
subprocess invocation. Furthermore, the patterns related to subprocess
invocation, external module execution, and remote API call are among the least
likely to get accepted answers on SO. Our findings provide insights for
researchers, ML library providers, and ML application developers to improve the
quality of ML libraries and their applications.
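As a rough illustration of the kind of mining the abstract describes, the sketch below parses a Python stack trace (as it might appear in an SO post) into its chain of called functions and its final exception type, then abstracts it into a comparable "pattern". The regular expressions and the abstract_trace heuristic are hypothetical simplifications, not the authors' actual extraction pipeline.

```python
import re

# Hypothetical, simplified parser: a real pipeline would also handle chained
# exceptions ("During handling of the above exception...") and non-standard formats.
FRAME_RE = re.compile(r'File "(?P<file>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')
EXC_RE = re.compile(r'^(?P<exc>\w+(?:Error|Exception|Warning|Interrupt))\b(?::\s*(?P<msg>.*))?')

def parse_stack_trace(text):
    """Extract the call chain and the final exception type from a Python traceback."""
    frames = [m.groupdict() for m in FRAME_RE.finditer(text)]
    exception = None
    for line in text.splitlines():
        m = EXC_RE.match(line.strip())
        if m:
            exception = m.group("exc")   # keep the last match: the raised exception
    return frames, exception

def abstract_trace(frames, exception):
    """Abstract a concrete trace into a pattern: involved libraries + exception type."""
    libs = []
    for f in frames:
        path = f["file"].replace("\\", "/")
        if "site-packages/" in path:          # heuristic: library code vs. user code
            libs.append(path.split("site-packages/")[1].split("/")[0])
    return tuple(dict.fromkeys(libs)), exception  # de-duplicated library chain + exception

example = '''Traceback (most recent call last):
  File "train.py", line 12, in <module>
    model.fit(X, y)
  File "/usr/lib/python3.10/site-packages/sklearn/linear_model/_base.py", line 684, in fit
    X, y = self._validate_data(X, y)
ValueError: Found input variables with inconsistent numbers of samples'''

frames, exc = parse_stack_trace(example)
print(abstract_trace(frames, exc))   # (('sklearn',), 'ValueError')
```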
Related papers
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs).
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
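To make the "targeted line coverage" task concrete, here is a minimal sketch of how one might check whether a generated test actually executes a target line, using the coverage.py library; the file path, target line, and test function are made-up examples, not part of TESTEVAL itself.

```python
import coverage

def covers_target_line(test_fn, source_file, target_line):
    """Run a (generated) test function and report whether it executed target_line."""
    cov = coverage.Coverage()
    cov.start()
    try:
        test_fn()                      # execute the LLM-generated test
    finally:
        cov.stop()
    data = cov.get_data()
    executed = data.lines(source_file) or []   # line numbers hit in the file under test
    return target_line in executed

# Hypothetical usage: does generated_test hit line 42 of solution.py?
# print(covers_target_line(generated_test, "/abs/path/to/solution.py", 42))
```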
arXiv Detail & Related papers (2024-06-06T22:07:50Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES.
Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query.
By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval.
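A schematic sketch of the iterative expand-and-retain loop this summary describes; llm_generate_queries and score stand in for the LLM call and the candidate-scoring step and are hypothetical placeholders, not the actual ALLIES implementation.

```python
def allies_style_expand(query, llm_generate_queries, score, beam_width=3, depth=2):
    """Iteratively expand a query with LLM-generated related queries,
    keeping only the top-scoring candidates at each step (beam-search style)."""
    beam = [query]
    for _ in range(depth):
        candidates = list(beam)
        for q in beam:
            candidates.extend(llm_generate_queries(q))   # LLM proposes related queries
        # retain the beam_width most promising queries according to the scorer
        beam = sorted(set(candidates), key=score, reverse=True)[:beam_width]
    return beam
```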
arXiv Detail & Related papers (2023-05-24T06:16:44Z)
- What are the Machine Learning best practices reported by practitioners on Stack Exchange? [4.882319198853359]
We present a study listing 127 Machine Learning best practices, systematically mined from 242 posts on 14 different Stack Exchange (STE) websites.
The list of practices is presented in a set of categories related to different stages of the implementation process of an ML-enabled system.
arXiv Detail & Related papers (2023-01-25T10:50:28Z)
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL), which read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
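A minimal sketch of the program-aided idea: the model produces Python code as its reasoning, and a Python interpreter (here a plain exec call) computes the final answer. The prompt wording and the llm_complete function are hypothetical placeholders rather than the paper's actual prompts.

```python
def solve_with_program(question, llm_complete):
    """PAL-style solving: ask the model for Python code, then run it to get the answer."""
    prompt = (
        "Write Python code that solves the problem and stores the result "
        f"in a variable named `answer`.\n\nProblem: {question}\n"
    )
    code = llm_complete(prompt)          # hypothetical LLM call returning Python source
    namespace = {}
    exec(code, namespace)                # offload the solution step to the interpreter
    return namespace.get("answer")

# Example with a stand-in "LLM" that always returns the same program:
fake_llm = lambda _prompt: "answer = sum(range(1, 101))"
print(solve_with_program("What is the sum of 1..100?", fake_llm))   # 5050
```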
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
- Comparative analysis of real bugs in open-source Machine Learning projects -- A Registered Report [5.275804627373337]
We investigate whether there is a discrepancy in the distribution of resolution time between Machine Learning and non-ML issues.
We measure the resolution time and size of fix of ML and non-ML issues on a controlled sample and compare the distributions for each category of issue.
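One standard way to compare two resolution-time distributions like these is a non-parametric test; the sketch below uses SciPy's Mann-Whitney U test on made-up data (the registered report's actual statistical protocol may differ).

```python
from scipy.stats import mannwhitneyu

# Hypothetical resolution times in days for a controlled sample of issues.
ml_issue_days     = [3, 8, 21, 14, 5, 30, 9, 12]
non_ml_issue_days = [2, 4, 7, 6, 10, 3, 5, 8]

# Two-sided test: do the two distributions of resolution time differ?
stat, p_value = mannwhitneyu(ml_issue_days, non_ml_issue_days, alternative="two-sided")
print(f"U={stat:.1f}, p={p_value:.3f}")
```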
arXiv Detail & Related papers (2022-09-20T18:12:12Z)
- Bugs in Machine Learning-based Systems: A Faultload Benchmark [16.956588187947993]
There is no standard benchmark of bugs with which to assess the performance of tools that aim to improve the quality of ML-based systems, compare them, and discuss their advantages and weaknesses.
In this study, we first investigate the verifiability of the bugs in ML-based systems and show the most important factors in each one.
We provide a benchmark, namely defect4ML, that satisfies all criteria of a standard benchmark, i.e., relevance, fairness, verifiability, and usability.
arXiv Detail & Related papers (2022-06-24T14:20:34Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- OmniXAI: A Library for Explainable AI [98.07381528393245]
We introduce OmniXAI, an open-source Python library for eXplainable AI (XAI).
It offers omni-way explainable AI capabilities and various interpretable machine learning techniques.
For practitioners, the library provides an easy-to-use unified interface to generate the explanations for their applications.
arXiv Detail & Related papers (2022-06-01T11:35:37Z)
- The Prevalence of Code Smells in Machine Learning projects [9.722159563454436]
Static code analysis can be used to find potential defects in the source code, opportunities, and violations of common coding standards.
We gathered a dataset of 74 open-source Machine Learning projects, installed their dependencies and ran Pylint on them.
This resulted in a top-20 list of detected code smells per category.
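A rough sketch of that kind of analysis: run Pylint on a project from a script and count the reported messages per category and per smell. The project path is a made-up example, and a real replication would pin Pylint's version and configuration.

```python
import json
import subprocess
from collections import Counter

def count_code_smells(project_dir):
    """Run Pylint on a project and count messages per category and per message symbol."""
    result = subprocess.run(
        ["pylint", project_dir, "--output-format=json", "--recursive=y"],  # --recursive needs Pylint >= 2.13
        capture_output=True, text=True,
    )
    messages = json.loads(result.stdout or "[]")
    by_category = Counter(m["type"] for m in messages)      # convention/refactor/warning/error
    by_smell = Counter(m["symbol"] for m in messages)       # e.g. unused-import, invalid-name
    return by_category, by_smell.most_common(20)            # "top 20" smells, as in the study

# Hypothetical usage on a cloned ML project:
# categories, top20 = count_code_smells("some-ml-project/")
```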
arXiv Detail & Related papers (2021-03-06T16:01:54Z)