What Causes Exceptions in Machine Learning Applications? Mining Machine
Learning-Related Stack Traces on Stack Overflow
- URL: http://arxiv.org/abs/2304.12857v1
- Date: Tue, 25 Apr 2023 14:29:07 GMT
- Title: What Causes Exceptions in Machine Learning Applications? Mining Machine
Learning-Related Stack Traces on Stack Overflow
- Authors: Amin Ghadesi, Maxime Lamothe, and Heng Li
- Abstract summary: We study 11,449 stack traces related to seven popular Python ML libraries on Stack Overflow.
ML questions that contain stack traces gain more popularity than questions without stack traces.
Patterns related to subprocess invocation, external module execution, and remote API call are among the least likely to get accepted answers.
- Score: 6.09414932258309
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Machine learning (ML), including deep learning, has recently gained
tremendous popularity in a wide range of applications. However, like
traditional software, ML applications are not immune to the bugs that result
from programming errors. Explicit programming errors usually manifest through
error messages and stack traces. These stack traces describe the chain of
function calls that lead to an anomalous situation, or exception. Indeed, these
exceptions may cross the entire software stack (including applications and
libraries). Thus, studying the patterns in stack traces can help practitioners
and researchers understand the causes of exceptions in ML applications and the
challenges faced by ML developers. To that end, we mine Stack Overflow (SO) and
study 11,449 stack traces related to seven popular Python ML libraries. First,
we observe that ML questions that contain stack traces gain more popularity
than questions without stack traces; however, they are less likely to get
accepted answers. Second, we observe that recurrent patterns exist in ML stack
traces, even across different ML libraries, with a small portion of patterns
covering many stack traces. Third, we derive five high-level categories and 25
low-level types from the stack trace patterns: most patterns are related to
basic Python syntax, model training, parallelization, data transformation, and
subprocess invocation. Furthermore, the patterns related to subprocess
invocation, external module execution, and remote API call are among the least
likely to get accepted answers on SO. Our findings provide insights for
researchers, ML library providers, and ML application developers to improve the
quality of ML libraries and their applications.
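To illustrate the kind of artifact the study mines, here is a minimal, hypothetical Python sketch (the function names and the weights file are invented for illustration, not taken from the paper): an application-level call chain ends in an exception, and traceback.print_exc() prints the chain of frames that the paper refers to as a stack trace.

import traceback

def load_model(path):
    # Application-level code: delegates to a lower-level helper.
    return _read_weights(path)

def _read_weights(path):
    # Lower-level (library-like) code: raises when the file is missing.
    with open(path, "rb") as f:
        return f.read()

try:
    load_model("missing_weights.h5")  # hypothetical file name
except FileNotFoundError:
    # Print the chain of calls (the stack trace) that led to the exception.
    traceback.print_exc()

The printed traceback lists each frame from the top-level call down to the built-in open() that raised FileNotFoundError; such call chains, posted on Stack Overflow, are what the study clusters into patterns.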
Related papers
- Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities such as buffer overflows and use-after-free errors.
Traditional fuzzing struggles with the complexity and API diversity of DL libraries.
We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z) - ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration [70.26807758443675]
ExploraCoder is a training-free framework that empowers large language models to invoke unseen APIs in code solutions.
We show that ExploraCoder significantly improves performance for models lacking prior API knowledge, achieving an absolute increase of 11.24% over naive RAG approaches and 14.07% over pretraining methods in pass@10.
arXiv Detail & Related papers (2024-12-06T19:00:15Z) - KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z) - TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs).
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
arXiv Detail & Related papers (2024-06-06T22:07:50Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES.
Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query.
By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval.
arXiv Detail & Related papers (2023-05-24T06:16:44Z) - What are the Machine Learning best practices reported by practitioners
on Stack Exchange? [4.882319198853359]
We present a study listing 127 Machine Learning best practices, systematically mined from 242 posts across 14 different Stack Exchange (STE) websites.
The list of practices is presented in a set of categories related to different stages of the implementation process of an ML-enabled system.
arXiv Detail & Related papers (2023-01-25T10:50:28Z) - Comparative analysis of real bugs in open-source Machine Learning
projects -- A Registered Report [5.275804627373337]
We investigate whether there is a discrepancy in the distribution of resolution time between Machine Learning and non-ML issues.
We measure the resolution time and size of fix of ML and non-ML issues on a controlled sample and compare the distributions for each category of issue.
arXiv Detail & Related papers (2022-09-20T18:12:12Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - The Prevalence of Code Smells in Machine Learning projects [9.722159563454436]
Static code analysis can be used to find potential defects in the source code, opportunities for refactoring, and violations of common coding standards.
We gathered a dataset of 74 open-source Machine Learning projects, installed their dependencies and ran Pylint on them.
This resulted in a top-20 list of detected code smells per category.
arXiv Detail & Related papers (2021-03-06T16:01:54Z)