What Causes Exceptions in Machine Learning Applications? Mining Machine
Learning-Related Stack Traces on Stack Overflow
- URL: http://arxiv.org/abs/2304.12857v1
- Date: Tue, 25 Apr 2023 14:29:07 GMT
- Title: What Causes Exceptions in Machine Learning Applications? Mining Machine
Learning-Related Stack Traces on Stack Overflow
- Authors: Amin Ghadesi, and Maxime Lamothe, and Heng Li
- Abstract summary: We study 11,449 stack traces related to seven popular Python ML libraries on Stack Overflow.
ML questions that contain stack traces gain more popularity than questions without stack traces.
Patterns related to subprocess invocation, external module execution, and remote API call are among the least likely to get accepted answers.
- Score: 6.09414932258309
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Machine learning (ML), including deep learning, has recently gained
tremendous popularity in a wide range of applications. However, like
traditional software, ML applications are not immune to the bugs that result
from programming errors. Explicit programming errors usually manifest through
error messages and stack traces. These stack traces describe the chain of
function calls that lead to an anomalous situation, or exception. Indeed, these
exceptions may cross the entire software stack (including applications and
libraries). Thus, studying the patterns in stack traces can help practitioners
and researchers understand the causes of exceptions in ML applications and the
challenges faced by ML developers. To that end, we mine Stack Overflow (SO) and
study 11,449 stack traces related to seven popular Python ML libraries. First,
we observe that ML questions that contain stack traces gain more popularity
than questions without stack traces; however, they are less likely to get
accepted answers. Second, we observe that recurrent patterns exist in ML stack
traces, even across different ML libraries, with a small portion of patterns
covering many stack traces. Third, we derive five high-level categories and 25
low-level types from the stack trace patterns: most patterns are related to
Python basic syntax, model training, parallelization, data transformation, and
subprocess invocation. Furthermore, the patterns related to subprocess
invocation, external module execution, and remote API call are among the least
likely to get accepted answers on SO. Our findings provide insights for
researchers, ML library providers, and ML application developers to improve the
quality of ML libraries and their applications.
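As a rough illustration of the kind of mining the abstract describes, the sketch below parses a Python stack trace (as it might appear in an SO post) into its chain of called functions and its final exception type, then abstracts it into a comparable "pattern". The regular expressions and the abstract_trace heuristic are hypothetical simplifications, not the authors' actual extraction pipeline.

```python
import re

# Hypothetical, simplified parser: a real pipeline would also handle chained
# exceptions ("During handling of the above exception...") and non-standard formats.
FRAME_RE = re.compile(r'File "(?P<file>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')
EXC_RE = re.compile(r'^(?P<exc>\w+(?:Error|Exception|Warning|Interrupt))\b(?::\s*(?P<msg>.*))?')

def parse_stack_trace(text):
    """Extract the call chain and the final exception type from a Python traceback."""
    frames = [m.groupdict() for m in FRAME_RE.finditer(text)]
    exception = None
    for line in text.splitlines():
        m = EXC_RE.match(line.strip())
        if m:
            exception = m.group("exc")   # keep the last match: the raised exception
    return frames, exception

def abstract_trace(frames, exception):
    """Abstract a concrete trace into a pattern: involved libraries + exception type."""
    libs = []
    for f in frames:
        path = f["file"].replace("\\", "/")
        if "site-packages/" in path:          # heuristic: library code vs. user code
            libs.append(path.split("site-packages/")[1].split("/")[0])
    return tuple(dict.fromkeys(libs)), exception  # de-duplicated library chain + exception

example = '''Traceback (most recent call last):
  File "train.py", line 12, in <module>
    model.fit(X, y)
  File "/usr/lib/python3.10/site-packages/sklearn/linear_model/_base.py", line 684, in fit
    X, y = self._validate_data(X, y)
ValueError: Found input variables with inconsistent numbers of samples'''

frames, exc = parse_stack_trace(example)
print(abstract_trace(frames, exc))   # (('sklearn',), 'ValueError')
```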
Related papers
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs).
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
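To make the "targeted line coverage" task concrete, here is a minimal sketch of how one might check whether a generated test actually executes a target line, using the coverage.py library; the file path, target line, and test function are made-up examples, not part of TESTEVAL itself.

```python
import coverage

def covers_target_line(test_fn, source_file, target_line):
    """Run a (generated) test function and report whether it executed target_line."""
    cov = coverage.Coverage()
    cov.start()
    try:
        test_fn()                      # execute the LLM-generated test
    finally:
        cov.stop()
    data = cov.get_data()
    executed = data.lines(source_file) or []   # line numbers hit in the file under test
    return target_line in executed

# Hypothetical usage: does generated_test hit line 42 of solution.py?
# print(covers_target_line(generated_test, "/abs/path/to/solution.py", 42))
```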
arXiv Detail & Related papers (2024-06-06T22:07:50Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES.
Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query.
By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval.
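A schematic sketch of the iterative expand-and-retain loop this summary describes; llm_generate_queries and score stand in for the LLM call and the candidate-scoring step and are hypothetical placeholders, not the actual ALLIES implementation.

```python
def allies_style_expand(query, llm_generate_queries, score, beam_width=3, depth=2):
    """Iteratively expand a query with LLM-generated related queries,
    keeping only the top-scoring candidates at each step (beam-search style)."""
    beam = [query]
    for _ in range(depth):
        candidates = list(beam)
        for q in beam:
            candidates.extend(llm_generate_queries(q))   # LLM proposes related queries
        # retain the beam_width most promising queries according to the scorer
        beam = sorted(set(candidates), key=score, reverse=True)[:beam_width]
    return beam
```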
arXiv Detail & Related papers (2023-05-24T06:16:44Z)
- What are the Machine Learning best practices reported by practitioners on Stack Exchange? [4.882319198853359]
We present a study listing 127 Machine Learning best practices, systematically mined from 242 posts on 14 different Stack Exchange (STE) websites.
The list of practices is presented in a set of categories related to different stages of the implementation process of an ML-enabled system.
arXiv Detail & Related papers (2023-01-25T10:50:28Z)
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL), which read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
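A minimal sketch of the program-aided idea: the model produces Python code as its reasoning, and a Python interpreter (here a plain exec call) computes the final answer. The prompt wording and the llm_complete function are hypothetical placeholders rather than the paper's actual prompts.

```python
def solve_with_program(question, llm_complete):
    """PAL-style solving: ask the model for Python code, then run it to get the answer."""
    prompt = (
        "Write Python code that solves the problem and stores the result "
        f"in a variable named `answer`.\n\nProblem: {question}\n"
    )
    code = llm_complete(prompt)          # hypothetical LLM call returning Python source
    namespace = {}
    exec(code, namespace)                # offload the solution step to the interpreter
    return namespace.get("answer")

# Example with a stand-in "LLM" that always returns the same program:
fake_llm = lambda _prompt: "answer = sum(range(1, 101))"
print(solve_with_program("What is the sum of 1..100?", fake_llm))   # 5050
```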
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
- Comparative analysis of real bugs in open-source Machine Learning projects -- A Registered Report [5.275804627373337]
We investigate whether there is a discrepancy in the distribution of resolution time between Machine Learning and non-ML issues.
We measure the resolution time and size of fix of ML and non-ML issues on a controlled sample and compare the distributions for each category of issue.
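One standard way to compare two resolution-time distributions like these is a non-parametric test; the sketch below uses SciPy's Mann-Whitney U test on made-up data (the registered report's actual statistical protocol may differ).

```python
from scipy.stats import mannwhitneyu

# Hypothetical resolution times in days for a controlled sample of issues.
ml_issue_days     = [3, 8, 21, 14, 5, 30, 9, 12]
non_ml_issue_days = [2, 4, 7, 6, 10, 3, 5, 8]

# Two-sided test: do the two distributions of resolution time differ?
stat, p_value = mannwhitneyu(ml_issue_days, non_ml_issue_days, alternative="two-sided")
print(f"U={stat:.1f}, p={p_value:.3f}")
```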
arXiv Detail & Related papers (2022-09-20T18:12:12Z)
- Bugs in Machine Learning-based Systems: A Faultload Benchmark [16.956588187947993]
There is no standard benchmark of bugs with which to assess the performance of tools that aim to improve the quality of ML-based systems, compare them, and discuss their advantages and weaknesses.
In this study, we first investigate the verifiability of the bugs in ML-based systems and show the most important factors in each one.
We provide a benchmark, namely defect4ML, that satisfies all criteria of a standard benchmark, i.e., relevance, fairness, verifiability, and usability.
arXiv Detail & Related papers (2022-06-24T14:20:34Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- OmniXAI: A Library for Explainable AI [98.07381528393245]
We introduce OmniXAI, an open-source Python library for eXplainable AI (XAI).
It offers omni-way explainable AI capabilities and various interpretable machine learning techniques.
For practitioners, the library provides an easy-to-use unified interface to generate the explanations for their applications.
arXiv Detail & Related papers (2022-06-01T11:35:37Z)
- The Prevalence of Code Smells in Machine Learning projects [9.722159563454436]
Static code analysis can be used to find potential defects in the source code, opportunities, and violations of common coding standards.
We gathered a dataset of 74 open-source Machine Learning projects, installed their dependencies and ran Pylint on them.
This resulted in a top-20 list of detected code smells per category.
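A rough sketch of that kind of analysis: run Pylint on a project from a script and count the reported messages per category and per smell. The project path is a made-up example, and a real replication would pin Pylint's version and configuration.

```python
import json
import subprocess
from collections import Counter

def count_code_smells(project_dir):
    """Run Pylint on a project and count messages per category and per message symbol."""
    result = subprocess.run(
        ["pylint", project_dir, "--output-format=json", "--recursive=y"],  # --recursive needs Pylint >= 2.13
        capture_output=True, text=True,
    )
    messages = json.loads(result.stdout or "[]")
    by_category = Counter(m["type"] for m in messages)      # convention/refactor/warning/error
    by_smell = Counter(m["symbol"] for m in messages)       # e.g. unused-import, invalid-name
    return by_category, by_smell.most_common(20)            # "top 20" smells, as in the study

# Hypothetical usage on a cloned ML project:
# categories, top20 = count_code_smells("some-ml-project/")
```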
arXiv Detail & Related papers (2021-03-06T16:01:54Z)