Characterizing Bugs in Python and R Data Analytics Programs
- URL: http://arxiv.org/abs/2306.08632v1
- Date: Wed, 14 Jun 2023 16:50:01 GMT
- Title: Characterizing Bugs in Python and R Data Analytics Programs
- Authors: Shibbir Ahmed, Mohammad Wardat, Hamid Bagheri, Breno Dantas Cruz,
Hridesh Rajan
- Abstract summary: We present a comprehensive study of 5,068 Stack Overflow posts, 1,800 bug fix commits from GitHub repositories, and GitHub issues of the most used libraries to understand bugs in R and Python.
While both R and Python have bugs due to inexperience with data analysis, Python sees significantly more data preprocessing bugs than R.
While R has a slightly higher readability barrier for data analysts, the statistical power of R leads to fewer performance bugs.
- Score: 17.4802455319052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: R and Python are among the most popular languages used in many critical data
analytics tasks. However, we still do not fully understand the capabilities of
these two languages w.r.t. bugs encountered in data analytics tasks. What type
of bugs are common? What are the main root causes? What is the relation between
bugs and root causes? How to mitigate these bugs? We present a comprehensive
study of 5,068 Stack Overflow posts, 1,800 bug fix commits from GitHub
repositories, and several GitHub issues of the most used libraries to
understand bugs in R and Python. Our key findings include: while both R and
Python have bugs due to inexperience with data analysis, Python sees
significantly more data preprocessing bugs than R. Developers experience
significantly more data flow bugs in R because intermediate results are often
implicit. We also found that changes and bugs in packages and libraries cause
more bugs in R than in Python, while package or library misselection and
conflicts cause more bugs in Python than in R. While R has a slightly higher
readability barrier for data analysts, the statistical power of R leads to
fewer performance bugs. In terms of data visualization, R packages
have significantly more bugs than Python libraries. We also identified a strong
correlation between comparable packages in R and Python despite their
linguistic and methodological differences. Lastly, we contribute a large
dataset of manually verified R and Python bugs.
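To make the data preprocessing finding concrete, the following is a minimal, hypothetical pandas sketch (not drawn from the paper's dataset): a left merge silently introduces NaN and upcasts an integer column, and a blanket fillna then distorts downstream statistics, which is the kind of preprocessing bug the study reports as more common in Python.

```python
import pandas as pd

# Hypothetical illustration (not from the paper's dataset) of a common
# pandas preprocessing bug: a left merge introduces NaN in a numeric
# column, which silently upcasts int64 to float64.
orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 20, 30]})
users = pd.DataFrame({"user_id": [10, 30], "age": [25, 40]})

merged = orders.merge(users, on="user_id", how="left")
print(merged["age"].dtype)  # float64, no longer int64

# Buggy cleanup: a blanket fillna(0) invents an age of 0 for user 20,
# which quietly distorts any downstream mean or group statistics.
buggy_mean = merged["age"].fillna(0).mean()

# More careful handling: keep the missing value explicit, or switch to a
# nullable integer dtype so the gap stays visible.
safe_mean = merged["age"].mean()               # NaN is skipped by default
merged["age"] = merged["age"].astype("Int64")  # pandas nullable integer
print(buggy_mean, safe_mean, merged["age"].dtype)
```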
Related papers
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- Causal-learn: Causal Discovery in Python [53.17423883919072]
Causal discovery aims at revealing causal relations from observational data.
causal-learn is an open-source Python library for causal discovery.
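A minimal usage sketch of causal-learn is shown below; it assumes the library's documented PC-algorithm entry point, and the synthetic data-generating process is purely illustrative.

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

# Hypothetical linear chain X -> Y -> Z, 1000 samples (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(size=1000)
z = 0.5 * y + rng.normal(size=1000)
data = np.column_stack([x, y, z])

# Run the PC algorithm with the Fisher-z conditional independence test.
cg = pc(data, alpha=0.05, indep_test="fisherz")

# The recovered causal graph is exposed as a general graph object.
print(cg.G)
```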
arXiv Detail & Related papers (2023-07-31T05:00:35Z)
- An Empirical Study on Bugs Inside PyTorch: A Replication Study [10.848682558737494]
We characterize bugs in the PyTorch library, a very popular deep learning framework.
Our results highlight that PyTorch bugs are more like bugs in traditional software projects than bugs related to deep learning characteristics.
arXiv Detail & Related papers (2023-07-25T19:23:55Z)
- Understanding Bugs in Multi-Language Deep Learning Frameworks [12.524231041454044]
Deep learning frameworks (DLFs) suffer from bugs caused by the use of multiple programming languages (PLs).
We analyzed 1,497 bugs in three MPL DLFs, namely MXNet, PyTorch, and TensorFlow.
The PL combination of Python and C/C++ is the most used, appearing in the fixes of more than 92% of MPL bugs across all DLFs.
arXiv Detail & Related papers (2023-03-05T15:19:37Z)
- Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
- DADApy: Distance-based Analysis of DAta-manifolds in Python [51.37841707191944]
DADApy is a Python software package for analysing and characterising high-dimensional data.
It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics.
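A minimal sketch of this workflow is given below; the Data class and the method and attribute names follow the package documentation as an assumption, and the toy dataset is hypothetical.

```python
import numpy as np
from dadapy import Data

# Hypothetical data: 1000 points on a noisy circle embedded in 5 dimensions.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=1000)
X = np.hstack([np.column_stack([np.cos(theta), np.sin(theta)]),
               0.01 * rng.normal(size=(1000, 3))])

d = Data(X)

# Estimate the intrinsic dimension with the 2NN estimator, then run
# density estimation and density-peak clustering, following the workflow
# shown in the package documentation (attribute name assumed from the docs).
d.compute_id_2NN()
print("estimated intrinsic dimension:", d.intrinsic_dim)

d.compute_density_PAk()
d.compute_clustering_ADP()
```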
arXiv Detail & Related papers (2022-05-04T08:41:59Z)
- PyHHMM: A Python Library for Heterogeneous Hidden Markov Models [63.01207205641885]
PyHHMM is an object-oriented Python implementation of Heterogeneous Hidden Markov Models (HHMMs).
PyHHMM emphasizes features not supported in similar available frameworks: a heterogeneous observation model, missing data inference, different model order selection criteria, and semi-supervised training.
PyHHMM relies on the numpy, scipy, scikit-learn, and seaborn Python packages, and is distributed under the Apache-2.0 License.
arXiv Detail & Related papers (2022-01-12T07:32:36Z)
- Identifying non-natural language artifacts in bug reports [1.464410818828473]
We present a machine learning based approach to classify content into natural language and artifacts at line level in Python.
We show how data from GitHub issue trackers can be used for automated training set generation.
Our model scores at 0.95 ROC-AUC and 0.93 F1 against our manually annotated validation set, and classifies 10k lines in 0.72 seconds.
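The paper's own model is not reproduced here; the sketch below is only an illustrative scikit-learn baseline for the same task, trained on a handful of hypothetical labeled lines, to show what line-level natural-language-vs-artifact classification looks like in code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training lines: 1 = natural language, 0 = artifact
# (stack trace, code, log output). Not the paper's dataset or model.
lines = [
    "The installer crashes when the config file is missing.",
    "Steps to reproduce: run the script twice in a row.",
    'File "main.py", line 12, in <module>',
    "TypeError: unsupported operand type(s) for +: 'int' and 'str'",
    "pip install mypackage==1.2.3",
    "I expected the upload to finish without errors.",
]
labels = [1, 1, 0, 0, 0, 1]

# Character n-grams distinguish code/log lines from prose reasonably well.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(lines, labels)

print(clf.predict(["ValueError: invalid literal for int()",
                   "This only happens on Windows machines."]))
```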
arXiv Detail & Related papers (2021-10-04T11:33:51Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
- Advaita: Bug Duplicity Detection System [1.9624064951902522]
The duplicate bug rate (% of duplicate bugs) ranges from single digits (1 to 9%) to double digits (40%), depending on product maturity, code size, and the number of engineers working on the project.
Detecting duplicity deals with identifying whether any two bugs convey the same meaning.
The approach considers multiple sets of features, namely basic text statistical features, semantic features, and contextual features.
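Advaita's actual feature pipeline is not reproduced here; the sketch below is a generic TF-IDF plus cosine-similarity baseline over hypothetical bug summaries, illustrating only the basic text-similarity family of features mentioned above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical bug summaries; this is an illustrative baseline, not the
# paper's system.
reports = [
    "App crashes on startup when the cache directory is missing",
    "Application crashes on startup if the cache directory does not exist",
    "Dark mode setting is not persisted after restart",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(reports)
similarity = cosine_similarity(vectors)

# Flag pairs above a similarity threshold as duplicate candidates.
threshold = 0.3
for i in range(len(reports)):
    for j in range(i + 1, len(reports)):
        if similarity[i, j] >= threshold:
            print(f"possible duplicates: report {i} and report {j} "
                  f"(similarity {similarity[i, j]:.2f})")
```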
arXiv Detail & Related papers (2020-01-24T04:48:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.