An Empirical Study of API Misuses of Data-Centric Libraries
- URL: http://arxiv.org/abs/2408.15853v1
- Date: Wed, 28 Aug 2024 15:15:52 GMT
- Title: An Empirical Study of API Misuses of Data-Centric Libraries
- Authors: Akalanka Galappaththi, Sarah Nadi, Christoph Treude,
- Abstract summary: This paper contributes an empirical study of API misuses of five data-centric libraries that cover areas such as data processing, numerical computation, machine learning, and visualization.
We identify misuses of these libraries by analyzing data from both Stack Overflow and GitHub.
- Score: 9.667988837321943
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developers rely on third-party library Application Programming Interfaces (APIs) when developing software. However, libraries typically come with assumptions and API usage constraints, whose violation results in API misuse. API misuses may result in crashes or incorrect behavior. Even though API misuse is a well-studied area, a recent study of API misuse of deep learning libraries showed that the nature of these misuses and their symptoms are different from misuses of traditional libraries, and as a result highlighted potential shortcomings of current misuse detection tools. We speculate that these observations may not be limited to deep learning API misuses but may stem from the data-centric nature of these APIs. Data-centric libraries often deal with diverse data structures, intricate processing workflows, and a multitude of parameters, which can make them inherently more challenging to use correctly. Therefore, understanding the potential misuses of these libraries is important to avoid unexpected application behavior. To this end, this paper contributes an empirical study of API misuses of five data-centric libraries that cover areas such as data processing, numerical computation, machine learning, and visualization. We identify misuses of these libraries by analyzing data from both Stack Overflow and GitHub. Our results show that many of the characteristics of API misuses observed for deep learning libraries extend to misuses of the data-centric library APIs we study. We also find that developers tend to misuse APIs from data-centric libraries, regardless of whether the API directive appears in the documentation. Overall, our work exposes the challenges of API misuse in data-centric libraries, rather than only focusing on deep learning libraries. Our collected misuses and their characterization lay groundwork for future research to help reduce misuses of these libraries.
Related papers
- A Systematic Evaluation of Large Code Models in API Suggestion: When, Which, and How [53.65636914757381]
API suggestion is a critical task in modern software development.
Recent advancements in large code models (LCMs) have shown promise in the API suggestion task.
arXiv Detail & Related papers (2024-09-20T03:12:35Z) - A Solution-based LLM API-using Methodology for Academic Information Seeking [49.096714812902576]
SoAy is a solution-based LLM API-using methodology for academic information seeking.
It uses code with a solution as the reasoning method, where a solution is a pre-constructed API calling sequence.
Results show a 34.58-75.99% performance improvement compared to state-of-the-art LLM API-based baselines.
arXiv Detail & Related papers (2024-05-24T02:44:14Z) - Lightweight Syntactic API Usage Analysis with UCov [0.0]
We present a novel conceptual framework designed to assist library maintainers in understanding the interactions allowed by their APIs.
These customizable models enable library maintainers to improve their design ahead of release, reducing friction during evolution.
We implement these models for Java libraries in a new tool UCov and demonstrate its capabilities on three libraries exhibiting diverse styles of interaction.
arXiv Detail & Related papers (2024-02-19T10:33:41Z) - Causal-learn: Causal Discovery in Python [53.17423883919072]
Causal discovery aims at revealing causal relations from observational data.
$textitcausal-learn$ is an open-source Python library for causal discovery.
arXiv Detail & Related papers (2023-07-31T05:00:35Z) - Private-Library-Oriented Code Generation with Large Language Models [52.73999698194344]
This paper focuses on utilizing large language models (LLMs) for code generation in private libraries.
We propose a novel framework that emulates the process of programmers writing private code.
We create four private library benchmarks, including TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval.
arXiv Detail & Related papers (2023-07-28T07:43:13Z) - An Empirical Study on Bugs Inside PyTorch: A Replication Study [10.848682558737494]
We characterize bugs in the PyTorch library, a very popular deep learning framework.
Our results highlight that PyTorch bugs are more like traditional software projects bugs, than related to deep learning characteristics.
arXiv Detail & Related papers (2023-07-25T19:23:55Z) - LibAUC: A Deep Learning Library for X-Risk Optimization [43.32145407575245]
This paper introduces the award-winning deep learning (DL) library called LibAUC.
LibAUC implements state-of-the-art algorithms towards optimizing a family of risk functions named X-risks.
arXiv Detail & Related papers (2023-06-05T17:43:46Z) - torchgfn: A PyTorch GFlowNet library [56.071033896777784]
torchgfn is a PyTorch library that aims to address this need.
It provides users with a simple API for environments and useful abstractions for samplers and losses.
arXiv Detail & Related papers (2023-05-24T00:20:59Z) - SequeL: A Continual Learning Library in PyTorch and JAX [50.33956216274694]
SequeL is a library for Continual Learning that supports both PyTorch and JAX frameworks.
It provides a unified interface for a wide range of Continual Learning algorithms, including regularization-based approaches, replay-based approaches, and hybrid approaches.
We release SequeL as an open-source library, enabling researchers and developers to easily experiment and extend the library for their own purposes.
arXiv Detail & Related papers (2023-04-21T10:00:22Z) - Do Not Take It for Granted: Comparing Open-Source Libraries for Software
Development Effort Estimation [9.224578642189023]
This paper aims at raising awareness of the differences incurred when using different Machine Learning (ML) libraries for software development effort estimation (SEE)
We investigate 4 deterministic machine learners as provided by 3 of the most popular ML open-source libraries written in different languages (namely, Scikit-Learn, Caret and Weka)
The results of our study reveal that the predictions provided by the 3 libraries differ in 95% of the cases on average across a total of 105 cases studied.
arXiv Detail & Related papers (2022-07-04T20:06:40Z) - giotto-tda: A Topological Data Analysis Toolkit for Machine Learning and
Data Exploration [4.8353738137338755]
giotto-tda is a Python library that integrates high-performance topological data analysis with machine learning.
The library's ability to handle various types of data is rooted in a wide range of preprocessing techniques.
arXiv Detail & Related papers (2020-04-06T10:53:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.