Related papers: An Empirical Study of API Misuses of Data-Centric Libraries

An Empirical Study of API Misuses of Data-Centric Libraries

URL: http://arxiv.org/abs/2408.15853v1
Date: Wed, 28 Aug 2024 15:15:52 GMT
Title: An Empirical Study of API Misuses of Data-Centric Libraries
Authors: Akalanka Galappaththi, Sarah Nadi, Christoph Treude,
Abstract summary: This paper contributes an empirical study of API misuses of five data-centric libraries that cover areas such as data processing, numerical computation, machine learning, and visualization. We identify misuses of these libraries by analyzing data from both Stack Overflow and GitHub.
Score: 9.667988837321943
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Developers rely on third-party library Application Programming Interfaces (APIs) when developing software. However, libraries typically come with assumptions and API usage constraints, whose violation results in API misuse. API misuses may result in crashes or incorrect behavior. Even though API misuse is a well-studied area, a recent study of API misuse of deep learning libraries showed that the nature of these misuses and their symptoms are different from misuses of traditional libraries, and as a result highlighted potential shortcomings of current misuse detection tools. We speculate that these observations may not be limited to deep learning API misuses but may stem from the data-centric nature of these APIs. Data-centric libraries often deal with diverse data structures, intricate processing workflows, and a multitude of parameters, which can make them inherently more challenging to use correctly. Therefore, understanding the potential misuses of these libraries is important to avoid unexpected application behavior. To this end, this paper contributes an empirical study of API misuses of five data-centric libraries that cover areas such as data processing, numerical computation, machine learning, and visualization. We identify misuses of these libraries by analyzing data from both Stack Overflow and GitHub. Our results show that many of the characteristics of API misuses observed for deep learning libraries extend to misuses of the data-centric library APIs we study. We also find that developers tend to misuse APIs from data-centric libraries, regardless of whether the API directive appears in the documentation. Overall, our work exposes the challenges of API misuse in data-centric libraries, rather than only focusing on deep learning libraries. Our collected misuses and their characterization lay groundwork for future research to help reduce misuses of these libraries.

Related papers

Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities like overflows and use buffer-free errors. Traditional fuzzing struggles with the complexity and API diversity of DL libraries. We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z)
Subgraph-Oriented Testing for Deep Learning Libraries [9.78188667672054]
We propose SORT (Subgraph-Oriented Realistic Testing) to test Deep Learning (DL) libraries on different hardware platforms. SORT takes popular API interaction patterns, represented as frequent subgraphs of model graphs, as test subjects. SORT achieves a 100% valid input generation rate, detects more precision bugs than existing methods, and reveals interaction-related bugs missed by single-API testing.
arXiv Detail & Related papers (2024-12-09T12:10:48Z)
ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration [70.26807758443675]
ExploraCoder is a training-free framework that empowers large language models to invoke unseen APIs in code solution. We show that ExploraCoder significantly improves performance for models lacking prior API knowledge, achieving an absolute increase of 11.24% over niave RAG approaches and 14.07% over pretraining methods in pass@10.
arXiv Detail & Related papers (2024-12-06T19:00:15Z)
The Seeds of the FUTURE Sprout from History: Fuzzing for Unveiling Vulnerabilities in Prospective Deep-Learning Libraries [14.260990784121423]
Future is the first universal fuzzing framework tailored for newly introduced and prospective DL libraries. It uses historical bug information from existing libraries and fine-tunes LLMs for specialized code generation. It significantly outperforms existing fuzzers in bug detection, success rate of bug reproduction, validity rate of code generation, and API coverage.
arXiv Detail & Related papers (2024-12-02T09:33:28Z)
A Systematic Evaluation of Large Code Models in API Suggestion: When, Which, and How [53.65636914757381]
API suggestion is a critical task in modern software development. Recent advancements in large code models (LCMs) have shown promise in the API suggestion task.
arXiv Detail & Related papers (2024-09-20T03:12:35Z)
A Solution-based LLM API-using Methodology for Academic Information Seeking [49.096714812902576]
SoAy is a solution-based LLM API-using methodology for academic information seeking. It uses code with a solution as the reasoning method, where a solution is a pre-constructed API calling sequence. Results show a 34.58-75.99% performance improvement compared to state-of-the-art LLM API-based baselines.
arXiv Detail & Related papers (2024-05-24T02:44:14Z)
Lightweight Syntactic API Usage Analysis with UCov [0.0]
We present a novel conceptual framework designed to assist library maintainers in understanding the interactions allowed by their APIs. These customizable models enable library maintainers to improve their design ahead of release, reducing friction during evolution. We implement these models for Java libraries in a new tool UCov and demonstrate its capabilities on three libraries exhibiting diverse styles of interaction.
arXiv Detail & Related papers (2024-02-19T10:33:41Z)
Causal-learn: Causal Discovery in Python [53.17423883919072]
Causal discovery aims at revealing causal relations from observational data. $textitcausal-learn$ is an open-source Python library for causal discovery.
arXiv Detail & Related papers (2023-07-31T05:00:35Z)
Private-Library-Oriented Code Generation with Large Language Models [52.73999698194344]
This paper focuses on utilizing large language models (LLMs) for code generation in private libraries. We propose a novel framework that emulates the process of programmers writing private code. We create four private library benchmarks, including TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval.
arXiv Detail & Related papers (2023-07-28T07:43:13Z)
An Empirical Study on Bugs Inside PyTorch: A Replication Study [10.848682558737494]
We characterize bugs in the PyTorch library, a very popular deep learning framework. Our results highlight that PyTorch bugs are more like traditional software projects bugs, than related to deep learning characteristics.
arXiv Detail & Related papers (2023-07-25T19:23:55Z)
LibAUC: A Deep Learning Library for X-Risk Optimization [43.32145407575245]
This paper introduces the award-winning deep learning (DL) library called LibAUC. LibAUC implements state-of-the-art algorithms towards optimizing a family of risk functions named X-risks.
arXiv Detail & Related papers (2023-06-05T17:43:46Z)
torchgfn: A PyTorch GFlowNet library [56.071033896777784]
torchgfn is a PyTorch library that aims to address this need. It provides users with a simple API for environments and useful abstractions for samplers and losses.
arXiv Detail & Related papers (2023-05-24T00:20:59Z)
SequeL: A Continual Learning Library in PyTorch and JAX [50.33956216274694]
SequeL is a library for Continual Learning that supports both PyTorch and JAX frameworks. It provides a unified interface for a wide range of Continual Learning algorithms, including regularization-based approaches, replay-based approaches, and hybrid approaches. We release SequeL as an open-source library, enabling researchers and developers to easily experiment and extend the library for their own purposes.
arXiv Detail & Related papers (2023-04-21T10:00:22Z)
Do Not Take It for Granted: Comparing Open-Source Libraries for Software Development Effort Estimation [9.224578642189023]
This paper aims at raising awareness of the differences incurred when using different Machine Learning (ML) libraries for software development effort estimation (SEE) We investigate 4 deterministic machine learners as provided by 3 of the most popular ML open-source libraries written in different languages (namely, Scikit-Learn, Caret and Weka) The results of our study reveal that the predictions provided by the 3 libraries differ in 95% of the cases on average across a total of 105 cases studied.
arXiv Detail & Related papers (2022-07-04T20:06:40Z)
giotto-tda: A Topological Data Analysis Toolkit for Machine Learning and Data Exploration [4.8353738137338755]
giotto-tda is a Python library that integrates high-performance topological data analysis with machine learning. The library's ability to handle various types of data is rooted in a wide range of preprocessing techniques.
arXiv Detail & Related papers (2020-04-06T10:53:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.