Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep
Learning Projects
- URL: http://arxiv.org/abs/2402.16546v1
- Date: Mon, 26 Feb 2024 13:08:44 GMT
- Title: Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep
Learning Projects
- Authors: Han Wang, Sijia Yu, Chunyang Chen, Burak Turhan, Xiaodong Zhu
- Abstract summary: Deep Learning (DL) models have rapidly advanced, focusing on achieving high performance through testing model accuracy and robustness.
However, it remains unclear whether DL projects, as software systems, are thoroughly tested or functionally correct, even though they need to be treated and tested like other software systems.
We empirically study the unit tests in open-source DL projects, analyzing 9,129 projects from GitHub.
- Score: 24.712437703214547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Learning (DL) models have rapidly advanced, focusing on achieving high
performance through testing model accuracy and robustness. However, it remains
unclear whether DL projects, as software systems, are thoroughly tested or
functionally correct, even though they need to be treated and tested like other
software systems. Therefore, we empirically study the unit tests in open-source
DL projects, analyzing 9,129 projects from GitHub. We find that: 1) unit-tested
DL projects correlate positively with open-source project metrics and have a
higher pull-request acceptance rate, 2) 68% of the sampled DL projects are not
unit tested at all, and 3) the layer and utility (utils) components of DL models
have the most unit tests. Based on these findings and previous research
outcomes, we built a mapping taxonomy between unit tests and faults in DL
projects. We discuss the implications of our findings for developers and
researchers and highlight the need for unit testing in open-source DL projects
to ensure their reliability and stability. The study contributes to this
community by raising awareness of the importance of unit testing in DL projects
and encouraging further research in this area.
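To make the third finding concrete, the following is a minimal, hypothetical sketch (in Python with PyTorch, runnable under pytest) of the kind of unit tests the study reports as most common: tests targeting a model layer and a utility (utils) function. The ScaledLinear layer and normalize helper are invented purely for illustration and are not taken from the paper or any of the studied projects.

import torch
import torch.nn as nn


def normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Hypothetical utility: shift and scale a tensor to zero mean, unit variance.
    return (x - x.mean()) / (x.std() + eps)


class ScaledLinear(nn.Module):
    # Hypothetical layer: a linear projection followed by a constant scaling.
    def __init__(self, in_features: int, out_features: int, scale: float = 0.5):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) * self.scale


def test_layer_output_shape():
    # Layer test: the output shape must match the configured dimensions.
    layer = ScaledLinear(in_features=16, out_features=4)
    out = layer(torch.randn(8, 16))
    assert out.shape == (8, 4)


def test_normalize_utility():
    # Utility test: the helper should produce an approximately zero mean.
    z = normalize(torch.randn(100))
    assert torch.isclose(z.mean(), torch.tensor(0.0), atol=1e-5)

Running pytest on such a file exercises the layer's forward pass and the helper in isolation, which is the granularity of unit testing the study measures.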
Related papers
- Which Combination of Test Metrics Can Predict Success of a Software Project? A Case Study in a Year-Long Project Course [1.553083901660282]
Testing plays an important role in securing the success of a software development project.
We investigate whether we can quantify the effects various types of testing have on functional suitability.
arXiv Detail & Related papers (2024-08-22T04:23:51Z) - A System for Automated Unit Test Generation Using Large Language Models and Assessment of Generated Test Suites [1.4563527353943984]
Large Language Models (LLMs) have been applied to various aspects of software development.
We present AgoneTest: an automated system for generating test suites for Java projects.
arXiv Detail & Related papers (2024-08-14T23:02:16Z) - A Tale of Two DL Cities: When Library Tests Meet Compiler [12.751626834965231]
We propose OPERA, which extracts domain knowledge from the test inputs of DL libraries and constructs diverse tests from those inputs.
It incorporates a diversity-based test prioritization strategy to migrate and execute those test inputs.
arXiv Detail & Related papers (2024-07-23T16:35:45Z) - Automatic benchmarking of large multimodal models via iterative experiment programming [71.78089106671581]
We present APEx, the first framework for automatic benchmarking of LMMs.
Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand and to progressively compile a scientific report.
The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions.
arXiv Detail & Related papers (2024-06-18T06:43:46Z) - Testing in the Evolving World of DL Systems: Insights from Python GitHub Projects [4.171555557592296]
This research investigates testing practices within DL projects in GitHub.
It focuses on aspects like test automation, the types of tests (e.g., unit, integration, and system), test suite growth rate, and evolution of testing practices across different project versions.
arXiv Detail & Related papers (2024-05-30T11:58:05Z) - How is Testing Related to Single Statement Bugs? [0.25782420501870285]
We analyzed data from the top 100 Maven-based projects on GitHub.
Our preliminary findings suggest a weak to moderate correlation, indicating that increased test coverage somewhat reduces the occurrence of SSBs.
arXiv Detail & Related papers (2024-03-27T03:31:00Z) - DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle.
Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench.
Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z) - DevEval: Evaluating Code Generation in Practical Software Projects [52.16841274646796]
We propose a new benchmark named DevEval, aligned with Developers' experiences in practical projects.
DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects.
We assess five popular LLMs on DevEval and reveal their actual abilities in code generation.
arXiv Detail & Related papers (2024-01-12T06:51:30Z) - LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean.
Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements.
This paper introduces LeanDojo: an open-source Lean playground consisting of toolkits, data, and models.
We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z) - Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z) - A Survey of Deep Active Learning [54.376820959917005]
Active learning (AL) attempts to maximize a model's performance gain while labeling the fewest samples possible.
Deep learning (DL) is data-hungry and requires a large supply of data to optimize its massive number of parameters.
Combining the two, deep active learning (DAL) has emerged.
arXiv Detail & Related papers (2020-08-30T04:28:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.