Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep
Learning Projects
- URL: http://arxiv.org/abs/2402.16546v1
- Date: Mon, 26 Feb 2024 13:08:44 GMT
- Title: Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep
Learning Projects
- Authors: Han Wang, Sijia Yu, Chunyang Chen, Burak Turhan, Xiaodong Zhu
- Abstract summary: Deep Learning (DL) models have rapidly advanced, focusing on achieving high performance through testing model accuracy and robustness.
It is unclear whether DL projects, which need to be treated and tested like other software systems, are thoroughly tested or functionally correct.
We empirically study the unit tests in open-source DL projects, analyzing 9,129 projects from GitHub.
- Score: 24.712437703214547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Learning (DL) models have rapidly advanced, focusing on achieving high
performance through testing model accuracy and robustness. However, it is
unclear whether DL projects, which need to be treated and tested like other
software systems, are thoroughly tested or functionally correct. Therefore, we
empirically study the unit tests in open-source DL projects, analyzing 9,129
projects from GitHub. We find that: 1) the presence of unit tests in DL projects
is positively correlated with open-source project metrics and with a higher
acceptance rate of pull requests, 2) 68% of the sampled DL projects are not unit
tested at all, and 3) the layer and utility (utils) components of DL models have
the most unit tests. Based on these findings and previous research
outcomes, we built a mapping taxonomy between unit tests and faults in DL
projects. We discuss the implications of our findings for developers and
researchers and highlight the need for unit testing in open-source DL projects
to ensure their reliability and stability. The study contributes to this
community by raising awareness of the importance of unit testing in DL projects
and encouraging further research in this area.
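To make finding 3 concrete, the sketch below shows what unit tests for a custom layer and a small utils function might look like in a DL project. It is only an illustration under assumed names: the ScaledDense layer, the normalize helper, and the test functions are hypothetical and are not taken from the studied projects.

```python
# Illustrative sketch only (hypothetical names, not code from the studied projects):
# pytest-style unit tests for a custom layer and a utils function, the two
# component types the study finds are unit tested most often.
import numpy as np
import tensorflow as tf


class ScaledDense(tf.keras.layers.Layer):
    """Toy custom layer: a dense projection followed by a fixed scaling factor."""

    def __init__(self, units, scale=2.0, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(units)
        self.scale = scale

    def call(self, inputs):
        return self.scale * self.dense(inputs)


def normalize(x):
    """Toy utils function: min-max normalization to the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min() + 1e-8)


def test_scaled_dense_output_shape():
    layer = ScaledDense(units=8)
    out = layer(tf.zeros((4, 16)))
    assert out.shape == (4, 8)  # batch dimension kept, features projected to 8


def test_normalize_range():
    x = np.array([3.0, 7.0, 11.0])
    result = normalize(x)
    assert result.min() == 0.0 and result.max() <= 1.0
```

Tests like these check behavioural properties (shapes, value ranges) rather than model accuracy, which is what distinguishes unit testing of DL code from the accuracy and robustness testing the abstract contrasts it with.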
Related papers
- Unit Testing Past vs. Present: Examining LLMs' Impact on Defect Detection and Efficiency [2.4936576553283283]
The integration of Large Language Models (LLMs) into software engineering has shown potential to enhance productivity.
This paper investigates whether LLM support improves defect detection effectiveness during unit testing.
arXiv Detail & Related papers (2025-02-13T22:27:55Z)
- Mock Deep Testing: Toward Separate Development of Data and Models for Deep Learning [21.563130049562357]
This research introduces mock deep testing, a methodology for unit testing deep learning applications.
To enable unit testing, we introduce a design paradigm that decomposes the workflow into distinct, manageable components.
We have developed KUnit, a framework for enabling mock deep testing for the Keras library (a rough illustration of the general mock-testing idea appears after this list).
arXiv Detail & Related papers (2025-02-11T17:11:11Z)
- ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case of LLMs.
ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z)
- A Large-scale Empirical Study on Fine-tuning Large Language Models for Unit Testing [8.22619177301814]
Large Language Models (LLMs) have shown potential in various unit testing tasks.
We present a large-scale empirical study on fine-tuning LLMs for unit testing.
arXiv Detail & Related papers (2024-12-21T13:28:11Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.
Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.
Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
- A Tale of Two DL Cities: When Library Tests Meet Compiler [12.751626834965231]
We propose OPERA to extract domain knowledge from the test inputs of DL libraries.
OPERA constructs diverse tests from these library test inputs.
It incorporates a diversity-based test prioritization strategy to migrate and execute those test inputs.
arXiv Detail & Related papers (2024-07-23T16:35:45Z)
- Automatic benchmarking of large multimodal models via iterative experiment programming [71.78089106671581]
We present APEx, the first framework for automatic benchmarking of LMMs.
Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand and progressively compile a scientific report.
The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions.
arXiv Detail & Related papers (2024-06-18T06:43:46Z)
- Testing in the Evolving World of DL Systems: Insights from Python GitHub Projects [4.171555557592296]
This research investigates testing practices within DL projects in GitHub.
It focuses on aspects like test automation, the types of tests (e.g., unit, integration, and system), test suite growth rate, and evolution of testing practices across different project versions.
arXiv Detail & Related papers (2024-05-30T11:58:05Z)
- DevEval: Evaluating Code Generation in Practical Software Projects [52.16841274646796]
We propose a new benchmark named DevEval, aligned with Developers' experiences in practical projects.
DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects.
We assess five popular LLMs on DevEval and reveal their actual abilities in code generation.
arXiv Detail & Related papers (2024-01-12T06:51:30Z)
- LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean.
Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements.
This paper introduces LeanDojo: an open-source Lean playground consisting of toolkits, data, models, and benchmarks.
We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z)
- A Survey of Deep Active Learning [54.376820959917005]
Active learning (AL) attempts to maximize the performance gain of a model by labeling the fewest samples possible.
Deep learning (DL) is greedy for data and requires a large supply of data to optimize its massive parameters.
Combining the two, deep active learning (DAL) has emerged.
arXiv Detail & Related papers (2020-08-30T04:28:31Z)
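Following up on the Mock Deep Testing (KUnit) entry above: KUnit's actual API is not given in the abstract, so the sketch below only illustrates the general mock-testing idea under assumed names, replacing the real data pipeline with small in-memory stub arrays so a Keras model-building function can be unit tested in isolation. The build_model function and the test are hypothetical.

```python
# A minimal sketch of the mock-testing idea (not KUnit's actual API): the real
# data pipeline is replaced by tiny synthetic arrays so the model-building code
# can be exercised in isolation.
import numpy as np
import tensorflow as tf


def build_model(input_dim, num_classes):
    """Production-style code under test: assembles a small Keras classifier."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model


def test_build_model_trains_on_stub_data():
    # Stand-in for the real dataset: tiny random arrays with the expected
    # shapes, so the test needs no dataset download or GPU.
    x_stub = np.random.rand(8, 4).astype("float32")
    y_stub = np.random.randint(0, 3, size=(8,))

    model = build_model(input_dim=4, num_classes=3)
    history = model.fit(x_stub, y_stub, epochs=1, verbose=0)

    # Behavioural checks only: training completes and the loss is finite.
    assert np.isfinite(history.history["loss"][0])
    assert model.output_shape == (None, 3)
```

The point of the stub data is that the model-assembly and training code paths are exercised quickly and deterministically, without coupling the unit test to the real dataset.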