Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep
Learning Projects
- URL: http://arxiv.org/abs/2402.16546v1
- Date: Mon, 26 Feb 2024 13:08:44 GMT
- Title: Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep
Learning Projects
- Authors: Han Wang, Sijia Yu, Chunyang Chen, Burak Turhan, Xiaodong Zhu
- Abstract summary: Deep Learning (DL) models have rapidly advanced, focusing on achieving high performance through testing model accuracy and robustness.
However, it remains unclear whether DL projects, as software systems, are thoroughly tested or functionally correct, even though they need to be treated and tested like other software systems.
We empirically study the unit tests in open-source DL projects, analyzing 9,129 projects from GitHub.
- Score: 24.712437703214547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Learning (DL) models have rapidly advanced, focusing on achieving high
performance through testing model accuracy and robustness. However, it remains
unclear whether DL projects, as software systems, are thoroughly tested or
functionally correct, even though they need to be treated and tested like other
software systems. Therefore, we empirically study the unit tests in open-source
DL projects, analyzing 9,129 projects from GitHub. We find that: 1) unit-tested
DL projects correlate positively with open-source project metrics and have a
higher pull-request acceptance rate, 2) 68% of the sampled DL projects are not
unit tested at all, and 3) the layer and utility (utils) components of DL models
have the most unit tests. Based on these findings and previous research
outcomes, we built a mapping taxonomy between unit tests and faults in DL
projects. We discuss the implications of our findings for developers and
researchers and highlight the need for unit testing in open-source DL projects
to ensure their reliability and stability. The study contributes to this
community by raising awareness of the importance of unit testing in DL projects
and encouraging further research in this area.
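To make the third finding concrete, the following is a minimal, hypothetical sketch (in Python with PyTorch, runnable under pytest) of the kind of unit tests the study reports as most common: tests targeting a model layer and a utility (utils) function. The ScaledLinear layer and normalize helper are invented purely for illustration and are not taken from the paper or any of the studied projects.

import torch
import torch.nn as nn


def normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Hypothetical utility: shift and scale a tensor to zero mean, unit variance.
    return (x - x.mean()) / (x.std() + eps)


class ScaledLinear(nn.Module):
    # Hypothetical layer: a linear projection followed by a constant scaling.
    def __init__(self, in_features: int, out_features: int, scale: float = 0.5):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) * self.scale


def test_layer_output_shape():
    # Layer test: the output shape must match the configured dimensions.
    layer = ScaledLinear(in_features=16, out_features=4)
    out = layer(torch.randn(8, 16))
    assert out.shape == (8, 4)


def test_normalize_utility():
    # Utility test: the helper should produce an approximately zero mean.
    z = normalize(torch.randn(100))
    assert torch.isclose(z.mean(), torch.tensor(0.0), atol=1e-5)

Running pytest on such a file exercises the layer's forward pass and the helper in isolation, which is the granularity of unit testing the study measures.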
Related papers
- Which Combination of Test Metrics Can Predict Success of a Software Project? A Case Study in a Year-Long Project Course [1.553083901660282]
Testing plays an important role in securing the success of a software development project.
We investigate whether we can quantify the effects various types of testing have on functional suitability.
arXiv Detail & Related papers (2024-08-22T04:23:51Z) - A System for Automated Unit Test Generation Using Large Language Models and Assessment of Generated Test Suites [1.4563527353943984]
Large Language Models (LLMs) have been applied to various aspects of software development.
We present AgoneTest: an automated system for generating test suites for Java projects.
arXiv Detail & Related papers (2024-08-14T23:02:16Z) - A Tale of Two DL Cities: When Library Tests Meet Compiler [12.751626834965231]
We propose OPERA, which extracts domain knowledge from the test inputs of DL libraries and constructs diverse tests from those inputs.
It incorporates a diversity-based test prioritization strategy to migrate and execute those test inputs.
arXiv Detail & Related papers (2024-07-23T16:35:45Z) - Automatic benchmarking of large multimodal models via iterative experiment programming [71.78089106671581]
We present APEx, the first framework for automatic benchmarking of LMMs.
Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand and to progressively compile a scientific report.
The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions.
arXiv Detail & Related papers (2024-06-18T06:43:46Z) - Testing in the Evolving World of DL Systems: Insights from Python GitHub Projects [4.171555557592296]
This research investigates testing practices within DL projects in GitHub.
It focuses on aspects like test automation, the types of tests (e.g., unit, integration, and system), test suite growth rate, and evolution of testing practices across different project versions.
arXiv Detail & Related papers (2024-05-30T11:58:05Z) - How is Testing Related to Single Statement Bugs? [0.25782420501870285]
We analyzed data from the top 100 Maven-based projects on GitHub.
Our preliminary findings suggest a weak to moderate correlation, indicating that increased test coverage somewhat reduces the occurrence of SSBs.
arXiv Detail & Related papers (2024-03-27T03:31:00Z) - DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle.
Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench.
Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z) - DevEval: Evaluating Code Generation in Practical Software Projects [52.16841274646796]
We propose a new benchmark named DevEval, aligned with Developers' experiences in practical projects.
DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects.
We assess five popular LLMs on DevEval and reveal their actual abilities in code generation.
arXiv Detail & Related papers (2024-01-12T06:51:30Z) - LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean.
Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements.
This paper introduces LeanDojo: an open-source Lean playground consisting of toolkits, data, and models.
We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z) - Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z) - A Survey of Deep Active Learning [54.376820959917005]
Active learning (AL) attempts to maximize a model's performance gain while labeling the fewest samples possible.
Deep learning (DL) is data-hungry and requires a large supply of data to optimize its massive number of parameters.
Combining the two, deep active learning (DAL) has emerged.
arXiv Detail & Related papers (2020-08-30T04:28:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.