Testing in the Evolving World of DL Systems: Insights from Python GitHub Projects
- URL: http://arxiv.org/abs/2405.19976v1
- Date: Thu, 30 May 2024 11:58:05 GMT
- Title: Testing in the Evolving World of DL Systems: Insights from Python GitHub Projects
- Authors: Qurban Ali, Oliviero Riganelli, Leonardo Mariani
- Abstract summary: This research investigates testing practices within DL projects in GitHub.
It focuses on aspects like test automation, the types of tests (e.g., unit, integration, and system), test suite growth rate, and evolution of testing practices across different project versions.
- Score: 4.171555557592296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the ever-evolving field of Deep Learning (DL), ensuring project quality and reliability remains a crucial challenge. This research investigates testing practices within DL projects in GitHub. It quantifies the adoption of testing methodologies, focusing on aspects like test automation, the types of tests (e.g., unit, integration, and system), test suite growth rate, and evolution of testing practices across different project versions. We analyze a subset of 300 carefully selected repositories based on quantitative and qualitative criteria. This study reports insights on the prevalence of testing practices in DL projects within the open-source community.
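To make concrete what such a study counts as a unit test in a DL codebase, the sketch below shows a minimal pytest-style test for a hypothetical preprocessing helper; the function `normalize_batch`, its shapes, and the tolerances are illustrative assumptions, not code drawn from the analyzed repositories.

```python
# Minimal pytest-style unit test for a hypothetical DL preprocessing helper.
# `normalize_batch`, its shapes, and the tolerances are illustrative
# assumptions, not code taken from the repositories analyzed in the paper.
import numpy as np
import pytest


def normalize_batch(batch: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each sample in a batch to zero mean and unit variance."""
    if batch.size == 0:
        raise ValueError("batch must be non-empty")
    mean = batch.mean(axis=1, keepdims=True)
    std = batch.std(axis=1, keepdims=True)
    return (batch - mean) / (std + eps)


def test_shape_is_preserved():
    batch = np.random.rand(4, 32).astype(np.float32)
    assert normalize_batch(batch).shape == batch.shape


def test_each_sample_is_standardized():
    batch = np.random.rand(4, 32).astype(np.float32)
    out = normalize_batch(batch)
    # Each sample should end up with (approximately) zero mean and unit variance.
    np.testing.assert_allclose(out.mean(axis=1), 0.0, atol=1e-4)
    np.testing.assert_allclose(out.std(axis=1), 1.0, atol=1e-3)


def test_empty_batch_is_rejected():
    with pytest.raises(ValueError):
        normalize_batch(np.empty((0, 8), dtype=np.float32))
```

Integration and system tests, by contrast, would exercise a full training or inference pipeline end to end rather than a single function in isolation.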
Related papers
- TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs).
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
arXiv Detail & Related papers (2024-06-06T22:07:50Z)
- Elevating Software Quality in Agile Environments: The Role of Testing Professionals in Unit Testing [0.0]
Testing is an essential quality activity in the software development process.
This paper explores the participation of test engineers in unit testing within an industrial context.
arXiv Detail & Related papers (2024-03-20T00:41:49Z)
- DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle.
Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench.
Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
- Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep Learning Projects [24.712437703214547]
Deep Learning (DL) models have rapidly advanced, focusing on achieving high performance through testing model accuracy and robustness.
It remains unclear whether DL projects, as software systems, are thoroughly tested and functionally correct when they need to be treated and tested like any other software system.
We empirically study the unit tests in open-source DL projects, analyzing 9,129 projects from GitHub.
arXiv Detail & Related papers (2024-02-26T13:08:44Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
- An empirical study of testing machine learning in the wild [35.13282520395855]
Machine and deep learning (ML/DL) algorithms have been increasingly adopted in many software systems.
Due to their inductive nature, ensuring the quality of these systems remains a significant challenge for the research community.
Recent research in ML/DL quality assurance has adapted concepts from traditional software testing, such as mutation testing, to improve reliability.
arXiv Detail & Related papers (2023-12-19T21:18:14Z)
- A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [143.14128737978342]
Test-time adaptation, an emerging paradigm, has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.
Recent progress in this paradigm highlights the significant benefits of utilizing unlabeled data for training self-adapted models prior to inference.
arXiv Detail & Related papers (2023-03-27T16:32:21Z)
- NEVIS'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision Research [96.53307645791179]
We introduce the Never-Ending VIsual-classification Stream (NEVIS'22), a benchmark consisting of a stream of over 100 visual classification tasks.
Despite being limited to classification, the resulting stream has a rich diversity of tasks, from OCR to texture analysis, scene recognition, and so forth.
Overall, NEVIS'22 poses an unprecedented challenge for current sequential learning approaches due to the scale and diversity of tasks.
arXiv Detail & Related papers (2022-11-15T18:57:46Z)
- Fairness Testing: A Comprehensive Survey and Analysis of Trends [30.637712832450525]
Unfair behaviors of Machine Learning (ML) software have garnered increasing attention and concern among software engineers.
This paper offers a comprehensive survey of existing studies in this field.
arXiv Detail & Related papers (2022-07-20T22:41:38Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests and found almost three times as many bugs as users without it (a minimal sketch of this style of behavioral test follows this list).
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
- Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework [68.96770035057716]
A/B testing is a business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries.
This paper introduces a reinforcement learning framework for carrying out A/B testing in online experiments.
arXiv Detail & Related papers (2020-02-05T10:25:02Z)
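As a companion to the CheckList entry above, here is a minimal sketch of an invariance-style behavioral test written as a plain pytest function; the `predict_sentiment` stub, the template, and the name pairs are illustrative assumptions, and this sketch does not use the CheckList library's actual API.

```python
# Sketch of a CheckList-style invariance test: substituting a person's name
# should not change a sentiment model's prediction. `predict_sentiment` is a
# stand-in stub; a real test would call the model under test instead.
import pytest


def predict_sentiment(text: str) -> str:
    """Stand-in for the model under test."""
    return "negative" if "terrible" in text else "positive"


NAME_PAIRS = [("Anna", "Maria"), ("John", "Ahmed")]
TEMPLATE = "{name} really enjoyed the flight."


@pytest.mark.parametrize("name_a,name_b", NAME_PAIRS)
def test_prediction_invariant_to_name_substitution(name_a, name_b):
    # Invariance (INV) test: the perturbation should not flip the label.
    label_a = predict_sentiment(TEMPLATE.format(name=name_a))
    label_b = predict_sentiment(TEMPLATE.format(name=name_b))
    assert label_a == label_b
```

A real behavioral suite would replace the stub with the model under test and organize many such perturbation templates across linguistic capabilities (negation, named entities, coreference, and so on).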
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.