Test case quality: an empirical study on belief and evidence
- URL: http://arxiv.org/abs/2307.06410v1
- Date: Wed, 12 Jul 2023 19:02:48 GMT
- Title: Test case quality: an empirical study on belief and evidence
- Authors: Daniel Lucrédio, Auri Marcelo Rizzo Vincenzi, Eduardo Santana de Almeida, Iftekhar Ahmed
- Abstract summary: We investigate eight hypotheses regarding what constitutes a good test case.
Despite our best efforts, we were unable to find evidence that supports these beliefs.
- Score: 8.475270520855332
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Software testing is a mandatory activity in any serious software development
process, as bugs are a reality in software development. This raises the
question of quality: good tests are effective in finding bugs, but until a test
case actually finds a bug, its effectiveness remains unknown. Therefore,
determining what constitutes a good or bad test is necessary. This is not a
simple task, and there are a number of studies that identify different
characteristics of a good test case. A previous study evaluated 29 hypotheses
regarding what constitutes a good test case, but the findings are based on
developers' beliefs, which are subjective and biased. In this paper we
investigate eight of these hypotheses, through an extensive empirical study
based on open software repositories. Despite our best efforts, we were unable
to find evidence that supports these beliefs. This indicates that, although
these hypotheses represent good software engineering advice, they are not
necessarily enough to provide the desired outcome of good testing code.
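For a sense of what such beliefs look like in practice, the following is a small, hypothetical pytest-style illustration of the kind of advice these hypotheses capture (small, focused tests with descriptive names and a single clear assertion). It is illustrative only and is not one of the eight hypotheses the paper evaluates.

```python
# Hypothetical illustration: the kind of "good test case" advice such hypotheses
# capture (small, focused, descriptively named, with one clear assertion).
# These are NOT the specific hypotheses studied in the paper.

def parse_price(text: str) -> float:
    """Toy function under test: parse a price string like '$12.50'."""
    return float(text.strip().lstrip("$"))

def test_parse_price_strips_currency_symbol():
    assert parse_price("$12.50") == 12.50

def test_parse_price_ignores_surrounding_whitespace():
    assert parse_price("  $3.00 ") == 3.00
```

The paper's conclusion is that following this kind of advice, by itself, could not be shown to make test cases more effective at finding bugs.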
Related papers
- Design choices made by LLM-based test generators prevent them from finding bugs [0.850206009406913]
This paper critically examines whether recent LLM-based test generation tools, such as Codium CoverAgent and CoverUp, can effectively find bugs or unintentionally validate faulty code.
Using real human-written buggy code as input, we evaluate these tools, showing how LLM-generated tests can fail to detect bugs and, more alarmingly, how their design can worsen the situation by validating bugs in the generated test suite and rejecting bug-revealing tests.
arXiv Detail & Related papers (2024-12-18T18:33:26Z)
- System Test Case Design from Requirements Specifications: Insights and Challenges of Using ChatGPT [1.9282110216621835]
This paper explores the effectiveness of using Large Language Models (LLMs) to generate test case designs from Software Requirements Specification (SRS) documents.
About 87 percent of the generated test cases were valid, with the remaining 13 percent either not applicable or redundant.
arXiv Detail & Related papers (2024-12-04T20:12:27Z)
- Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code.
We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z)
- Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
arXiv Detail & Related papers (2024-06-11T09:21:50Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Existing approaches produce test cases that can either be qualified as simple (e.g. unit tests) or require precise specifications.
Most testing procedures still rely on test cases written by humans to form test suites.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs (an illustrative sketch follows this entry).
arXiv Detail & Related papers (2023-10-10T05:30:12Z)
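To make the bug-reports-to-tests idea concrete, here is a hedged sketch of how a bug report might be turned into a test-generation prompt. The names `build_prompt`, `call_llm`, and `generate_reproducing_test` are hypothetical placeholders, not the paper's pipeline and not any specific LLM client API.

```python
# Hypothetical sketch: prompting an LLM with a bug report to obtain a reproducing test.
# `call_llm` is a placeholder for whatever model client would actually be used.

def build_prompt(bug_report: str, failing_module: str) -> str:
    return (
        f"You are given a bug report for the Python module `{failing_module}`.\n\n"
        f"Bug report:\n{bug_report}\n\n"
        "Write a pytest test case that reproduces this bug. "
        "Return only runnable Python code."
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for an actual LLM client call")

def generate_reproducing_test(bug_report: str, failing_module: str) -> str:
    candidate_test = call_llm(build_prompt(bug_report, failing_module))
    # In practice the candidate would be executed against the buggy and the fixed
    # versions of the code to confirm it fails before the fix and passes after it.
    return candidate_test
```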
- A Survey on What Developers Think About Testing [13.086283144520513]
We conducted a comprehensive survey with 21 questions aimed at assessing developers' current engagement with testing.
We uncover reasons that positively and negatively impact developers' motivation to test.
One approach emerging from the responses to mitigate these negative factors is to provide better recognition for developers' testing efforts.
arXiv Detail & Related papers (2023-09-03T12:18:41Z) - When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP [23.30735117217225]
We present a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture.
We propose a Code-quality Checklist and release pangoliNN, a library dedicated to testing neural models (a generic example of such a test follows this entry).
arXiv Detail & Related papers (2023-03-28T17:28:52Z)
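For context on what testing neural models can mean at the unit level, here is a generic pytest-style property test for a small numeric component. It is a sketch only; it does not use the pangoliNN library and does not reflect its API.

```python
# Generic property tests for a neural-network building block (not pangoliNN's API).
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Toy component under test: numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def test_softmax_rows_sum_to_one():
    out = softmax(np.random.default_rng(0).normal(size=(4, 7)))
    assert np.allclose(out.sum(axis=-1), 1.0)

def test_softmax_is_shift_invariant():
    # Numerical-stability property: adding a constant must not change the output.
    x = np.array([[1.0, 2.0, 3.0]])
    assert np.allclose(softmax(x), softmax(x + 100.0))
```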
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it (see the behavioral-test sketch after this entry).
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
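The following sketch shows what CheckList-style behavioral tests (a minimum functionality test and an invariance test) can look like when written as plain pytest. It does not use the actual checklist library API, and `predict_sentiment` is a hypothetical stand-in for the model under test.

```python
# Hypothetical CheckList-style behavioral tests written as plain pytest.

def predict_sentiment(text: str) -> str:
    # Placeholder model: a real test would call the NLP model under evaluation.
    return "positive" if "great" in text.lower() else "negative"

def test_mft_obvious_positive():
    # Minimum Functionality Test: an unambiguous case the model must get right.
    assert predict_sentiment("The food was great") == "positive"

def test_invariance_to_name_change():
    # Invariance test: swapping an irrelevant named entity should not flip the label.
    a = predict_sentiment("Maria said the movie was great")
    b = predict_sentiment("John said the movie was great")
    assert a == b
```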
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting (a back-of-the-envelope sketch of the classical scheme follows this entry).
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
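For intuition on the efficiency claim above, here is a back-of-the-envelope calculation for the classical, noise-free Dorfman two-stage scheme; the paper itself goes further and handles noisy tests with Bayesian sequential design.

```python
# Classical Dorfman two-stage group testing, noise-free setting.

def expected_tests_per_person(p: float, k: int) -> float:
    """Expected tests per person with prevalence p and pools of size k.

    Stage 1: one pooled test per group (1/k tests per person).
    Stage 2: if the pool is positive, probability 1 - (1 - p)**k, every member
    is retested individually (1 extra test per person).
    """
    return 1.0 / k + (1.0 - (1.0 - p) ** k)

# At 2% prevalence with pools of 8, about 0.27 tests per person are needed on
# average, versus 1.0 for individual testing, i.e. a roughly 3-4x saving.
print(expected_tests_per_person(0.02, 8))
```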
- Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework [68.96770035057716]
A/B testing is a business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries.
This paper introduces a reinforcement learning framework for carrying out A/B testing in online experiments (a minimal classical A/B comparison is sketched after this entry).
arXiv Detail & Related papers (2020-02-05T10:25:02Z)
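For contrast with the paper's reinforcement-learning framework, here is a minimal sketch of a classical, static A/B comparison: a two-proportion z-test on made-up conversion counts.

```python
# Minimal classical A/B comparison (two-proportion z-test); the numbers are invented.
from math import sqrt, erf

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

# Hypothetical experiment: old product converts 480/10000 users, new product 540/10000.
z, p = two_proportion_z_test(480, 10_000, 540, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```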
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.