Manual Tests Do Smell! Cataloging and Identifying Natural Language Test
Smells
- URL: http://arxiv.org/abs/2308.01386v1
- Date: Wed, 2 Aug 2023 19:05:36 GMT
- Title: Manual Tests Do Smell! Cataloging and Identifying Natural Language Test
Smells
- Authors: Elvys Soares, Manoel Aranda, Naelson Oliveira, Márcio Ribeiro, Rohit
Gheyi, Emerson Souza, Ivan Machado, André Santos, Baldoino Fonseca, Rodrigo
Bonifácio
- Abstract summary: Test smells indicate potential problems in the design and implementation of automated software tests.
This study aims to contribute to a catalog of test smells for manual tests.
- Score: 1.43994708364763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background: Test smells indicate potential problems in the design and
implementation of automated software tests that may negatively impact test code
maintainability, coverage, and reliability. When poorly described, manual tests
written in natural language may suffer from related problems, which enable
their analysis from the point of view of test smells. Despite the potential
detriment to manually tested software products, little is known about test
smells in manual tests, which results in many open questions regarding their
types, frequency, and harm to tests written in natural language. Aims:
Therefore, this study aims to contribute to a catalog of test smells for manual
tests. Method: We follow a two-fold empirical strategy. First, we conduct an
exploratory study of the manual tests of three systems: the Ubuntu Operating
System, the
Brazilian Electronic Voting Machine, and the User Interface of a large
smartphone manufacturer. We use our findings to propose a catalog of eight test
smells and identification rules based on syntactical and morphological text
analysis, validating our catalog with 24 in-company test engineers. Second,
using our proposals, we create a tool based on Natural Language Processing
(NLP) to analyze the subject systems' tests, validating the results. Results:
We observed the occurrence of eight test smells. A survey of 24 in-company test
professionals showed that 80.7% agreed with our catalog definitions and
examples. Our NLP-based tool achieved a precision of 92%, recall of 95%, and
f-measure of 93.5%, and running it revealed 13,169 occurrences of our
cataloged test smells in the analyzed systems. Conclusion: We contribute a
catalog of natural language test smells and novel detection strategies that
better exploit the capabilities of current NLP mechanisms, with promising
results and reduced effort to analyze tests written in different natural
languages.
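
The Method above mentions identification rules based on syntactical and morphological text analysis, and the Results report a precision of 92%, recall of 95%, and f-measure of 93.5%. The sketch below is a minimal, hypothetical illustration of such rule-based detection in Python; the smell names ("Conditional Test Logic", "Eager Step") and the regular-expression rules are assumptions for illustration only, not the authors' catalog or tool. It also recomputes the f-measure from the reported precision and recall.

```python
"""Hypothetical sketch of a rule-based natural language test smell detector,
loosely inspired by the paper's description. Smell names and rules are assumed."""
import re
from dataclasses import dataclass

@dataclass
class Finding:
    step: str
    smell: str

# Illustrative rules (assumed, not the paper's): a conditional step contains
# branching words; an "eager" step chains several actions with connectives.
CONDITIONAL = re.compile(r"\b(if|otherwise|in case|when applicable)\b", re.IGNORECASE)
CONNECTIVE = re.compile(r"\b(and then|then|and)\b", re.IGNORECASE)

def detect_smells(test_steps):
    """Return one Finding per (step, smell) pair flagged by the assumed rules."""
    findings = []
    for step in test_steps:
        if CONDITIONAL.search(step):
            findings.append(Finding(step, "Conditional Test Logic (assumed)"))
        if len(CONNECTIVE.findall(step)) >= 2:
            findings.append(Finding(step, "Eager Step (assumed)"))
    return findings

if __name__ == "__main__":
    steps = [
        "Open the settings menu and tap 'Network' and then enable Wi-Fi.",
        "If an update dialog appears, dismiss it; otherwise continue.",
        "Verify that the device connects to the saved network.",
    ]
    for finding in detect_smells(steps):
        print(f"{finding.smell}: {finding.step}")

    # Sanity check of the reported results: F = 2PR / (P + R)
    precision, recall = 0.92, 0.95
    f_measure = 2 * precision * recall / (precision + recall)
    print(f"f-measure for P=92%, R=95%: {f_measure:.1%}")  # ~93.5%, as reported
```

A detector like the one described in the abstract would rely on part-of-speech tagging and morphological analysis via an NLP library rather than bare regular expressions, but the overall rule-matching structure would be similar.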
Related papers
- Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
- Historical Test-time Prompt Tuning for Vision Foundation Models [99.96912440427192]
HisTPT is a Historical Test-time Prompt Tuning technique that memorizes the useful knowledge of the learnt test samples.
HisTPT achieves superior prompt tuning performance consistently while handling different visual recognition tasks.
arXiv Detail & Related papers (2024-10-27T06:03:15Z)
- Test smells in LLM-Generated Unit Tests [11.517293765116307]
This study explores the diffusion of test smells in Large Language Models generated unit test suites.
We analyze a benchmark of 20,500 LLM-generated test suites produced by four models across five prompt engineering techniques.
We identify and analyze the prevalence and co-occurrence of various test smells in both human written and LLM-generated test suites.
arXiv Detail & Related papers (2024-10-14T15:35:44Z)
- SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists [59.08999823652293]
We propose SYNTHEVAL to generate a wide range of test types for a comprehensive evaluation of NLP models.
In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the task-specific models consistently exhibit.
We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
arXiv Detail & Related papers (2024-08-30T17:41:30Z)
- Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases.
We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z)
- Evaluating Large Language Models in Detecting Test Smells [1.5691664836504473]
The presence of test smells can negatively impact the maintainability and reliability of software.
This study aims to evaluate the capability of Large Language Models (LLMs) in automatically detecting test smells.
arXiv Detail & Related papers (2024-07-27T14:00:05Z)
- A Catalog of Transformations to Remove Smells From Natural Language Tests [1.260984934917191]
Test smells can pose difficulties during testing activities, such as poor maintainability, non-deterministic behavior, and incomplete verification.
This paper introduces a catalog of transformations designed to remove seven natural language test smells and a companion tool implemented using Natural Language Processing (NLP) techniques.
arXiv Detail & Related papers (2024-04-25T19:23:24Z)
- Towards General Error Diagnosis via Behavioral Testing in Machine Translation [48.108393938462974]
This paper proposes a new framework for conducting behavioral testing of machine translation (MT) systems.
The core idea of the proposed framework, BTPGBT, is to employ a novel bilingual translation pair generation approach.
Experimental results on various MT systems demonstrate that BTPGBT could provide comprehensive and accurate behavioral testing results.
arXiv Detail & Related papers (2023-10-20T09:06:41Z)
- Generating and Evaluating Tests for K-12 Students with Language Model Simulations: A Case Study on Sentence Reading Efficiency [45.6224547703717]
This study focuses on tests of silent sentence reading efficiency, used to assess students' reading ability over time.
We propose to fine-tune large language models (LLMs) to simulate how previous students would have responded to unseen items.
We show the generated tests closely correspond to the original test's difficulty and reliability based on crowdworker responses.
arXiv Detail & Related papers (2023-10-10T17:59:51Z)
- On the use of test smells for prediction of flaky tests [0.0]
Flaky tests hamper the evaluation of test results and can increase costs.
Existing approaches based on the use of the test case vocabulary may be context-sensitive and prone to overfitting.
We investigate the use of test smells as predictors of flaky tests.
arXiv Detail & Related papers (2021-08-26T13:21:55Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.