On the Effectiveness of LLMs for Manual Test Verifications
- URL: http://arxiv.org/abs/2409.12405v1
- Date: Thu, 19 Sep 2024 02:03:04 GMT
- Title: On the Effectiveness of LLMs for Manual Test Verifications
- Authors: Myron David Lucena Campos Peixoto, Davy de Medeiros Baia, Nathalia Nascimento, Paulo Alencar, Baldoino Fonseca, Márcio Ribeiro
- Abstract summary: This study aims to explore the use of Large Language Models (LLMs) to produce verifications for manual tests.
Open-source models Mistral-7B and Phi-3-mini-4k demonstrated effectiveness and consistency comparable to closed-source models.
There were also concerns about AI hallucinations, where verifications significantly deviated from expectations.
- Score: 1.920300814128832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Manual testing is vital for detecting issues missed by automated tests, but specifying accurate verifications is challenging. Aims: This study aims to explore the use of Large Language Models (LLMs) to produce verifications for manual tests. Method: We conducted two independent and complementary exploratory studies. The first study used 2 closed-source and 6 open-source LLMs to generate verifications for manual test steps and evaluated their similarity to the original verifications. The second study recruited software testing professionals to assess their perception of, and agreement with, the generated verifications compared to the original ones. Results: The open-source models Mistral-7B and Phi-3-mini-4k demonstrated effectiveness and consistency comparable to closed-source models like Gemini-1.5-flash and GPT-3.5-turbo in generating manual test verifications. However, the agreement level among professional testers was only slightly above 40%, indicating both promise and room for improvement. While some LLM-generated verifications were considered better than the originals, there were also concerns about AI hallucinations, where verifications deviated significantly from expectations. Conclusion: We contributed a dataset of 37,040 test verifications generated by 8 different LLMs. Although the models show potential, the relatively modest 40% agreement level highlights the need for further refinement. Enhancing the accuracy, relevance, and clarity of the generated verifications is crucial to ensure greater reliability in real-world testing scenarios.
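As a minimal sketch of how the first study's similarity check could be wired up: the abstract does not name the metric, so TF-IDF cosine similarity is an assumed stand-in, and the two verification strings are invented examples.

```python
# Illustrative sketch: scoring an LLM-generated test verification against the
# original one. The paper's actual similarity metric is not stated in the
# abstract; TF-IDF cosine similarity here is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

original = ("Verify that an error message is displayed when the user "
            "submits invalid login credentials.")
generated = ("Check that the login screen shows an error message after "
             "invalid credentials are entered.")

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([original, generated])
score = cosine_similarity(tfidf[0:1], tfidf[1:2])[0, 0]
print(f"similarity: {score:.2f}")  # higher means closer to the original wording
```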
Related papers
- Exploring and Lifting the Robustness of LLM-powered Automated Program Repair with Metamorphic Testing [31.165102332393964]
Large language model-powered Automated Program Repair (LAPR) techniques have achieved state-of-the-art bug-fixing performance.
It is crucial to conduct robustness testing on LAPR techniques before their practical deployment.
We propose MT-LAPR, a Metamorphic Testing framework exclusively for LAPR techniques.
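The abstract does not spell out MT-LAPR's metamorphic relations; the sketch below only illustrates the general idea under assumed placeholders: a semantics-preserving rewrite of the buggy program should not change whether the repair succeeds.

```python
# Hedged sketch of a metamorphic relation for LLM-powered program repair.
# `repair` and `passes_tests` are placeholders for a real LAPR tool and a
# real test suite; MT-LAPR's actual relations are more sophisticated.

def rename_identifier(code: str, old: str, new: str) -> str:
    """Naive semantics-preserving perturbation (a real tool would use an AST)."""
    return code.replace(old, new)

def repair(buggy_code: str) -> str:
    """Placeholder: invoke the LLM-powered repair technique under test."""
    raise NotImplementedError

def passes_tests(patched_code: str) -> bool:
    """Placeholder: run the project's test suite on the patched code."""
    raise NotImplementedError

def violates_metamorphic_relation(buggy_code: str) -> bool:
    # Relation: repairing the original and a renamed-but-equivalent variant
    # should succeed or fail together; a mismatch signals non-robustness.
    original_ok = passes_tests(repair(buggy_code))
    variant_ok = passes_tests(repair(rename_identifier(buggy_code, "count", "counter")))
    return original_ok != variant_ok
```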
arXiv Detail & Related papers (2024-10-10T01:14:58Z)
- Knowledge-based Consistency Testing of Large Language Models [2.9699290794642366]
We systematically expose and measure the inconsistency and knowledge gaps of Large Language Models (LLMs).
We propose an automated testing framework (called KonTest) which leverages a knowledge graph to construct test cases.
Our ablation study further shows that GPT-3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
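As a rough sketch of knowledge-based consistency testing in this spirit (the `ask_llm` call and the question phrasings are assumptions, not KonTest's actual pipeline):

```python
# Rough sketch: derive several phrasings of one knowledge-graph fact and
# check that the model under test answers them consistently. `ask_llm` is a
# placeholder for any chat-completion API.

def ask_llm(question: str) -> str:
    """Placeholder: query the model under test."""
    raise NotImplementedError

def consistent_on_fact(subject: str, relation: str, obj: str) -> bool:
    phrasings = [
        f"What is the {relation} of {subject}?",
        f"Name the {relation} of {subject}.",
        f"Tell me {subject}'s {relation}.",
    ]
    answers = {ask_llm(q).strip().lower() for q in phrasings}
    # Consistent if every phrasing yields the same answer, and correct if
    # that answer matches the knowledge-graph object; knowledge gaps show
    # up as disagreement or mismatch.
    return answers == {obj.lower()}

# e.g. consistent_on_fact("France", "capital", "Paris")
```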
arXiv Detail & Related papers (2024-07-03T11:16:54Z)
- Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.056044348209483]
Unit testing, crucial for identifying bugs in code modules like classes and methods, is often neglected by developers due to time constraints.
Large Language Models (LLMs), like GPT and Mistral, show promise in software engineering, including in test generation.
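A minimal sketch of the kind of pipeline such studies evaluate; `complete` is an assumed placeholder for the LLM call, and the prompt is illustrative rather than the paper's template:

```python
# Hedged sketch of LLM-based unit test generation. `complete` stands in for
# any LLM API (e.g. a GPT or Mistral endpoint); the prompt is illustrative.

PROMPT = (
    "Write a pytest unit test for the following Python function. "
    "Cover a typical input and at least one edge case.\n\n{source}"
)

def complete(prompt: str) -> str:
    """Placeholder: call the model that generates the test."""
    raise NotImplementedError

def generate_test(source: str) -> str:
    test_code = complete(PROMPT.format(source=source))
    # A generated test is only useful if it at least parses; studies like
    # this one then also execute it and measure coverage and correctness.
    compile(test_code, "<generated_test>", "exec")
    return test_code
```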
arXiv Detail & Related papers (2024-06-28T20:38:41Z)
- UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating large language models.
It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities.
We also evaluate the reliability of 15 popular LLMs, finding that GLM4 performs best.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL simply: the label-smoothing value is set adaptively during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
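A minimal PyTorch sketch of that idea appears below; the linear mapping from uncertainty to smoothing strength is an assumption, as the paper defines its own scheme.

```python
# Sketch of per-sample adaptive label smoothing. The uncertainty-to-smoothing
# mapping (linear scaling up to max_smoothing) is an illustrative assumption.
import torch
import torch.nn.functional as F

def uncertainty_aware_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           uncertainty: torch.Tensor,
                           max_smoothing: float = 0.2) -> torch.Tensor:
    """logits: (batch, classes); targets: (batch,) class ids;
    uncertainty: (batch,) values in [0, 1]."""
    num_classes = logits.size(-1)
    # Higher-uncertainty samples get softer (more smoothed) targets.
    smoothing = (uncertainty * max_smoothing).unsqueeze(1)   # (batch, 1)
    one_hot = F.one_hot(targets, num_classes).float()
    soft_targets = one_hot * (1.0 - smoothing) + smoothing / num_classes
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```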
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study [0.28318468414401093]
This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews.
Overall, results indicated an accuracy of around 80%, with some variability between domains.
arXiv Detail & Related papers (2024-05-23T11:24:23Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, object counts, and spatial relationships.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation [92.5132418788568]
Retrieval-augmented generation (RAG) grounds large language model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure robustness using two metrics: (i) hallucination rate, the model's tendency to hallucinate an answer when no answer is present in the passages of the non-relevant subset, and (ii) error rate, the model's failure to recognize relevant passages in the relevant subset.
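In code, the two metrics reduce to simple rates over the two subsets; the boolean per-query encoding below is an assumed representation for illustration.

```python
# Sketch of the two robustness metrics described above. Each flag encodes a
# per-query failure; the list-of-bools representation is an assumption.

def hallucination_rate(hallucinated: list[bool]) -> float:
    """Non-relevant subset: share of queries where the model produced an
    answer even though none of the retrieved passages contained one."""
    return sum(hallucinated) / len(hallucinated)

def error_rate(missed: list[bool]) -> float:
    """Relevant subset: share of queries where the model failed to recognize
    the relevant passage (abstained or answered incorrectly)."""
    return sum(missed) / len(missed)
```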
arXiv Detail & Related papers (2023-12-18T17:18:04Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
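A small sketch of one such self-evaluation loop for selective generation; `complete`, the prompts, and the threshold are assumptions rather than the paper's exact setup.

```python
# Hedged sketch of self-evaluation for selective generation. `complete` is a
# placeholder LLM call; a real pipeline would also parse the score robustly.

def complete(prompt: str) -> str:
    """Placeholder: any LLM completion API."""
    raise NotImplementedError

def answer_selectively(question: str, threshold: float = 0.7):
    answer = complete(question)
    score = float(complete(
        f"Question: {question}\nProposed answer: {answer}\n"
        "On a scale from 0 to 1, how likely is the proposed answer to be "
        "correct? Reply with a single number."
    ))
    # Selective generation: abstain (return None) when the model's
    # self-assessed confidence falls below the threshold.
    return answer if score >= threshold else None
```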
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
- Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction [61.48964753725744]
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that manual evaluation can lead to conclusions very different from automatic ones.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)