A manual categorization of new quality issues on automatically-generated
tests
- URL: http://arxiv.org/abs/2312.08826v1
- Date: Thu, 14 Dec 2023 11:19:14 GMT
- Title: A manual categorization of new quality issues on automatically-generated
tests
- Authors: Geraldine Galindo-Gutierrez, Maximiliano Narea, Alison Fernandez Blanco, Nicolas Anquetil, Juan Pablo Sandoval Alcocer
- Abstract summary: We report on a manual analysis of an external dataset consisting of 2,340 automatically generated tests.
We propose a taxonomy of 13 new quality issues grouped into four categories.
We present eight recommendations that test generators may consider to improve the quality and usefulness of the automatically generated tests.
- Score: 0.8225289576465757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diverse studies have analyzed the quality of automatically generated test
cases by using test smells as the main quality attribute. However, recent work reported that generated tests may suffer from a number of quality issues not necessarily considered in previous studies. Little is known about these issues
and their frequency within generated tests. In this paper, we report on a
manual analysis of an external dataset consisting of 2,340 automatically
generated tests. This analysis aimed at detecting new quality issues not covered by previously recognized test smells. We use thematic analysis to group and categorize the new quality issues found. As a result, we propose a taxonomy of 13 new quality issues grouped into four categories. We also report on the
frequency of these new quality issues within the dataset and present eight
recommendations that test generators may consider to improve the quality and
usefulness of the automatically generated tests.
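For illustration only (not an example taken from the paper's taxonomy or its dataset), a tool-generated unit test can exhibit problems such as uninformative names, magic values, and assertions that cannot fail; the hypothetical sketch below shows what such a test might look like:

```python
# Hypothetical sketch of a low-quality, tool-generated unit test.
# The issues flagged in the comments are generic observations about generated
# tests, not items taken from the paper's 13-issue taxonomy.
import unittest

class Account:
    """Minimal stand-in class so the example runs; not from the paper's dataset."""
    def __init__(self, owner, balance):
        self.owner, self.balance = owner, balance

    def deposit(self, amount):
        self.balance += amount
        return self.balance

class GeneratedAccountTest(unittest.TestCase):
    def test_0(self):                      # uninformative, numbered test name
        account = Account("u1", 37)        # magic values with no stated intent
        result = account.deposit(0)        # boundary input chosen blindly
        self.assertEqual(result, result)   # tautological assertion, can never fail
        self.assertIsNotNone(account)      # assertion with no diagnostic value

if __name__ == "__main__":
    unittest.main()
```

Whether any of these problems corresponds to one of the paper's 13 categories would require the taxonomy itself; the snippet only illustrates the kind of issue that goes beyond classic test smells.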
Related papers
- An Automatic Question Usability Evaluation Toolkit [1.2499537119440245]
Evaluating multiple-choice questions (MCQs) involves either labor-intensive human assessments or automated methods that prioritize readability.
We introduce SAQUET, an open-source tool that leverages the Item-Writing Flaws (IWF) rubric for a comprehensive and automated quality evaluation of MCQs.
With an accuracy rate of over 94%, our findings emphasize the limitations of existing evaluation methods and showcase potential in improving the quality of educational assessments.
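The IWF rubric is rule-based, so a small sketch of a few classic item-writing-flaw checks may help; the specific rules and thresholds below are assumptions for illustration and are not SAQUET's implementation:

```python
# Illustrative, rule-based item-writing-flaw checks (not SAQUET's actual rules).
def find_flaws(stem: str, options: list[str]) -> list[str]:
    """Return the names of simple item-writing flaws detected in an MCQ."""
    flaws = []
    # Flaw: "all of the above" style options are widely discouraged.
    if any("all of the above" in option.lower() for option in options):
        flaws.append("all-of-the-above option")
    # Flaw: one option much longer than the others often cues the answer key.
    lengths = sorted(len(option) for option in options)
    if len(lengths) > 1 and lengths[-1] > 2 * sum(lengths[:-1]) / len(lengths[:-1]):
        flaws.append("longest-option cue")
    # Flaw: negatively worded stem without emphasis.
    if " not " in f" {stem.lower()} ":
        flaws.append("unemphasized negative stem")
    return flaws

print(find_flaws("Which is not a sorting algorithm?",
                 ["Quicksort", "Mergesort", "Heapsort",
                  "Dijkstra's algorithm, which finds shortest paths"]))
```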
arXiv Detail & Related papers (2024-05-30T23:04:53Z)
- Assessing test artifact quality -- A tertiary study [1.7827643249624088]
We have carried out a systematic literature review to identify and analyze existing secondary studies on quality aspects of software testing artifacts.
We present an aggregation of the context dimensions and factors that can be used to characterize the environment in which the test case/suite quality is investigated.
arXiv Detail & Related papers (2024-02-14T19:31:57Z)
- Enriching Automatic Test Case Generation by Extracting Relevant Test Inputs from Bug Reports [8.85274953789614]
We present a technique for exploring bug reports to identify input values that can be fed to automatic test generation tools.
For Defects4J projects, our study has shown that the technique successfully extracted 68.68% of relevant inputs when using regular expressions.
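As a minimal sketch of the general idea, assuming straightforward regex-based harvesting of literal values from bug-report text (the patterns and the example report below are illustrative, not the paper's actual expressions):

```python
# Hedged sketch: harvest candidate test inputs from bug-report text with regexes.
# The pattern set is an assumption for illustration; the technique in the paper
# may use different expressions and additional processing.
import re

PATTERNS = {
    "quoted_string": re.compile(r'"([^"]{1,80})"'),
    "number": re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?![\w.])"),
    "url": re.compile(r"https?://\S+"),
}

def extract_inputs(report: str) -> dict[str, list[str]]:
    """Return candidate literal values found in the report, grouped by pattern."""
    return {name: pattern.findall(report) for name, pattern in PATTERNS.items()}

report = 'Calling parse("2023-13-40") with timeout -1 crashes; see https://issues.example.org/42'
print(extract_inputs(report))
```

Values harvested this way could then be handed to a test generator as seed inputs instead of random literals.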
arXiv Detail & Related papers (2023-12-22T18:19:33Z)
- BAND-2k: Banding Artifact Noticeable Database for Banding Detection and Quality Assessment [52.1640725073183]
Banding, also known as staircase-like contours, frequently occurs in flat areas of images/videos processed by the compression or quantization algorithms.
We build the largest banding IQA database so far, named Banding Artifact Noticeable Database (BAND-2k), which consists of 2,000 banding images.
A dual convolutional neural network is employed to concurrently learn the feature representation from the high-frequency and low-frequency maps.
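One common way to obtain such maps, assumed here only for illustration and not necessarily the paper's pipeline, is to treat a blurred copy of the image as the low-frequency map and the residual as the high-frequency map:

```python
# Hedged sketch: split a grayscale image into low- and high-frequency maps.
# Gaussian blurring as the low-pass step is an assumption; BAND-2k may use a
# different decomposition before feeding the two CNN branches.
import numpy as np
from scipy.ndimage import gaussian_filter

def frequency_maps(image: np.ndarray, sigma: float = 2.0):
    low = gaussian_filter(image.astype(np.float32), sigma=sigma)   # low-frequency map
    high = image.astype(np.float32) - low                          # high-frequency residual
    return low, high

low, high = frequency_maps(np.random.rand(64, 64).astype(np.float32))
print(low.shape, high.shape)
```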
arXiv Detail & Related papers (2023-11-29T15:56:31Z)
- Test-Case Quality -- Understanding Practitioners' Perspectives [1.7827643249624088]
We present a quality model which consists of 11 test-case quality attributes.
We identify a misalignment in defining test-case quality among practitioners and between academia and industry.
arXiv Detail & Related papers (2023-09-28T19:10:01Z)
- Manual Tests Do Smell! Cataloging and Identifying Natural Language Test Smells [1.43994708364763]
Test smells indicate potential problems in the design and implementation of automated software tests.
This study aims to contribute to a catalog of test smells for manual tests.
arXiv Detail & Related papers (2023-08-02T19:05:36Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
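In item response theory, one standard way to adjust items dynamically is to administer the item with maximum Fisher information at the current ability estimate; the 2PL sketch below is a generic illustration of that idea, not necessarily what the surveyed approaches implement:

```python
# Hedged sketch: adaptive item selection under a 2PL IRT model.
# Standard psychometrics used only to illustrate "dynamically adjusting items".
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response for ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta: float, a: float, b: float) -> float:
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 2.0)]   # (discrimination, difficulty)
theta_hat = 0.3                                             # current ability estimate
best = max(range(len(items)), key=lambda i: fisher_information(theta_hat, *items[i]))
print("next item to administer:", best)
```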
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore.
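A character-level Levenshtein metric of this kind is standard dynamic programming; the similarity normalization below is an assumption about how a bounded score might be derived and is not necessarily the paper's exact formula:

```python
# Character-level Levenshtein distance plus an assumed [0, 1] similarity score.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

print(similarity("patient reports mild headache",
                 "patient reports a mild headache"))
```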
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- Make an Omelette with Breaking Eggs: Zero-Shot Learning for Novel Attribute Synthesis [65.74825840440504]
We propose Zero Shot Learning for Attributes (ZSLA), which is the first of its kind to the best of our knowledge.
Our proposed method is able to synthesize the detectors of novel attributes in a zero-shot learning manner.
Using only 32 seen attributes on the Caltech-UCSD Birds-200-2011 dataset, our proposed method is able to synthesize the other 207 novel attributes.
arXiv Detail & Related papers (2021-11-28T15:45:54Z)
- A New Score for Adaptive Tests in Bayesian and Credal Networks [64.80185026979883]
A test is adaptive when its sequence and number of questions are dynamically tuned based on the estimated skills of the test taker.
We present an alternative family of scores, based on the mode of the posterior probabilities, and hence easier to explain.
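For a discrete posterior over skill levels, a mode-based score is simply the most probable level, in contrast to an expectation-based score; the sketch below is only a minimal illustration of that distinction, not the score family proposed in the paper:

```python
# Minimal illustration: mode of a discrete posterior over skill levels versus
# its expected value; the numbers are made up.
skill_levels = [0, 1, 2, 3]                     # hypothetical ordinal skill levels
posterior = [0.10, 0.15, 0.50, 0.25]            # P(skill | answers so far)

mode_score = skill_levels[max(range(len(posterior)), key=posterior.__getitem__)]
mean_score = sum(s * p for s, p in zip(skill_levels, posterior))

print(f"mode-based score: {mode_score}, expectation-based score: {mean_score:.2f}")
```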
arXiv Detail & Related papers (2021-05-25T20:35:42Z)
- Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study [86.62171568318716]
Large generative language models such as GPT-2 are well-known for their ability to generate text.
We show that unsupervised predictors of "page quality" emerge, able to detect low quality content without any training.
We conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
arXiv Detail & Related papers (2020-08-17T07:13:24Z)