Test case quality: an empirical study on belief and evidence
- URL: http://arxiv.org/abs/2307.06410v1
- Date: Wed, 12 Jul 2023 19:02:48 GMT
- Title: Test case quality: an empirical study on belief and evidence
- Authors: Daniel Lucrédio, Auri Marcelo Rizzo Vincenzi, Eduardo Santana de Almeida, Iftekhar Ahmed
- Abstract summary: We investigate eight hypotheses regarding what constitutes a good test case.
Despite our best efforts, we were unable to find evidence that supports these beliefs.
- Score: 8.475270520855332
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Software testing is a mandatory activity in any serious software development
process, as bugs are a reality in software development. This raises the
question of quality: good tests are effective in finding bugs, but until a test
case actually finds a bug, its effectiveness remains unknown. Therefore,
determining what constitutes a good or bad test is necessary. This is not a
simple task, and there are a number of studies that identify different
characteristics of a good test case. A previous study evaluated 29 hypotheses
regarding what constitutes a good test case, but the findings are based on
developers' beliefs, which are subjective and biased. In this paper we
investigate eight of these hypotheses, through an extensive empirical study
based on open software repositories. Despite our best efforts, we were unable
to find evidence that supports these beliefs. This indicates that, although
these hypotheses represent good software engineering advice, they are not
necessarily enough to provide the desired outcome of good testing code.
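For a sense of what such beliefs look like in practice, the following is a small, hypothetical pytest-style illustration of the kind of advice these hypotheses capture (small, focused tests with descriptive names and a single clear assertion). It is illustrative only and is not one of the eight hypotheses the paper evaluates.

```python
# Hypothetical illustration: the kind of "good test case" advice such hypotheses
# capture (small, focused, descriptively named, with one clear assertion).
# These are NOT the specific hypotheses studied in the paper.

def parse_price(text: str) -> float:
    """Toy function under test: parse a price string like '$12.50'."""
    return float(text.strip().lstrip("$"))

def test_parse_price_strips_currency_symbol():
    assert parse_price("$12.50") == 12.50

def test_parse_price_ignores_surrounding_whitespace():
    assert parse_price("  $3.00 ") == 3.00
```

The paper's conclusion is that following this kind of advice, by itself, could not be shown to make test cases more effective at finding bugs.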
Related papers
- Design choices made by LLM-based test generators prevent them from finding bugs [0.850206009406913]
This paper critically examines whether recent LLM-based test generation tools, such as Codium CoverAgent and CoverUp, can effectively find bugs or unintentionally validate faulty code.
Using real human-written buggy code as input, we evaluate these tools, showing how LLM-generated tests can fail to detect bugs and, more alarmingly, how their design can worsen the situation by validating bugs in the generated test suite and rejecting bug-revealing tests.
arXiv Detail & Related papers (2024-12-18T18:33:26Z)
- System Test Case Design from Requirements Specifications: Insights and Challenges of Using ChatGPT [1.9282110216621835]
This paper explores the effectiveness of using Large Language Models (LLMs) to generate test case designs from Software Requirements Specification (SRS) documents.
About 87 percent of the generated test cases were valid, with the remaining 13 percent either not applicable or redundant.
arXiv Detail & Related papers (2024-12-04T20:12:27Z)
- Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code.
We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z)
- Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
arXiv Detail & Related papers (2024-06-11T09:21:50Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Existing approaches produce test cases that can either be qualified as simple (e.g. unit tests) or require precise specifications.
Most testing procedures still rely on test cases written by humans to form test suites.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs (an illustrative sketch follows this entry).
arXiv Detail & Related papers (2023-10-10T05:30:12Z)
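To make the bug-reports-to-tests idea concrete, here is a hedged sketch of how a bug report might be turned into a test-generation prompt. The names `build_prompt`, `call_llm`, and `generate_reproducing_test` are hypothetical placeholders, not the paper's pipeline and not any specific LLM client API.

```python
# Hypothetical sketch: prompting an LLM with a bug report to obtain a reproducing test.
# `call_llm` is a placeholder for whatever model client would actually be used.

def build_prompt(bug_report: str, failing_module: str) -> str:
    return (
        f"You are given a bug report for the Python module `{failing_module}`.\n\n"
        f"Bug report:\n{bug_report}\n\n"
        "Write a pytest test case that reproduces this bug. "
        "Return only runnable Python code."
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for an actual LLM client call")

def generate_reproducing_test(bug_report: str, failing_module: str) -> str:
    candidate_test = call_llm(build_prompt(bug_report, failing_module))
    # In practice the candidate would be executed against the buggy and the fixed
    # versions of the code to confirm it fails before the fix and passes after it.
    return candidate_test
```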
- A Survey on What Developers Think About Testing [13.086283144520513]
We conducted a comprehensive survey with 21 questions aimed at assessing developers' current engagement with testing.
We uncover reasons that positively and negatively impact developers' motivation to test.
One approach emerging from the responses to mitigate these negative factors is to provide better recognition for developers' testing efforts.
arXiv Detail & Related papers (2023-09-03T12:18:41Z) - When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP [23.30735117217225]
We present a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture.
We propose a Code-quality Checklist and release pangoliNN, a library dedicated to testing neural models (a generic example of such a test follows this entry).
arXiv Detail & Related papers (2023-03-28T17:28:52Z)
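For context on what testing neural models can mean at the unit level, here is a generic pytest-style property test for a small numeric component. It is a sketch only; it does not use the pangoliNN library and does not reflect its API.

```python
# Generic property tests for a neural-network building block (not pangoliNN's API).
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Toy component under test: numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def test_softmax_rows_sum_to_one():
    out = softmax(np.random.default_rng(0).normal(size=(4, 7)))
    assert np.allclose(out.sum(axis=-1), 1.0)

def test_softmax_is_shift_invariant():
    # Numerical-stability property: adding a constant must not change the output.
    x = np.array([[1.0, 2.0, 3.0]])
    assert np.allclose(softmax(x), softmax(x + 100.0))
```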
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it (see the behavioral-test sketch after this entry).
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
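The following sketch shows what CheckList-style behavioral tests (a minimum functionality test and an invariance test) can look like when written as plain pytest. It does not use the actual checklist library API, and `predict_sentiment` is a hypothetical stand-in for the model under test.

```python
# Hypothetical CheckList-style behavioral tests written as plain pytest.

def predict_sentiment(text: str) -> str:
    # Placeholder model: a real test would call the NLP model under evaluation.
    return "positive" if "great" in text.lower() else "negative"

def test_mft_obvious_positive():
    # Minimum Functionality Test: an unambiguous case the model must get right.
    assert predict_sentiment("The food was great") == "positive"

def test_invariance_to_name_change():
    # Invariance test: swapping an irrelevant named entity should not flip the label.
    a = predict_sentiment("Maria said the movie was great")
    b = predict_sentiment("John said the movie was great")
    assert a == b
```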
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting (a back-of-the-envelope sketch of the classical scheme follows this entry).
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
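For intuition on the efficiency claim above, here is a back-of-the-envelope calculation for the classical, noise-free Dorfman two-stage scheme; the paper itself goes further and handles noisy tests with Bayesian sequential design.

```python
# Classical Dorfman two-stage group testing, noise-free setting.

def expected_tests_per_person(p: float, k: int) -> float:
    """Expected tests per person with prevalence p and pools of size k.

    Stage 1: one pooled test per group (1/k tests per person).
    Stage 2: if the pool is positive, probability 1 - (1 - p)**k, every member
    is retested individually (1 extra test per person).
    """
    return 1.0 / k + (1.0 - (1.0 - p) ** k)

# At 2% prevalence with pools of 8, about 0.27 tests per person are needed on
# average, versus 1.0 for individual testing, i.e. a roughly 3-4x saving.
print(expected_tests_per_person(0.02, 8))
```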
- Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework [68.96770035057716]
A/B testing is a business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries.
This paper introduces a reinforcement learning framework for carrying out A/B testing in online experiments (a minimal classical A/B comparison is sketched after this entry).
arXiv Detail & Related papers (2020-02-05T10:25:02Z)
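For contrast with the paper's reinforcement-learning framework, here is a minimal sketch of a classical, static A/B comparison: a two-proportion z-test on made-up conversion counts.

```python
# Minimal classical A/B comparison (two-proportion z-test); the numbers are invented.
from math import sqrt, erf

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

# Hypothetical experiment: old product converts 480/10000 users, new product 540/10000.
z, p = two_proportion_z_test(480, 10_000, 540, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```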
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.