Related papers: Coverage Isn't Enough: SBFL-Driven Insights into Manually Created vs. Automatically Generated Tests

URL: http://arxiv.org/abs/2512.11223v1
Date: Fri, 12 Dec 2025 02:07:31 GMT
Title: Coverage Isn't Enough: SBFL-Driven Insights into Manually Created vs. Automatically Generated Tests
Authors: Sasara Shimizu, Yoshiki Higo
Abstract summary: This study compares the SBFL score and code coverage of automatically generated tests with those of manually created tests. Our results show that automatically generated tests achieve higher branch coverage than manually created tests, but their SBFL score is lower, especially for code with deeply nested structures.
Score: 0.49416305961918044
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The testing phase is an essential part of software development, but manually creating test cases can be time-consuming. Consequently, there is a growing need for more efficient testing methods. To reduce the burden on developers, various automated test generation tools have been developed, and several studies have been conducted to evaluate the effectiveness of the tests they produce. However, most of these studies focus primarily on coverage metrics, and only a few examine how well the tests support fault localization, particularly using artificial faults introduced through mutation testing. In this study, we compare the SBFL (Spectrum-Based Fault Localization) score and code coverage of automatically generated tests with those of manually created tests. The SBFL score indicates how accurately faults can be localized using SBFL techniques. By employing SBFL score as an evaluation metric (an approach rarely used in prior studies on test generation), we aim to provide new insights into the respective strengths and weaknesses of manually created and automatically generated tests. Our experimental results show that automatically generated tests achieve higher branch coverage than manually created tests, but their SBFL score is lower, especially for code with deeply nested structures. These findings offer guidance on how to effectively combine automatically generated and manually created testing approaches.
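Illustration (not from the paper): the abstract does not say which SBFL formula or score definition the authors use. As a rough sketch of the general idea, the snippet below ranks lines by the Ochiai suspiciousness formula, one commonly used SBFL formula, chosen here only as an assumption. The function name, test names, and coverage data are all hypothetical.

```python
import math

def ochiai(coverage, failed):
    """Compute Ochiai suspiciousness per line.

    coverage: dict mapping test name -> set of covered line numbers (hypothetical data)
    failed:   set of names of failing tests
    """
    total_failed = len(failed)
    lines = set().union(*coverage.values()) if coverage else set()
    scores = {}
    for line in lines:
        # ef: failing tests that execute the line; ep: passing tests that execute it
        ef = sum(1 for t, cov in coverage.items() if line in cov and t in failed)
        ep = sum(1 for t, cov in coverage.items() if line in cov and t not in failed)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[line] = ef / denom if denom else 0.0
    return scores

# Toy spectrum: three tests, one of which fails.
coverage = {
    "test_a": {1, 2, 3},
    "test_b": {1, 2, 4},
    "test_c": {1, 4},
}
failed = {"test_b"}

scores = ochiai(coverage, failed)
# Most suspicious first; ties broken by line number.
ranking = sorted(scores, key=lambda line: (-scores[line], line))
print(ranking)
```

In this toy spectrum, lines executed by the failing test and by few passing tests rank highest. A test suite that covers many lines in every test tends to give similar scores to many lines, which is one intuition for why high coverage alone need not help fault localization.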

Related papers

Can We Classify Flaky Tests Using Only Test Code? An LLM-Based Empirical Study [40.93176986225226]
Flaky tests yield inconsistent results when they are repeatedly executed on the same code revision. Previous work evaluated approaches to train machine learning models to classify flaky tests based on identifiers in the test code.
arXiv Detail & Related papers (2026-02-05T09:15:09Z)
Automated structural testing of LLM-based agents: methods, framework, and case studies [0.05254956925594667]
LLM-based agents are rapidly being adopted across diverse domains. Current testing approaches focus on acceptance-level evaluation from the user's perspective. We present methods to enable structural testing of LLM-based agents.
arXiv Detail & Related papers (2026-01-25T11:52:30Z)
LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework [2.501198441875755]
AgoneTest is an evaluation framework for Large Language Model-generated unit tests in Java. For the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection.
arXiv Detail & Related papers (2025-11-25T15:33:00Z)
KTester: Leveraging Domain and Testing Knowledge for More Effective LLM-based Test Generation [36.93577367023509]
This paper presents KTester, a novel framework that integrates project-specific knowledge and testing domain knowledge. We evaluate KTester on multiple open-source projects, comparing it against state-of-the-art LLM-based baselines. Results demonstrate that KTester significantly outperforms existing methods across six key metrics.
arXiv Detail & Related papers (2025-11-18T07:57:58Z)
SAINT: Service-level Integration Test Generation with Program Analysis and LLM-based Agents [43.3273990835497]
SAINT is a novel white-box testing approach for service-level testing of enterprise Java applications. SAINT combines static analysis, large language models (LLMs), and LLM-based agents to automatically generate endpoint and scenario-based tests.
arXiv Detail & Related papers (2025-11-17T12:29:42Z)
Intention-Driven Generation of Project-Specific Test Cases [45.2380093475221]
We propose IntentionTest, which generates project-specific tests given the description of validation intention. We extensively evaluate IntentionTest against state-of-the-art baselines (DA, ChatTester, and EvoSuite) on 4,146 test cases from 13 open-source projects.
arXiv Detail & Related papers (2025-07-28T08:35:04Z)
Are Autonomous Web Agents Good Testers? [41.56233403862961]
Large Language Models (LLMs) offer a potential alternative by powering Autonomous Web Agents (AWAs). AWAs may serve as Autonomous Test Agents (ATAs). This paper investigates the feasibility of adapting AWAs for natural language test case execution.
arXiv Detail & Related papers (2025-04-02T08:48:01Z)
Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code. We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z)
ASTER: Natural and Multi-language Unit Test Generation with LLMs [6.259245181881262]
We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We conduct an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness.
arXiv Detail & Related papers (2024-09-04T21:46:18Z)
A System for Automated Unit Test Generation Using Large Language Models and Assessment of Generated Test Suites [1.4563527353943984]
Large Language Models (LLMs) have been applied to various aspects of software development. We present AgoneTest: an automated system for generating test suites for Java projects.
arXiv Detail & Related papers (2024-08-14T23:02:16Z)
TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.509927117191286]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases. We propose TestART, a novel unit test generation method. TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs. MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Programs Under Test (PUTs). Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles. The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z)
Sequential Kernelized Independence Testing [77.237958592189]
We design sequential kernelized independence tests inspired by kernelized dependence measures. We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision [70.05605071885914]
We propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples. We show the success of our method on the common benchmark dataset CIFAR10-C.
arXiv Detail & Related papers (2022-05-18T05:43:06Z)
Active Testing: Sample-Efficient Model Evaluation [39.200332879659456]
We introduce active testing: a new framework for sample-efficient model evaluation. Active testing addresses this by carefully selecting the test points to label. We show how to remove that bias while reducing the variance of the estimator.
arXiv Detail & Related papers (2021-03-09T10:20:49Z)
