Related papers: Mind the Gap: A Readability-Aware Metric for Test Code Complexity

URL: http://arxiv.org/abs/2506.06764v1
Date: Sat, 07 Jun 2025 11:16:13 GMT
Title: Mind the Gap: A Readability-Aware Metric for Test Code Complexity
Authors: Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé
Abstract summary: We introduce CCTR, a Test-Aware Cognitive Complexity metric tailored for unit tests. We evaluate 15,750 test suites generated by EvoSuite, GPT-4o, and Mistral Large-1024 across 350 classes from Defects4J and SF110.
Score: 13.258954013620885
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Automatically generated unit tests, whether from search-based tools like EvoSuite or from LLMs, vary significantly in structure and readability. Yet most evaluations rely on metrics like Cyclomatic Complexity and Cognitive Complexity, which were designed for functional code rather than test code. Recent studies have shown that SonarSource's Cognitive Complexity metric assigns near-zero scores to LLM-generated tests, yet its behavior on EvoSuite-generated tests and its applicability to test-specific code structures remain unexplored. We introduce CCTR, a Test-Aware Cognitive Complexity metric tailored for unit tests. CCTR integrates structural and semantic features like assertion density, annotation roles, and test composition patterns, dimensions that traditional complexity models ignore but that are critical for understanding test code. We evaluate 15,750 test suites generated by EvoSuite, GPT-4o, and Mistral Large-1024 across 350 classes from Defects4J and SF110. Results show CCTR effectively discriminates between structured and fragmented test suites, producing interpretable scores that better reflect developer-perceived effort. By bridging structural analysis and test readability, CCTR provides a foundation for more reliable evaluation and improvement of generated tests. We publicly release all data, prompts, and evaluation scripts to support replication.

Related papers

Automated Test Suite Enhancement Using Large Language Models with Few-shot Prompting [0.0]
Unit testing is essential for verifying the functional correctness of code modules. Unit tests generated by tools that employ traditional approaches, such as search-based software testing (SBST), lack readability, naturalness, and practical usability. Software repositories now include a mix of human-written tests, LLM-generated tests, and those from tools employing traditional approaches such as SBST.
arXiv Detail & Related papers (2026-02-12T18:42:49Z)
LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework [2.501198441875755]
AgoneTest is an evaluation framework for Large Language Model-generated unit tests in Java. For the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection.
arXiv Detail & Related papers (2025-11-25T15:33:00Z)
KTester: Leveraging Domain and Testing Knowledge for More Effective LLM-based Test Generation [36.93577367023509]
This paper presents KTester, a novel framework that integrates project-specific knowledge and testing domain knowledge. We evaluate KTester on multiple open-source projects, comparing it against state-of-the-art LLM-based baselines. Results demonstrate that KTester significantly outperforms existing methods across six key metrics.
arXiv Detail & Related papers (2025-11-18T07:57:58Z)
Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness [13.258954013620885]
CTSES is a composite metric that integrates CodeBLEU, METEOR, and ROUGE-L to balance behavior, lexical quality, and structural alignment. Our results show that CTSES yields more faithful and interpretable assessments, better aligned with developer expectations and human intuition than existing metrics.
arXiv Detail & Related papers (2025-06-07T11:18:17Z)
Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+), and analyzed synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z)
CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom Large Language Models (LLMs) to generate realistic test inputs. We evaluate it against several state-of-the-art REST API testing tools, including RESTGPT, a GPT-powered specification-enhancement tool. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing.
arXiv Detail & Related papers (2025-01-15T05:51:20Z)
Improving the Readability of Automatically Generated Tests using Large Language Models [7.7149881834358345]
We propose to combine the effectiveness of search-based generators with the readability of LLM generated tests. Our approach focuses on improving test and variable names produced by search-based tools, while keeping their semantics unchanged.
arXiv Detail & Related papers (2024-12-25T09:08:53Z)
StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z)
Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z)
Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.517293765116307]
Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the class level.
arXiv Detail & Related papers (2024-06-28T20:38:41Z)
Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions [0.05657375260432172]
Unit testing is a commonly-used approach in software engineering to test the correctness and robustness of written code. In this study, we explore the effect of different prompts on the quality of unit tests generated by Code Interpreter. We find that the quality of the generated unit tests is not sensitive to changes in minor details in the prompts provided.
arXiv Detail & Related papers (2023-09-30T20:36:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.