Related papers: KTester: Leveraging Domain and Testing Knowledge for More Effective LLM-based Test Generation

KTester: Leveraging Domain and Testing Knowledge for More Effective LLM-based Test Generation

URL: http://arxiv.org/abs/2511.14224v1
Date: Tue, 18 Nov 2025 07:57:58 GMT
Title: KTester: Leveraging Domain and Testing Knowledge for More Effective LLM-based Test Generation
Authors: Anji Li, Mingwei Liu, Zhenxi Chen, Zheng Pei, Zike Li, Dekun Dai, Yanlin Wang, Zibin Zheng,
Abstract summary: This paper presents KTester, a novel framework that integrates project-specific knowledge and testing domain knowledge.<n>We evaluate KTester on multiple open-source projects, comparing it against state-of-the-art LLM-based baselines.<n>Results demonstrate that KTester significantly outperforms existing methods across six key metrics.
Score: 36.93577367023509
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated unit test generation using large language models (LLMs) holds great promise but often struggles with generating tests that are both correct and maintainable in real-world projects. This paper presents KTester, a novel framework that integrates project-specific knowledge and testing domain knowledge to enhance LLM-based test generation. Our approach first extracts project structure and usage knowledge through static analysis, which provides rich context for the model. It then employs a testing-domain-knowledge-guided separation of test case design and test method generation, combined with a multi-perspective prompting strategy that guides the LLM to consider diverse testing heuristics. The generated tests follow structured templates, improving clarity and maintainability. We evaluate KTester on multiple open-source projects, comparing it against state-of-the-art LLM-based baselines using automatic correctness and coverage metrics, as well as a human study assessing readability and maintainability. Results demonstrate that KTester significantly outperforms existing methods across six key metrics, improving execution pass rate by 5.69% and line coverage by 8.83% over the strongest baseline, while requiring less time and generating fewer test cases. Human evaluators also rate the tests produced by KTester significantly higher in terms of correctness, readability, and maintainability, confirming the practical advantages of our knowledge-driven framework.

Related papers

Automated Test Suite Enhancement Using Large Language Models with Few-shot Prompting [0.0]
Unit testing is essential for verifying the functional correctness of code modules.<n>Unit tests generated by tools that employ traditional approaches, such as search-based software testing (SBST), lack readability, naturalness, and practical usability.<n>Software repositories now include a mix of human-written tests, LLM-generated tests, and those from tools employing traditional approaches such as SBST.
arXiv Detail & Related papers (2026-02-12T18:42:49Z)
Clarifying Semantics of In-Context Examples for Unit Test Generation [16.066591207494046]
We propose CLAST, a technique that systematically refines unit tests to improve their semantic clarity.<n>CLAST largely outperforms UTgen, the state-of-the-art refinement technique, in both preserving test effectiveness and enhancing semantic clarity.<n>Over 85.33% of participants in our user study preferred the semantic clarity of CLAST-refined tests.
arXiv Detail & Related papers (2025-10-02T13:15:40Z)
Hamster: A Large-Scale Study and Characterization of Developer-Written Tests [44.65515600399573]
We study developer-written tests for Java applications, covering 1.7 million test cases from open-source repositories.<n>Our results highlight that a vast majority of developer-written tests exhibit characteristics beyond the capabilities of current ATG tools.<n>We identify promising research directions that can help bridge the gap between current tool capabilities and more effective tool support for developer testing practices.
arXiv Detail & Related papers (2025-09-30T13:08:23Z)
PALM: Synergizing Program Analysis and LLMs to Enhance Rust Unit Test Coverage [14.702182387149547]
This paper presents PALM, an approach that leverages large language models (LLMs) to enhance the generation of high-coverage unit tests.<n> PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints.<n>We implement the approach and evaluate it in 15 open-source Rust crates.
arXiv Detail & Related papers (2025-06-10T17:21:21Z)
TestForge: Feedback-Driven, Agentic Test Suite Generation [7.288137795439405]
TestForge is an agentic unit testing framework designed to cost-effectively generate high-quality test suites for real-world code.<n>TestForge produces more natural and understandable tests compared to state-of-the-art search-based techniques.
arXiv Detail & Related papers (2025-03-18T20:21:44Z)
Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers.<n>We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+), and analyzed synthetic verification methods with standard, reasoning-based, and reward-based LLMs.<n>Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z)
ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case of LLMs.<n>ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z)
Improving the Readability of Automatically Generated Tests using Large Language Models [7.7149881834358345]
We propose to combine the effectiveness of search-based generators with the readability of LLM generated tests.<n>Our approach focuses on improving test and variable names produced by search-based tools, while keeping their semantics unchanged.
arXiv Detail & Related papers (2024-12-25T09:08:53Z)
StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs.<n> Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets.<n>We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.