Intention-Driven Generation of Project-Specific Test Cases
- URL: http://arxiv.org/abs/2507.20619v2
- Date: Sun, 14 Sep 2025 08:18:57 GMT
- Title: Intention-Driven Generation of Project-Specific Test Cases
- Authors: Binhang Qi, Yun Lin, Xinyi Weng, Yuhuan Huang, Chenyan Liu, Hailong Sun, Zhi Jin, Jin Song Dong
- Abstract summary: We propose IntentionTest, which generates project-specific tests given the description of validation intention. We extensively evaluate IntentionTest against state-of-the-art baselines (DA, ChatTester, and EvoSuite) on 4,146 test cases from 13 open-source projects.
- Score: 45.2380093475221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test cases are valuable assets for maintaining software quality. State-of-the-art automated test generation techniques typically focus on maximizing program branch coverage or translating focal methods into test code. In contrast to branch coverage or code-to-test translation, however, practical tests are written out of the need to validate whether a requirement has been fulfilled. Specifically, each test usually reflects a developer's validation intention for a program function: (1) what is the test scenario of the function, and (2) what is the expected behavior under that scenario? Without taking such intention into account, generated tests are less likely to be adopted in practice. In this work, we propose IntentionTest, which generates project-specific tests given a description of the validation intention. The design is motivated by two insights: (1) rationale insight: the description of validation intention, comprising a scenario description and a behavioral expectation, carries more crucial information about what to test than coverage targets or focal code do; and (2) technical insight: practical test code exhibits high duplication, indicating that existing tests are highly reusable for how to test. Therefore, IntentionTest adopts a retrieval-and-edit approach. We extensively evaluate IntentionTest against state-of-the-art baselines (DA, ChatTester, and EvoSuite) on 4,146 test cases from 13 open-source projects. The experimental results show that, given a validation intention, IntentionTest (1) generates tests far more semantically relevant to the ground-truth tests, (i) killing 39.0% more common mutants and (ii) calling up to 66.8% more project-specific APIs; and (2) generates 21.3% more passing tests.
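The retrieval step of a retrieval-and-edit pipeline can be illustrated with a minimal sketch: given a validation intention (scenario plus expected behavior), find the most similar existing test in the project to serve as the edit template. The Jaccard word-overlap similarity and the toy corpus below are illustrative assumptions, not IntentionTest's actual retrieval component.

```python
# Minimal sketch of intention-based test retrieval (the "retrieve" half
# of retrieval-and-edit). Similarity metric and corpus are illustrative
# assumptions; the paper's actual components are more sophisticated.

def tokenize(text: str) -> set[str]:
    """Lowercased word tokens for a crude lexical similarity."""
    return set(text.lower().split())

def retrieve(intention: str, test_corpus: dict[str, str]) -> str:
    """Return the name of the existing test whose source best matches
    the validation intention, by Jaccard token overlap."""
    def jaccard(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0
    query = tokenize(intention)
    return max(test_corpus,
               key=lambda name: jaccard(query, tokenize(test_corpus[name])))

# Hypothetical existing tests in the project (names and bodies invented).
corpus = {
    "test_push_increases_size": "def test_push_increases_size(): stack push size grows",
    "test_pop_empty_raises": "def test_pop_empty_raises(): pop on empty stack raises error",
}

intention = "scenario: pop on an empty stack; expectation: an error is raised"
best = retrieve(intention, corpus)
print(best)  # -> test_pop_empty_raises
```

The retrieved test would then be edited (e.g., by a language model prompted with the intention) to match the new scenario and expectation, reusing the project-specific setup and API calls already present in the template.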
Related papers
- Consistency Meets Verification: Enhancing Test Generation Quality in Large Language Models Without Ground-Truth Solutions [1.9196411948992402]
We present ConVerTest, a novel two-stage pipeline for synthesizing reliable tests without requiring prior code implementations. Experiments on the BigCodeBench and Less Basic Python Problems benchmarks demonstrate that ConVerTest improves test validity, line coverage, and mutation scores by up to 39%, 28%, and 18%, respectively.
arXiv Detail & Related papers (2026-02-11T04:40:38Z) - E-Test: E'er-Improving Test Suites [8.585182075116336]
E-Test identifies executions that have not yet been tested from large sets of scenarios. It generates new test cases that enhance the test suite. E-Test retrieves not-yet-tested execution scenarios significantly better than state-of-the-art approaches.
arXiv Detail & Related papers (2025-10-21T21:23:33Z) - Automated Test Generation from Program Documentation Encoded in Code Comments [4.696083734269232]
This paper introduces a novel test generation technique that exploits code-comment documentation constructively. We deliver test cases with names and oracles properly contextualized on the target behaviors.
arXiv Detail & Related papers (2025-04-29T20:23:56Z) - Studying the Impact of Early Test Termination Due to Assertion Failure on Code Coverage and Spectrum-based Fault Localization [48.22524837906857]
This study is the first empirical study on early test termination due to assertion failure. We investigated 207 versions of 6 open-source projects. Our findings indicate that early test termination harms both code coverage and the effectiveness of spectrum-based fault localization.
arXiv Detail & Related papers (2025-04-06T17:14:09Z) - CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z) - ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case for LLMs. ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z) - TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. It covers initial test authoring, test suite completion, and code coverage improvement. We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z) - Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases.
We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z) - Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z) - Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Existing approaches produce test cases that either can be qualified as simple (e.g. unit tests) or that require precise specifications.
Most testing procedures still rely on test cases written by humans to form test suites.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
arXiv Detail & Related papers (2023-10-10T05:30:12Z) - Learning Deep Semantics for Test Completion [46.842174440120196]
We formalize the novel task of test completion to automatically complete the next statement in a test method based on the context of prior statements and the code under test.
We develop TeCo -- a deep learning model using code semantics for test completion.
arXiv Detail & Related papers (2023-02-20T18:53:56Z) - CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z) - Unit Test Case Generation with Transformers and Focal Context [10.220204860586582]
AthenaTest aims to generate unit test cases by learning from real-world focal methods and developer-written test cases.
We introduce Methods2Test, the largest publicly available supervised parallel corpus of unit test case methods and corresponding focal methods in Java.
We evaluate AthenaTest on five Defects4J projects, generating 25K passing test cases covering 43.7% of the focal methods with only 30 attempts.
arXiv Detail & Related papers (2020-09-11T18:57:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.