Related papers: Type-aware LLM-based Regression Test Generation for Python Programs

Type-aware LLM-based Regression Test Generation for Python Programs

URL: http://arxiv.org/abs/2503.14000v2
Date: Wed, 22 Oct 2025 07:44:01 GMT
Title: Type-aware LLM-based Regression Test Generation for Python Programs
Authors: Runlin Liu, Zhe Zhang, Yunge Hu, Yuhang Lin, Xiang Gao, Hailong Sun,
Abstract summary: Test4Py is a novel framework that enhances type correctness in automated test generation for Python.<n>Test4Py integrates an iterative repair procedure that progressively refines generated test cases to improve coverage.<n>In an evaluation on 183 real-world Python modules, Test4Py achieved an average statement coverage of 83.0% and branch coverage of 70.8%.
Score: 13.631541369653066
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated regression test generation has been extensively explored, yet generating high-quality tests for Python programs remains particularly challenging. Because of the Python's dynamic typing features, existing approaches, ranging from search-based software testing (SBST) to recent LLM-driven techniques, are often prone to type errors. Hence, existing methods often generate invalid inputs and semantically inconsistent test cases, which ultimately undermine their practical effectiveness. To address these limitations, we present Test4Py, a novel framework that enhances type correctness in automated test generation for Python. Test4Py leverages the program's call graph to capture richer contextual information about parameters, and introduces a behavior-based type inference mechanism that accurately infers parameter types and construct valid test inputs. Beyond input construction, Test4Py integrates an iterative repair procedure that progressively refines generated test cases to improve coverage. In an evaluation on 183 real-world Python modules, Test4Py achieved an average statement coverage of 83.0% and branch coverage of 70.8%, outperforming state-of-the-art tools by 7.2% and 8.4%, respectively.

Related papers

Synthesizing File-Level Data for Unit Test Generation with Chain-of-Thoughts via Self-Debugging [40.29934051200609]
We propose a novel data-distillation approach to produce high-quality UT training.<n>We apply this pipeline to a large corpus of open-source projects.<n>An empirical evaluation shows that the fine-tuned model achieves high UT generation effectiveness.
arXiv Detail & Related papers (2026-02-03T06:52:54Z)
Reflective Unit Test Generation for Precise Type Error Detection with Large Language Models [13.969152395348653]
RTED is a type-aware test generation technique for automatically detecting Python type errors.<n>We show that RTED can detect 22-29 more benchmarked type errors than four state-of-the-art techniques.<n>It is also capable of producing fewer false positives, achieving an improvement of 173.9%-245.9% in precision.
arXiv Detail & Related papers (2025-07-03T05:10:33Z)
Combining Type Inference and Automated Unit Test Generation for Python [9.856068089918555]
The lack of type information in dynamically-typed programming languages, such as Python, inhibits test generators.<n>We introduce type tracing, which extracts type-related information during execution and gradually refines the available type information.<n>The approach leads to up to 90.0% more branch coverage, improved mutation scores, and to type information of similar quality to that produced by other state-of-the-art type-inference tools.
arXiv Detail & Related papers (2025-07-02T08:41:28Z)
Boosting Rust Unit Test Coverage through Hybrid Program Analysis and Large Language Models [14.536415473544146]
This paper presents PALM, an approach that leverages large language models (LLMs) to enhance the generation of high-coverage unit tests.<n> PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints.<n>We implement the approach and evaluate it in 10 open-source Rust crates.
arXiv Detail & Related papers (2025-06-10T17:21:21Z)
Examining False Positives under Inference Scaling for Mathematical Reasoning [83.97128486951999]
We systematically examine the prevalence of false positive solutions in mathematical problem solving for language models.<n>Our experimental results reveal that: (1) false positive solutions persist across different models, datasets, and decoding methods, (2) sampling-based inference time scaling methods do not alleviate the problem, and (3) the pass@N evaluation metric is more susceptible to false positives.
arXiv Detail & Related papers (2025-02-10T07:49:35Z)
Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs)<n>We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs.<n>We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z)
LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom Large Language Models (LLMs) to generate realistic test inputs.<n>We evaluate it against several state-of-the-art REST API testing tools, including RESTGPT, a GPT-powered specification-enhancement tool.<n>Our study shows that small language models can perform as well as, or better than, large language models in REST API testing.
arXiv Detail & Related papers (2025-01-15T05:51:20Z)
Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures. We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. It covers initial tests authoring, test suite completion, and code coverage improvements. We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z)
TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.833381226332574]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases.<n>We propose TestART, a novel unit test generation method.<n>TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data. We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch. Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution. TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults. Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM [32.44432906540792]
We present SymPrompt, a code-aware prompting strategy for large language models in test generation. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
arXiv Detail & Related papers (2024-01-31T18:21:49Z)
Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs. MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUTs) Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles. The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z)
A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [117.72709110877939]
Test-time adaptation (TTA) has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions. We categorize TTA into several distinct groups based on the form of test data, namely, test-time domain adaptation, test-time batch adaptation, and online test-time adaptation.
arXiv Detail & Related papers (2023-03-27T16:32:21Z)
An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation [3.9762912548964864]
This paper presents a large-scale empirical evaluation on the effectiveness of Large Language Models for automated unit test generation. We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package. We find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests.
arXiv Detail & Related papers (2023-02-13T17:13:41Z)
CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases. CodeT executes the code solutions using the generated test cases, and then chooses the best solution. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.