Fixturize: Bridging the Fixture Gap in Test Generation
- URL: http://arxiv.org/abs/2601.06615v1
- Date: Sat, 10 Jan 2026 16:47:32 GMT
- Title: Fixturize: Bridging the Fixture Gap in Test Generation
- Authors: Pengyu Xue, Chengyi Wang, Zhen Yang, Xiapu Luo, Yuxuan Zhang, Xiran Lyu, Yifei Pei, Zonghan Jia, Yichen Sun, Linhao Wu, Kunwu Zheng,
- Abstract summary: Fixturize is a diagnostic framework that proactively identifies fixture-dependent functions. It synthesizes test fixtures accordingly through an iterative, feedback-driven process.
- Score: 31.82935387488973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current Large Language Models (LLMs) have advanced automated unit test generation but face a critical limitation: they often neglect to construct the necessary test fixtures, the environmental setups required for a test to run. To bridge this gap, this paper proposes Fixturize, a diagnostic framework that proactively identifies fixture-dependent functions and synthesizes test fixtures through an iterative, feedback-driven process, thereby improving the quality of the test suites auto-generated by existing approaches. For rigorous evaluation, the authors introduce FixtureEval, a dedicated benchmark comprising 600 curated functions across two Programming Languages (PLs), i.e., Python and Java, with explicit fixture-dependency labels, enabling both the corresponding classification and generation tasks. Empirical results demonstrate that Fixturize is highly effective, achieving 88.38%-97.00% accuracy across benchmarks in identifying test-fixture dependence and significantly improving the Suite Pass rate (SuitePS) by 18.03%-42.86% on average across both PLs with the auto-generated fixtures. Owing to its maintenance of test fixtures, Fixturize further improves line/branch coverage when integrated with existing LLM-based and search-based testing tools by 16.85%/24.08% and 31.54%/119.66% on average, respectively. The findings establish fixture awareness as an essential, missing component in modern auto-testing pipelines.
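To make the fixture gap concrete, here is a minimal illustrative sketch (the function and fixture names are hypothetical, not drawn from the paper or FixtureEval): a function that reads a configuration file is fixture-dependent, and a generated test for it only runs if a fixture first materializes that file, which is the kind of environmental setup Fixturize aims to synthesize automatically.

```python
# Illustrative sketch only: load_config and config_file are hypothetical names,
# not from the paper or the FixtureEval benchmark.
import json
import pytest


def load_config(path):
    """Function under test: fixture-dependent, since it needs a file on disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


@pytest.fixture
def config_file(tmp_path):
    """Test fixture: creates the environment (a config file) the test requires."""
    path = tmp_path / "config.json"
    path.write_text(json.dumps({"retries": 3}), encoding="utf-8")
    return path


def test_load_config(config_file):
    # Without the fixture above, this test would fail with FileNotFoundError
    # instead of exercising the function's logic -- the "fixture gap".
    assert load_config(config_file)["retries"] == 3
```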
Related papers
- PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering [71.15346406323827]
We introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification. We find that current verifiers frequently fail to detect derivation flaws. We propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME.
arXiv Detail & Related papers (2026-02-12T04:45:01Z)
- Synthesizing File-Level Data for Unit Test Generation with Chain-of-Thoughts via Self-Debugging [40.29934051200609]
We propose a novel data-distillation approach to produce high-quality UT training data. We apply this pipeline to a large corpus of open-source projects. An empirical evaluation shows that the fine-tuned model achieves high UT generation effectiveness.
arXiv Detail & Related papers (2026-02-03T06:52:54Z)
- The Rise of Agentic Testing: Multi-Agent Systems for Robust Software Quality Assurance [0.0]
Current AI-based test generators produce invalid, redundant, or non-executable tests due to a lack of execution-aware feedback. This paper introduces a closed-loop, self-correcting system in which a Test Generation Agent, an Execution and Analysis Agent, and a Review and Optimization Agent collaboratively generate, execute, analyze, and refine tests.
arXiv Detail & Related papers (2026-01-05T18:20:14Z)
- KTester: Leveraging Domain and Testing Knowledge for More Effective LLM-based Test Generation [36.93577367023509]
This paper presents KTester, a novel framework that integrates project-specific knowledge and testing domain knowledge. We evaluate KTester on multiple open-source projects, comparing it against state-of-the-art LLM-based baselines. Results demonstrate that KTester significantly outperforms existing methods across six key metrics.
arXiv Detail & Related papers (2025-11-18T07:57:58Z)
- Unit Test Update through LLM-Driven Context Collection and Error-Type-Aware Refinement [5.8748750353007635]
Test maintenance methods primarily focus on repairing broken tests, neglecting the scenario of enhancing existing tests to verify new functionality. We propose TESTUPDATER, a novel approach that enables automated just-in-time test updates in response to production code changes. TESTUPDATER achieves a compilation pass rate of 94.4% and a test pass rate of 86.7%, outperforming the state-of-the-art method SYNTER by 15.9% and 20.0%, respectively.
arXiv Detail & Related papers (2025-09-29T08:08:22Z)
- A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models [53.31664844941449]
ProActive Self-Refinement (PASR) is a novel method for improving large language models (LLMs). Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR.
arXiv Detail & Related papers (2025-08-18T13:07:21Z)
- PALM: Synergizing Program Analysis and LLMs to Enhance Rust Unit Test Coverage [14.702182387149547]
This paper presents PALM, an approach that leverages large language models (LLMs) to enhance the generation of high-coverage unit tests. PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints. We implement the approach and evaluate it on 15 open-source Rust crates.
arXiv Detail & Related papers (2025-06-10T17:21:21Z)
- Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation [58.85645136534301]
Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks. We propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold.
arXiv Detail & Related papers (2025-04-16T14:16:38Z)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.