Synthesizing File-Level Data for Unit Test Generation with Chain-of-Thoughts via Self-Debugging
- URL: http://arxiv.org/abs/2602.03181v1
- Date: Tue, 03 Feb 2026 06:52:54 GMT
- Title: Synthesizing File-Level Data for Unit Test Generation with Chain-of-Thoughts via Self-Debugging
- Authors: Ziyue Hua, Tianyu Chen, Yeyun Gong, Shuai Lu, Peng Cheng, Qinglin Zhu, Yibo He, Yingjie Fu, Wenpin Jiao, Wei Yang, Tao Xie
- Abstract summary: We propose a novel data-distillation approach to produce high-quality UT training data. We apply this pipeline to a large corpus of open-source projects. An empirical evaluation shows that the fine-tuned model achieves high UT generation effectiveness.
- Score: 40.29934051200609
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic unit test (UT) generation is essential for software quality assurance, but existing approaches--including symbolic execution, search-based approaches, and recent LLM-based generators--struggle to produce human-quality tests with correct, meaningful assertions and reliable chain-of-thought (CoT) explanations. We identify a gap in UT training data: repository-mined tests lack developer CoTs, while LLM-distilled CoTs are often incorrect or incomplete. To address this issue, we propose a novel data-distillation approach that uses self-debugging to produce high-quality UT training examples paired with faithful CoTs. Our approach combines (1) guided test repair, a heuristic loop (error-, failure-, and coverage-focused steps) that asks the generating model to diagnose and iteratively fix generated tests, and (2) CoT compression, which compacts original and debugging CoTs into concise explanations that directly justify correct tests. We apply this pipeline to a large corpus of open-source projects to construct a dataset of 74,518 high-quality <focal method, test, CoT> examples, and then use it for supervised fine-tuning of a base model. An empirical evaluation shows that the fine-tuned model achieves high UT generation effectiveness: it attains a pass rate of 36.17% on test assertions, a branch coverage of 43.90%, and a mutation score of 88.66%, substantially higher than state-of-the-art commercial models such as o4-mini.
Related papers
- CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z) - Self-Improving LLM Agents at Test-Time [49.9396634315896]
One paradigm of language model (LM) fine-tuning relies on creating large training datasets. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive. We study two variants of this approach: Test-Time Self-Improvement (TT-SI) and Test-Time Distillation (TT-D).
arXiv Detail & Related papers (2025-10-09T06:37:35Z) - Clarifying Semantics of In-Context Examples for Unit Test Generation [16.066591207494046]
We propose CLAST, a technique that systematically refines unit tests to improve their semantic clarity. CLAST largely outperforms UTgen, the state-of-the-art refinement technique, in both preserving test effectiveness and enhancing semantic clarity. Over 85.33% of participants in our user study preferred the semantic clarity of CLAST-refined tests.
arXiv Detail & Related papers (2025-10-02T13:15:40Z) - PALM: Synergizing Program Analysis and LLMs to Enhance Rust Unit Test Coverage [14.702182387149547]
This paper presents PALM, an approach that leverages large language models (LLMs) to enhance the generation of high-coverage unit tests. PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints. We implement the approach and evaluate it on 15 open-source Rust crates.
arXiv Detail & Related papers (2025-06-10T17:21:21Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - Improving Deep Assertion Generation via Fine-Tuning Retrieval-Augmented Pre-trained Language Models [20.71745514142851]
RetriGen is a retrieval-augmented deep assertion generation approach. We conduct experiments to evaluate RetriGen against six state-of-the-art approaches.
arXiv Detail & Related papers (2025-02-22T04:17:04Z) - Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z) - Enriching Automatic Test Case Generation by Extracting Relevant Test Inputs from Bug Reports [10.587260348588064]
We introduce BRMiner, a novel approach that leverages Large Language Models (LLMs) in combination with traditional techniques to extract relevant inputs from bug reports. In this study, we evaluate BRMiner using the Defects4J benchmark and test generation tools such as EvoSuite and Randoop. Our results demonstrate that BRMiner achieves a Relevant Input Rate (RIR) of 60.03% and a Relevant Input Extraction Accuracy Rate (RIEAR) of 31.71%.
arXiv Detail & Related papers (2023-12-22T18:19:33Z) - CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
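The CodeT entry above describes choosing the best code solution by executing generated test cases. A toy version of that selection idea is sketched below: candidate solutions that pass the same set of generated tests form a consensus group, and groups are scored by group size times tests passed. This is only an illustration of the general dual-agreement principle, not CodeT's actual implementation; the `passes` callable and the function-valued solutions are hypothetical simplifications.

```python
from collections import defaultdict


def dual_agreement_select(solutions, tests, passes):
    """Pick a solution by consensus between candidate solutions and
    generated tests.

    `passes(solution, test)` runs one test against one solution and
    returns True on pass. Solutions passing the same frozenset of tests
    form a group; each group is scored by (#solutions) * (#tests passed),
    and a representative of the best group is returned.
    """
    groups = defaultdict(list)
    for sol in solutions:
        passed = frozenset(t for t in tests if passes(sol, t))
        groups[passed].append(sol)
    best_passed, best_sols = max(
        groups.items(), key=lambda kv: len(kv[1]) * len(kv[0])
    )
    return best_sols[0]  # representative of the highest-scoring group
```

For example, with three candidate implementations of "double a number" and tests given as (input, expected) pairs, the two correct candidates agree on all tests and outvote the incorrect one.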