Navigating the Labyrinth: Path-Sensitive Unit Test Generation with Large Language Models
- URL: http://arxiv.org/abs/2509.23812v2
- Date: Sat, 11 Oct 2025 07:22:01 GMT
- Title: Navigating the Labyrinth: Path-Sensitive Unit Test Generation with Large Language Models
- Authors: Dianshu Liao, Xin Yin, Shidong Pan, Chao Ni, Zhenchang Xing, Xiaoyu Sun
- Abstract summary: Unit testing is essential for software quality assurance, yet writing and maintaining tests remains time-consuming and error-prone. We present a path-sensitive framework, JUnitGenie, to fill this gap by combining code knowledge with the semantic capabilities of LLMs. We evaluate JUnitGenie on 2,258 complex focal methods from ten real-world Java projects.
- Score: 13.89267206718739
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unit testing is essential for software quality assurance, yet writing and maintaining tests remains time-consuming and error-prone. To address this challenge, researchers have proposed various techniques for automating unit test generation, including traditional heuristic-based methods and more recent approaches that leverage large language models (LLMs). However, these existing approaches are inherently path-insensitive because they rely on fixed heuristics or limited contextual information and fail to reason about deep control-flow structures. As a result, they often struggle to achieve adequate coverage, particularly for deep or complex execution paths. In this work, we present a path-sensitive framework, JUnitGenie, to fill this gap by combining code knowledge with the semantic capabilities of LLMs in guiding context-aware unit test generation. After extracting code knowledge from Java projects, JUnitGenie distills this knowledge into structured prompts to guide the generation of high-coverage unit tests. We evaluate JUnitGenie on 2,258 complex focal methods from ten real-world Java projects. The results show that JUnitGenie generates valid tests and improves branch and line coverage by 29.60% and 31.00% on average over both heuristic and LLM-based baselines. We further demonstrate that the generated test cases can uncover real-world bugs, which were later confirmed and fixed by developers.
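To make the idea concrete, here is a minimal, hypothetical illustration (not taken from the paper; JUnitGenie's actual prompts and knowledge extraction are more involved): a path-sensitive generator reasons about which branch conditions must hold together to reach a deep path, then picks inputs that satisfy that conjunction.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Hypothetical focal method with nested branches; the innermost path is the
// kind of deep control-flow target that path-insensitive generators miss.
class DiscountCalculator {
    double apply(double price, int quantity, boolean member) {
        if (price <= 0) {
            throw new IllegalArgumentException("price must be positive");
        }
        if (quantity >= 10) {
            if (member) {
                return price * quantity * 0.80; // deep path: bulk AND member
            }
            return price * quantity * 0.90;     // bulk only
        }
        return price * quantity;                // shallow default path
    }
}

class DiscountCalculatorTest {
    // Path-sensitive target: price > 0 && quantity >= 10 && member == true.
    // A structured prompt would spell out this conjunction so the LLM picks
    // inputs that actually reach the innermost branch.
    @Test
    void bulkMemberPathAppliesTwentyPercentDiscount() {
        assertEquals(80.0, new DiscountCalculator().apply(10.0, 10, true), 1e-9);
    }
}
```

A generator that samples inputs without reasoning about the branch structure would likely cover the shallow paths but miss the bulk-and-member branch.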
Related papers
- Enhancing LLM-Based Test Generation by Eliminating Covered Code [2.2566909388480743]
Large Language Models (LLMs) have shown promise in improving test generation. We propose a scalable LLM-based unit test generation method. Our approach outperforms state-of-the-art LLM-based and search-based methods.
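A minimal sketch of the general idea, under stated assumptions (the paper's exact mechanism differs; `CoveredCodeFilter` and the 1-based line numbering are illustrative): after each generation round, lines already covered by existing tests are removed from the focal method shown in the next prompt, so the LLM targets only the remaining code.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper: keep only the focal method's uncovered lines so the
// next prompt directs the LLM at code that existing tests do not reach.
final class CoveredCodeFilter {
    static String uncoveredOnly(List<String> sourceLines, Set<Integer> coveredLineNumbers) {
        return IntStream.range(0, sourceLines.size())
                .filter(i -> !coveredLineNumbers.contains(i + 1)) // coverage reports are 1-based
                .mapToObj(sourceLines::get)
                .collect(Collectors.joining("\n"));
    }
}
```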
arXiv Detail & Related papers (2026-02-25T15:16:43Z) - LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework [2.501198441875755]
AgoneTest is an evaluation framework for Large Language Model-generated unit tests in Java. For the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection.
arXiv Detail & Related papers (2025-11-25T15:33:00Z) - ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases [58.411135609139855]
"Shortcuts" to complete tasks pose significant risks for reliable assessment and deployment of large language models.<n>We introduce ImpossibleBench, a benchmark framework that measures LLM agents' propensity to exploit test cases.<n>As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool.
arXiv Detail & Related papers (2025-10-23T06:58:32Z) - Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: a System Shell Layer, a Prompt Orchestration Layer, and an LLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z) - Alignment with Fill-In-the-Middle for Enhancing Code Generation [56.791415642365415]
We propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval(+), MBPP(+), APPS, LiveCodeBench, and BigCodeBench.
arXiv Detail & Related papers (2025-08-27T03:15:53Z) - Seed&Steer: Guiding Large Language Models with Compilable Prefix and Branch Signals for Unit Test Generation [20.083515771706473]
Unit tests play a vital role in the software development lifecycle. Recent advances in Large Language Model (LLM)-based approaches have significantly improved automated test generation. We propose Seed&Steer, a two-step approach that combines traditional unit testing techniques with the capabilities of large language models.
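One plausible reading of the "compilable prefix" idea, sketched with hypothetical classes (`OrderService` and the branch signal are illustrative, not from the paper): the seed already compiles and exercises the focal method, and branch signals steer the LLM toward completing it with branch-specific inputs and assertions.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

// Minimal hypothetical focal class so the sketch compiles on its own.
class OrderService {
    boolean placeOrder(String itemId, int quantity) {
        if (quantity >= 10) {
            return false; // bulk orders need approval
        }
        return true;
    }
}

// Hypothetical "seed": a compilable prefix that constructs the focal object
// and invokes the focal method; a branch signal (e.g., "quantity >= 10 is
// still uncovered") steers the LLM's completion toward the missing branch.
class OrderServiceSeedTest {
    @Test
    void seedForPlaceOrder() {
        OrderService service = new OrderService();
        boolean accepted = service.placeOrder("item-42", 3);
        assertTrue(accepted);
        // <completion point: LLM adds branch-targeting variants, e.g. quantity = 10>
    }
}
```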
arXiv Detail & Related papers (2025-07-23T07:16:46Z) - Boosting Rust Unit Test Coverage through Hybrid Program Analysis and Large Language Models [14.536415473544146]
This paper presents PALM, an approach that leverages large language models (LLMs) to enhance the generation of high-coverage unit tests. PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints. We implement the approach and evaluate it on 10 open-source Rust crates.
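PALM itself targets Rust; to keep this digest's examples in a single language, the analogous branch-condition collection step is sketched below in Java using the JavaParser library (requires com.github.javaparser:javaparser-core; the class name and output format are assumptions).

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.body.MethodDeclaration;
import com.github.javaparser.ast.stmt.IfStmt;
import java.util.List;
import java.util.stream.Collectors;

// Collect a function's branching conditions so they can be combined into
// path constraints and embedded in the prompt (sketch of the general idea).
final class BranchConditionCollector {
    static List<String> conditions(String methodSource) {
        MethodDeclaration method = StaticJavaParser.parseMethodDeclaration(methodSource);
        return method.findAll(IfStmt.class).stream()
                .map(s -> s.getCondition().toString())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String src = "int f(int x) { if (x > 0) { if (x % 2 == 0) return 2; } return 0; }";
        // Prints [x > 0, x % 2 == 0]; a path constraint for the deep branch
        // is the conjunction x > 0 && x % 2 == 0.
        System.out.println(conditions(src));
    }
}
```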
arXiv Detail & Related papers (2025-06-10T17:21:21Z) - ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case of LLMs. ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z) - Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z) - ASTER: Natural and Multi-language Unit Test Generation with LLMs [6.259245181881262]
We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We conduct an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness.
arXiv Detail & Related papers (2024-09-04T21:46:18Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation. We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)