Large Language Models for Unit Test Generation: Achievements, Challenges, and the Road Ahead
- URL: http://arxiv.org/abs/2511.21382v1
- Date: Wed, 26 Nov 2025 13:30:11 GMT
- Title: Large Language Models for Unit Test Generation: Achievements, Challenges, and the Road Ahead
- Authors: Bei Chu, Yang Feng, Kui Liu, Zifan Nan, Zhaoqiang Guo, Baowen Xu
- Abstract summary: Unit testing is an essential yet laborious technique for verifying software. Large Language Models (LLMs) address this limitation by leveraging their data-driven knowledge of code semantics and programming patterns. A unified taxonomy-based framework analyzes the literature regarding core generative strategies and a set of enhancement techniques.
- Score: 15.43943391801509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unit testing is an essential yet laborious technique for verifying software and mitigating regression risks. Although classic automated methods effectively explore program structures, they often lack the semantic information required to produce realistic inputs and assertions. Large Language Models (LLMs) address this limitation by leveraging their data-driven knowledge of code semantics and programming patterns. To analyze the state of the art in this domain, we conducted a systematic literature review of 115 publications from May 2021 to August 2025. We propose a unified taxonomy based on the unit test generation lifecycle that treats LLMs as stochastic generators requiring systematic engineering constraints. This framework analyzes the literature regarding core generative strategies and a set of enhancement techniques ranging from pre-generation context enrichment to post-generation quality assurance. Our analysis reveals that prompt engineering has emerged as the dominant utilization strategy, accounting for 89% of the studies due to its flexibility. We find that iterative validation and repair loops have become the standard mechanism to ensure robust usability and lead to significant improvements in compilation and execution pass rates. However, critical challenges remain regarding the weak fault detection capabilities of generated tests and the lack of standardized evaluation benchmarks. We conclude with a roadmap for future research that emphasizes the progression towards autonomous testing agents and hybrid systems combining LLMs with traditional software engineering tools. This survey provides researchers and practitioners with a comprehensive perspective on converting the potential of LLMs into industrial-grade testing solutions.
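To make the dominant pattern concrete: the abstract identifies iterative validation-and-repair loops as the standard mechanism for producing usable LLM-generated tests. The sketch below illustrates that loop in Python under stated assumptions; the `ask_llm` client and both prompt templates are hypothetical placeholders rather than any surveyed paper's method, and pytest stands in for whatever test runner a target project actually uses.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical LLM client: any chat-completion API could be plugged in here.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire up a model client of your choice")

# Illustrative prompt templates (assumptions, not taken from the survey).
GEN_PROMPT = "Write pytest unit tests for this function:\n{code}"
FIX_PROMPT = ("These generated tests failed. Error output:\n{error}\n\n"
              "Tests:\n{tests}\n\nReturn a corrected test file.")

def run_tests(test_source: str) -> tuple[bool, str]:
    """Execute the generated tests with pytest and return (passed, log)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Assumes the code under test is importable from the test's location.
        test_file = Path(tmp) / "test_generated.py"
        test_file.write_text(test_source)
        result = subprocess.run(
            ["pytest", str(test_file), "-x", "-q"],
            capture_output=True, text=True,
        )
        return result.returncode == 0, result.stdout + result.stderr

def generate_with_repair(focal_code: str, max_repairs: int = 3) -> str | None:
    """Generate tests, then iteratively repair them against execution feedback."""
    tests = ask_llm(GEN_PROMPT.format(code=focal_code))
    for _ in range(max_repairs + 1):  # initial attempt plus repair rounds
        passed, log = run_tests(tests)
        if passed:
            return tests
        tests = ask_llm(FIX_PROMPT.format(error=log, tests=tests))
    return None  # repair budget exhausted; hand off to manual triage
```

Feeding runner diagnostics back into the next prompt is what drives the reported gains in compilation and execution pass rates: the loop constrains the LLM from a one-shot generator into a feedback-validated one.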
Related papers
- Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey [59.3507264893654]
Issue resolution is a complex software engineering task integral to real-world development. Benchmarks like SWE-bench have revealed this task to be profoundly difficult for large language models. This paper presents a systematic survey of this emerging domain.
arXiv Detail & Related papers (2026-01-15T18:55:03Z) - SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent, which translates high-level research objectives into standardized experimental configurations, with an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z) - A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [56.40989626804489]
This survey provides the first holistic analysis of software engineering powered by Large Language Models. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair.
arXiv Detail & Related papers (2025-10-10T06:56:50Z) - AI-Driven Tools in Modern Software Quality Assurance: An Assessment of Benefits, Challenges, and Future Directions [0.0]
The research aims to assess the benefits, challenges, and prospects of integrating modern AI-oriented tools into quality assurance processes. It demonstrates AI's transformative potential for QA but highlights the importance of a strategic approach to implementing these technologies.
arXiv Detail & Related papers (2025-06-19T20:22:47Z) - Tracking the Moving Target: A Framework for Continuous Evaluation of LLM Test Generation in Industry [0.5735035463793009]
Large Language Models (LLMs) have shown great potential in automating software testing tasks, including test generation. Their rapid evolution poses a critical challenge for companies implementing DevSecOps. This work presents a measurement framework for the continuous evaluation of commercial LLM test generators in industrial environments.
arXiv Detail & Related papers (2025-04-26T18:08:13Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - Vulnerability Detection: From Formal Verification to Large Language Models and Hybrid Approaches: A Comprehensive Overview [3.135279672650891]
This paper surveys state-of-the-art software testing and verification, focusing on three key approaches: classical formal methods, LLM-based analysis, and emerging hybrid techniques. We analyze whether integrating formal rigor with LLM-driven insights can enhance the effectiveness and scalability of software verification.
arXiv Detail & Related papers (2025-03-13T18:22:22Z) - Requirements-Driven Automated Software Testing: A Systematic Review [12.953746641112518]
This systematic literature review critically examines the current state of requirements input formats, transformation techniques, generated test artifacts, evaluation methods, and prevailing limitations. Our findings highlight the predominance of functional requirements, model-based specifications, and natural language formats. Most existing frameworks are sequential and dependent on a single intermediate representation, and while test cases, structured textual formats, and requirements coverage are common, full automation remains rare.
arXiv Detail & Related papers (2025-02-25T23:13:09Z) - Large Language Model for Qualitative Research -- A Systematic Mapping Study [3.302912592091359]
Large Language Models (LLMs), powered by advanced generative AI, have emerged as transformative tools. This study systematically maps the literature on the use of LLMs for qualitative research. Findings reveal that LLMs are utilized across diverse fields, demonstrating the potential to automate qualitative research processes.
arXiv Detail & Related papers (2024-11-18T21:28:00Z) - SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z) - Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)