Related papers: Enhancing LLM-based Test Generation for Hard-to-Cover Branches via Program Analysis

Enhancing LLM-based Test Generation for Hard-to-Cover Branches via Program Analysis

URL: http://arxiv.org/abs/2404.04966v1
Date: Sun, 7 Apr 2024 14:08:28 GMT
Title: Enhancing LLM-based Test Generation for Hard-to-Cover Branches via Program Analysis
Authors: Chen Yang, Junjie Chen, Bin Lin, Jianyi Zhou, Ziqi Wang,
Abstract summary: We propose TELPA, a novel technique to generate tests that can reach hard-to-cover branches. Our experimental results on 27 open-source Python projects demonstrate that TELPA significantly outperforms the state-of-the-art SBST and LLM-based techniques.
Score: 8.31978033489419
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automatic test generation plays a critical role in software quality assurance. While the recent advances in Search-Based Software Testing (SBST) and Large Language Models (LLMs) have shown promise in generating useful tests, these techniques still struggle to cover certain branches. Reaching these hard-to-cover branches usually requires constructing complex objects and resolving intricate inter-procedural dependencies in branch conditions, which poses significant challenges for existing test generation techniques. In this work, we propose TELPA, a novel technique aimed at addressing these challenges. Its key insight lies in extracting real usage scenarios of the target method under test to learn how to construct complex objects and extracting methods entailing inter-procedural dependencies with hard-to-cover branches to learn the semantics of branch constraints. To enhance efficiency and effectiveness, TELPA identifies a set of ineffective tests as counter-examples for LLMs and employs a feedback-based process to iteratively refine these counter-examples. Then, TELPA integrates program analysis results and counter-examples into the prompt, guiding LLMs to gain deeper understandings of the semantics of the target method and generate diverse tests that can reach the hard-to-cover branches. Our experimental results on 27 open-source Python projects demonstrate that TELPA significantly outperforms the state-of-the-art SBST and LLM-based techniques, achieving an average improvement of 31.39% and 22.22% in terms of branch coverage.

Related papers

Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
LLM Test Generation via Iterative Hybrid Program Analysis [7.121002367542988]
Panta is a technique that emulates the iterative process human developers follow when analyzing code and constructing test cases. Our empirical evaluation, conducted on classes with high cyclomatic complexity from open-source projects, demonstrates that Panta achieves 26% higher line coverage and 23% higher branch coverage.
arXiv Detail & Related papers (2025-03-17T16:10:38Z)
Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation [11.037212298533069]
Large Language Models (LLMs) have opened up new opportunities to generate tests automatically. This paper studies automatic test generation approaches based on three tools: EvoSuite for SBST, Kex for symbolic execution, and TestSpark for LLM-based test generation. Our results show that while LLM-based test generation is promising, it falls behind traditional methods in terms of coverage.
arXiv Detail & Related papers (2025-01-17T13:48:32Z)
AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs. Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
LLM-based Unit Test Generation via Property Retrieval [26.906316611858518]
Property-Based Retrieval Augmentation extends LLM-based Retrieval-Augmented Generation beyond basic vector, text similarity, and graph-based methods. Our approach considers task-specific context and introduces a tailored property retrieval mechanism. We implement this approach in a tool called APT, which sequentially performs preprocessing, property retrieval, and unit test generation.
arXiv Detail & Related papers (2024-10-17T13:33:12Z)
EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z)
HITS: High-coverage LLM-based Unit Test Generation via Method Slicing [37.43624865049592]
Large language models (LLMs) have behaved well in generating unit tests for Java projects. However, the performance for covering the complex focal methods within the projects is poor. We propose decomposing the focal methods into slices and asking the LLM to generate test cases slice by slice.
arXiv Detail & Related papers (2024-08-21T04:14:26Z)
Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.056044348209483]
Unit testing, crucial for identifying bugs in code modules like classes and methods, is often neglected by developers due to time constraints. Large Language Models (LLMs), like GPT and Mistral, show promise in software engineering, including in test generation.
arXiv Detail & Related papers (2024-06-28T20:38:41Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
RITFIS: Robust input testing framework for LLMs-based intelligent software [6.439196068684973]
RITFIS is the first framework designed to assess the robustness of intelligent software against natural language inputs. RITFIS adapts 17 automated testing methods, originally designed for Deep Neural Network (DNN)-based intelligent software. It demonstrates the effectiveness of RITFIS in evaluating LLM-based intelligent software through empirical validation.
arXiv Detail & Related papers (2024-02-21T04:00:54Z)
Robust Analysis of Multi-Task Learning Efficiency: New Benchmarks on Light-Weighed Backbones and Effective Measurement of Multi-Task Learning Challenges by Feature Disentanglement [69.51496713076253]
In this paper, we focus on the aforementioned efficiency aspects of existing MTL methods. We first carry out large-scale experiments of the methods with smaller backbones and on a the MetaGraspNet dataset as a new test ground. We also propose Feature Disentanglement measure as a novel and efficient identifier of the challenges in MTL.
arXiv Detail & Related papers (2024-02-05T22:15:55Z)
Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM [32.44432906540792]
We present SymPrompt, a code-aware prompting strategy for large language models in test generation. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
arXiv Detail & Related papers (2024-01-31T18:21:49Z)
OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs. Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
Reinforcement Learning for Branch-and-Bound Optimisation using Retrospective Trajectories [72.15369769265398]
Machine learning has emerged as a promising paradigm for branching. We propose retro branching; a simple yet effective approach to RL for branching. We outperform the current state-of-the-art RL branching algorithm by 3-5x and come within 20% of the best IL method's performance on MILPs with 500 constraints and 1000 variables.
arXiv Detail & Related papers (2022-05-28T06:08:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.