Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM
- URL: http://arxiv.org/abs/2402.00097v2
- Date: Tue, 2 Apr 2024 21:23:03 GMT
- Title: Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM
- Authors: Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, Baishakhi Ray,
- Abstract summary: We present SymPrompt, a code-aware prompting strategy for large language models in test generation.
SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2.
Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
- Score: 32.44432906540792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Testing plays a pivotal role in ensuring software quality, yet conventional Search Based Software Testing (SBST) methods often struggle with complex software units, achieving suboptimal test coverage. Recent works using large language models (LLMs) for test generation have focused on improving generation quality through optimizing the test generation context and correcting errors in model outputs, but use fixed prompting strategies that prompt the model to generate tests without additional guidance. As a result LLM-generated testsuites still suffer from low coverage. In this paper, we present SymPrompt, a code-aware prompting strategy for LLMs in test generation. SymPrompt's approach is based on recent work that demonstrates LLMs can solve more complex logical problems when prompted to reason about the problem in a multi-step fashion. We apply this methodology to test generation by deconstructing the testsuite generation process into a multi-stage sequence, each of which is driven by a specific prompt aligned with the execution paths of the method under test, and exposing relevant type and dependency focal context to the model. Our approach enables pretrained LLMs to generate more complete test cases without any additional training. We implement SymPrompt using the TreeSitter parsing framework and evaluate on a benchmark challenging methods from open source Python projects. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
Related papers
- Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models [76.17975723711886]
Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs)
In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation.
Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores.
arXiv Detail & Related papers (2025-02-20T10:25:13Z) - Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation [11.037212298533069]
Large Language Models (LLMs) have opened up new opportunities to generate tests automatically.
This paper studies automatic test generation approaches based on three tools: EvoSuite for SBST, Kex for symbolic execution, and TestSpark for LLM-based test generation.
Our results show that while LLM-based test generation is promising, it falls behind traditional methods in terms of coverage.
arXiv Detail & Related papers (2025-01-17T13:48:32Z) - AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation.
We show AIME outperforming baseline methods in code generation tasks, with up to $62%$ higher error detection rate and up to $16%$ higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z) - TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models [8.22619177301814]
We introduce TestBench, a benchmark for class-level LLM-based test case generation.
We construct a dataset of 108 Java programs from 9 real-world, large-scale projects on GitHub.
We propose a fine-grained evaluation framework that considers five aspects of test cases: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate.
arXiv Detail & Related papers (2024-09-26T06:18:06Z) - SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists [59.08999823652293]
We propose SYNTHEVAL to generate a wide range of test types for a comprehensive evaluation of NLP models.
In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the taskspecific models consistently exhibit.
We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
arXiv Detail & Related papers (2024-08-30T17:41:30Z) - HITS: High-coverage LLM-based Unit Test Generation via Method Slicing [37.43624865049592]
Large language models (LLMs) have behaved well in generating unit tests for Java projects.
However, the performance for covering the complex focal methods within the projects is poor.
We propose decomposing the focal methods into slices and asking the LLM to generate test cases slice by slice.
arXiv Detail & Related papers (2024-08-21T04:14:26Z) - Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.056044348209483]
Unit testing, crucial for identifying bugs in code modules like classes and methods, is often neglected by developers due to time constraints.
Large Language Models (LLMs), like GPT and Mistral, show promise in software engineering, including in test generation.
arXiv Detail & Related papers (2024-06-28T20:38:41Z) - Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.
We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
We study how well Large Language Models can generate high-quality test cases.
We propose a multi-agent framework called emphTestChain that decouples the generation of test inputs and test outputs.
Our results indicate that TestChain outperforms the baseline by a large margin.
arXiv Detail & Related papers (2024-04-20T10:27:01Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks.
FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.