A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability
- URL: http://arxiv.org/abs/2502.02866v1
- Date: Wed, 05 Feb 2025 03:51:44 GMT
- Title: A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability
- Authors: Hung-Fu Chang, Mohammad Shokrolah Shirazi
- Abstract summary: We propose the Generated Benchmark from Control-Flow Structure and Variable Usage Composition (GBCV) approach to evaluate large language models (LLMs). By leveraging basic control-flow structures and variable usage, GBCV provides a flexible framework to create a spectrum of programs ranging from simple to complex. Our findings indicate that GPT-4o performs better on complex program structures, while all models effectively detect boundary values in simple conditions but face challenges with arithmetic computations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software testing ensures the quality and reliability of software products, but manual test case creation is labor-intensive. With the rise of large language models (LLMs), there is growing interest in using LLMs for unit test creation. However, effective assessment of LLM-generated test cases is limited by the lack of standardized benchmarks that comprehensively cover diverse programming scenarios. To assess LLMs' test case generation ability and address the lack of evaluation datasets, we propose the Generated Benchmark from Control-Flow Structure and Variable Usage Composition (GBCV) approach, which systematically generates programs for evaluating LLMs' test generation capabilities. By leveraging basic control-flow structures and variable usage, GBCV provides a flexible framework to create a spectrum of programs ranging from simple to complex. Because GPT-4o and GPT-3-Turbo are publicly accessible models, we use GBCV to assess their performance, reflecting the use case of a typical real-world user. Our findings indicate that GPT-4o performs better on complex program structures, while all models effectively detect boundary values in simple conditions but face challenges with arithmetic computations. This study highlights the strengths and limitations of LLMs in test generation, provides a benchmark framework, and suggests directions for future improvement.
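To make the GBCV idea concrete, the sketch below shows one plausible way to compose basic control-flow structures with variable-usage statements into benchmark programs of scalable complexity. This is a minimal illustration under stated assumptions: the template strings, value ranges, and the `generate_program` helper are hypothetical and are not taken from the paper's implementation.

```python
import random

# Minimal sketch (assumption, not the authors' GBCV code): compose simple
# control-flow structures (if-branches) with variable-usage statements
# (arithmetic updates) into small benchmark programs whose complexity
# grows with the number of composed blocks.

CONDITIONS = ["x > {t}", "x < {t}", "x == {t}"]          # boundary conditions
UPDATES = ["y = y + {c}", "y = y * {c}", "y = x - {c}"]  # variable usages


def generate_program(num_blocks: int, seed: int = 0) -> str:
    """Return the source of a small function built from `num_blocks`
    randomly chosen condition/update pairs."""
    rng = random.Random(seed)
    lines = ["def f(x):", "    y = 0"]
    for _ in range(num_blocks):
        cond = rng.choice(CONDITIONS).format(t=rng.randint(-10, 10))
        update = rng.choice(UPDATES).format(c=rng.randint(1, 5))
        lines.append(f"    if {cond}:")
        lines.append(f"        {update}")
    lines.append("    return y")
    return "\n".join(lines)


if __name__ == "__main__":
    # A one-block program is "simple"; larger num_blocks yields more
    # branches for an LLM's generated unit tests to cover.
    print(generate_program(num_blocks=2, seed=42))
```

Programs generated this way have known thresholds in their conditions, so it is straightforward to check automatically whether an LLM's generated tests probe the boundary values and whether its expected outputs match the arithmetic the program actually performs.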
Related papers
- An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning [52.29223403698673]
This paper examines the use of Conformal Language Modelling (CLM) alongside Answer Set Programming (ASP).
We apply CLM to generate sets of ASP programs from an LLM, providing statistical guarantees on the correctness of the outputs.
Experimental results show that CLM significantly outperforms baseline models that use standard sampling methods.
arXiv Detail & Related papers (2025-03-07T14:10:10Z) - Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation [11.037212298533069]
Large Language Models (LLMs) have opened up new opportunities to generate tests automatically.
This paper studies automatic test generation approaches based on three tools: EvoSuite for SBST, Kex for symbolic execution, and TestSpark for LLM-based test generation.
Our results show that while LLM-based test generation is promising, it falls behind traditional methods in terms of coverage.
arXiv Detail & Related papers (2025-01-17T13:48:32Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs.
Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets.
We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models [8.22619177301814]
We introduce TestBench, a benchmark for class-level LLM-based test case generation.
We construct a dataset of 108 Java programs from 9 real-world, large-scale projects on GitHub.
We propose a fine-grained evaluation framework that considers five aspects of test cases: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate.
arXiv Detail & Related papers (2024-09-26T06:18:06Z) - Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.056044348209483]
Unit testing, crucial for identifying bugs in code modules like classes and methods, is often neglected by developers due to time constraints.
Large Language Models (LLMs), like GPT and Mistral, show promise in software engineering, including in test generation.
arXiv Detail & Related papers (2024-06-28T20:38:41Z) - On the Evaluation of Large Language Models in Unit Test Generation [16.447000441006814]
Unit testing is an essential activity in software development for verifying the correctness of software components.
The emergence of Large Language Models (LLMs) offers a new direction for automating unit test generation.
arXiv Detail & Related papers (2024-06-26T08:57:03Z) - Automatic benchmarking of large multimodal models via iterative experiment programming [71.78089106671581]
We present APEx, the first framework for automatic benchmarking of LMMs.
Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand.
APEx compiles its findings into a report that drives the testing procedure: based on the current status of the investigation, it chooses which experiments to perform and whether the results are sufficient to draw conclusions.
arXiv Detail & Related papers (2024-06-18T06:43:46Z) - Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM [32.44432906540792]
We present SymPrompt, a code-aware prompting strategy for large language models in test generation.
SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2.
Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
arXiv Detail & Related papers (2024-01-31T18:21:49Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)