Understanding the Characteristics of LLM-Generated Property-Based Tests in Exploring Edge Cases
- URL: http://arxiv.org/abs/2510.25297v1
- Date: Wed, 29 Oct 2025 09:07:03 GMT
- Title: Understanding the Characteristics of LLM-Generated Property-Based Tests in Exploring Edge Cases
- Authors: Hidetake Tanaka, Haruto Tanaka, Kazumasa Shimari, Kenichi Matsumoto
- Abstract summary: This research investigates the characteristics of LLM-generated Property-based Testing (PBT) compared to EBT for exploring edge cases. We analyze 16 HumanEval problems where standard solutions failed on extended test cases, generating both PBT and EBT test codes. Our experimental results reveal that while each method individually achieved a 68.75% bug detection rate, combining both approaches improved detection to 81.25%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) increasingly generate code in software development, ensuring the quality of LLM-generated code has become important. Traditional testing approaches using Example-based Testing (EBT) often miss edge cases -- defects that occur at boundary values, special input patterns, or extreme conditions. This research investigates the characteristics of LLM-generated Property-based Testing (PBT) compared to EBT for exploring edge cases. We analyze 16 HumanEval problems where standard solutions failed on extended test cases, generating both PBT and EBT test codes using Claude-4-sonnet. Our experimental results reveal that while each method individually achieved a 68.75% bug detection rate, combining both approaches improved detection to 81.25%. The analysis demonstrates complementary characteristics: PBT effectively detects performance issues and edge cases through extensive input space exploration, while EBT effectively detects specific boundary conditions and special patterns. These findings suggest that a hybrid approach leveraging both testing methods can improve the reliability of LLM-generated code, providing guidance for test generation strategies in LLM-based code generation.
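To make the PBT/EBT contrast concrete, here is a minimal, hypothetical sketch (the function and tests are illustrative, not taken from the paper's artifacts) pairing an example-based pytest test with a Hypothesis property-based test for a HumanEval-style function that fails on the empty-list edge case:

```python
# Hypothetical illustration of EBT vs. PBT; not from the paper's artifacts.
from hypothesis import given, strategies as st


def mean_absolute_deviation(numbers: list[float]) -> float:
    """HumanEval-style solution with an edge-case bug: it divides by
    len(numbers), so it raises ZeroDivisionError on the empty list."""
    mean = sum(numbers) / len(numbers)
    return sum(abs(x - mean) for x in numbers) / len(numbers)


# EBT: checks one specific boundary pattern the test author thought of.
def test_mad_single_element():
    assert mean_absolute_deviation([1.0]) == 0.0


# PBT: Hypothesis samples the input space (including the empty list) and
# checks a general property -- a mean absolute deviation is never negative.
@given(st.lists(st.floats(allow_nan=False, allow_infinity=False)))
def test_mad_is_non_negative(numbers):
    assert mean_absolute_deviation(numbers) >= 0.0
```

Run under pytest, the example-based test passes while Hypothesis shrinks a failing input down to the empty list, surfacing the ZeroDivisionError; conversely, a hand-picked example can pin down a specific boundary pattern directly. This is the complementarity the paper quantifies.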
Related papers
- Rethinking Verification for LLM Code Generation: From Generation to Testing [44.46778801679273]
Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. We propose new multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench.
arXiv Detail & Related papers (2025-07-09T14:58:47Z) - Use Property-Based Testing to Bridge LLM Code Generation and Validation [38.25155484701058]
Large Language Models (LLMs) excel at code generation, but ensuring that their outputs are functionally correct is a persistent challenge. This paper introduces Property-Generated Solver, a novel framework that leverages Property-Based Testing (PBT) to validate high-level program properties. Property-Generated Solver employs two collaborative LLM-based agents: a Generator dedicated to code generation and iterative refinement, and a Tester that manages the PBT life-cycle. (A hypothetical sketch of such a Generator/Tester loop appears after this list.)
arXiv Detail & Related papers (2025-06-23T06:01:12Z) - Boundary Value Test Input Generation Using Prompt Engineering with LLMs: Fault Detection and Coverage Analysis [3.249891166806818]
This paper presents a framework for assessing the effectiveness of large language models (LLMs) in generating boundary value test inputs for white-box software testing. Our analysis shows the strengths and limitations of LLMs in boundary value generation, particularly in detecting common boundary-related issues. This research provides insights into the role of LLMs in boundary value testing, underscoring both their potential and areas for improvement in automated testing methods.
arXiv Detail & Related papers (2025-01-24T12:54:19Z) - Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy [0.5057850174013127]
Modern Large Language Model (LLM)-based programming agents often rely on test execution feedback to refine their generated code. This paper introduces VALTEST, a novel framework that leverages semantic entropy to automatically validate test cases generated by LLMs. Experiments show that VALTEST boosts test validity by up to 29% and improves code generation performance, as evidenced by significant increases in pass@1 scores.
arXiv Detail & Related papers (2024-11-13T00:07:32Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z) - Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
We study how well Large Language Models can generate high-quality test cases.
We propose a multi-agent framework called TestChain that decouples the generation of test inputs and test outputs.
Our results indicate that TestChain outperforms the baseline by a large margin.
arXiv Detail & Related papers (2024-04-20T10:27:01Z) - Enriching Automatic Test Case Generation by Extracting Relevant Test Inputs from Bug Reports [10.587260348588064]
We introduce BRMiner, a novel approach that leverages Large Language Models (LLMs) in combination with traditional techniques to extract relevant inputs from bug reports. In this study, we evaluate BRMiner using the Defects4J benchmark and test generation tools such as EvoSuite and Randoop. Our results demonstrate that BRMiner achieves a Relevant Input Rate (RIR) of 60.03% and a Relevant Input Extraction Accuracy Rate (RIEAR) of 31.71%.
arXiv Detail & Related papers (2023-12-22T18:19:33Z) - Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models [43.44112117935541]
Large language models (LLMs) encode potential biases while retaining disparities that can harm humans during interactions.
We propose a first-of-its-kind method that automatically generates test cases to detect LLMs' potential gender bias.
To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning.
arXiv Detail & Related papers (2023-10-17T08:56:04Z) - On Pitfalls of Test-Time Adaptation [82.8392232222119]
Test-Time Adaptation (TTA) has emerged as a promising approach for tackling the robustness challenge under distribution shifts.
We present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols.
arXiv Detail & Related papers (2023-06-06T09:35:29Z) - CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
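As forward-referenced in the Property-Generated Solver entry above, here is a minimal, hypothetical sketch of a Generator/Tester refinement loop in that spirit. The LLM helper functions are stand-ins for illustration only, not that paper's actual API.

```python
# Hypothetical sketch of a Generator/Tester PBT loop; the LLM helpers below
# are stand-ins, NOT the Property-Generated Solver paper's actual API.
import pathlib
import subprocess
import tempfile


def llm_generate_code(spec: str, feedback: str = "") -> str:
    raise NotImplementedError("stand-in for the Generator agent's LLM call")


def llm_generate_properties(spec: str) -> str:
    raise NotImplementedError("stand-in for the Tester agent's LLM call")


def property_guided_refinement(spec: str, max_rounds: int = 3) -> str:
    # Tester derives a property-based suite (e.g., Hypothesis tests) once.
    properties = llm_generate_properties(spec)
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        # Generator produces or refines a candidate solution.
        code = llm_generate_code(spec, feedback)
        with tempfile.TemporaryDirectory() as tmp:
            module = pathlib.Path(tmp) / "solution_test.py"
            module.write_text(code + "\n\n" + properties)
            result = subprocess.run(
                ["pytest", str(module), "-q"], capture_output=True, text=True
            )
        if result.returncode == 0:
            return code  # all properties hold
        # Counterexamples from the failing run drive the next refinement.
        feedback = result.stdout
    return code
```

The design point the sketch illustrates is the decoupling: properties are fixed up front, and failing runs feed counterexamples back to the code-generating agent rather than mutating the tests.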
This list is automatically generated from the titles and abstracts of the papers on this site.