Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem
- URL: http://arxiv.org/abs/2510.09907v1
- Date: Fri, 10 Oct 2025 22:43:54 GMT
- Title: Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem
- Authors: Muhammad Maaz, Liam DeVoe, Zac Hatfield-Dodds, Nicholas Carlini,
- Abstract summary: Property-based testing (PBT) is a lightweight formal method, typically implemented as a randomized testing framework. In this work, we demonstrate an LLM-based agent which analyzes Python modules, infers function-specific and cross-function properties from code and documentation, and synthesizes and executes PBTs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Property-based testing (PBT) is a lightweight formal method, typically implemented as a randomized testing framework. Users specify the input domain for their test using combinators supplied by the PBT framework, and the expected properties or invariants as a unit-test function. The framework then searches for a counterexample, e.g. by generating inputs and calling the test function. In this work, we demonstrate an LLM-based agent which analyzes Python modules, infers function-specific and cross-function properties from code and documentation, synthesizes and executes PBTs, reflects on the outputs of these tests to confirm true bugs, and finally outputs actionable bug reports for the developer. We perform an extensive evaluation of our agent across 100 popular Python packages. Of the bug reports generated by the agent, we found after manual review that 56% were valid bugs and 32% were valid bugs that we would report to maintainers. We then developed a ranking rubric to surface high-priority valid bugs to developers, and found that of the 21 top-scoring bugs, 86% were valid and 81% we would report. The bugs span diverse failure modes, from serialization failures to numerical precision errors to flawed cache implementations. We reported 5 bugs, 4 with patches, including to NumPy and cloud computing SDKs, with 3 patches merged successfully. Our results suggest that combining LLMs with PBT provides a rigorous and scalable method for autonomously testing software. Our code and artifacts are available at: https://github.com/mmaaz-git/agentic-pbt.
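The PBT workflow the abstract describes (specify an input domain, state an invariant, search for a counterexample) can be sketched in pure Python. The names `check_property` and `gen_list` below are illustrative assumptions, not code from the paper; production frameworks such as Hypothesis additionally provide input combinators and shrink counterexamples to minimal failing cases.

```python
# Hand-rolled sketch of the property-based testing loop: sample random
# inputs from a specified domain, check an invariant, and report the
# first counterexample found.
import random

def check_property(gen_input, prop, trials=200):
    """Search for a counterexample by sampling random inputs."""
    for _ in range(trials):
        x = gen_input()
        if not prop(x):
            return x  # counterexample found
    return None  # property held on every sampled input

# Input domain: lists of up to 20 small integers.
def gen_list():
    return [random.randint(-100, 100) for _ in range(random.randint(0, 20))]

# A true property: sorting is idempotent.
holds = check_property(gen_list, lambda xs: sorted(sorted(xs)) == sorted(xs))

# A deliberately false property: reversing a list yields a sorted list.
# The search almost immediately returns a concrete counterexample.
fails = check_property(gen_list, lambda xs: list(reversed(xs)) == sorted(xs))
```

The agent in the paper automates the hard parts of this loop: inferring which invariants (`prop`) a module should satisfy and which input domains (`gen_input`) exercise them.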
Related papers
- Understanding Bug-Reproducing Tests: A First Empirical Study [10.004295333072948]
We analyze 642 bug-reproducing tests of 15 real-world Python systems. We find that bug-reproducing tests are not (statistically significantly) different from other tests regarding LOC, number of assertions, and complexity. We detect that the majority (95%) of the bug-reproducing tests reproduce a single bug, while 5% reproduce multiple bugs.
arXiv Detail & Related papers (2026-02-03T01:04:18Z) - Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs. kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z) - BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
High-quality bugs are key to training the next generation of language-model-based software engineering (SWE) agents. We introduce a novel method for the synthetic generation of difficult and diverse bugs.
arXiv Detail & Related papers (2025-10-22T17:58:56Z) - Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z) - May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs [20.03968975178177]
Fuzz testing (fuzzing) is a simple yet effective way to find bugs in Deep Learning (DL) frameworks. We propose FUEL to effectively utilize feedback information; it comprises two Large Language Models (LLMs): an analysis LLM and a generation LLM. We show that FUEL improves the line coverage and execution of PyTorch by 9.15% and 14.70% over state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-21T08:51:53Z) - Type-aware LLM-based Regression Test Generation for Python Programs [13.631541369653066]
Test4Py is a novel framework that enhances type correctness in automated test generation for Python. Test4Py integrates an iterative repair procedure that progressively refines generated test cases to improve coverage. In an evaluation on 183 real-world Python modules, Test4Py achieved an average statement coverage of 83.0% and branch coverage of 70.8%.
arXiv Detail & Related papers (2025-03-18T08:07:17Z) - Can LLM Generate Regression Tests for Software Commits? [15.653758694625854]
Large Language Models (LLMs) have shown tremendous promise in automated software engineering. We implement Cleverest, a feedback-directed zero-shot LLM-based regression test generation technique. For programs using more human-readable file formats, like XML or JavaScript, Cleverest performed very well.
arXiv Detail & Related papers (2025-01-19T15:46:26Z) - LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom Large Language Models (LLMs) to generate realistic test inputs. We evaluate it against several state-of-the-art REST API testing tools, including RESTGPT, a GPT-powered specification-enhancement tool. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing.
arXiv Detail & Related papers (2025-01-15T05:51:20Z) - Evaluating Agent-based Program Repair at Google [9.62742759337993]
Agent-based program repair offers to automatically resolve complex bugs end-to-end. Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench. This paper explores the viability of using an agentic approach to address bugs in an enterprise context.
arXiv Detail & Related papers (2025-01-13T18:09:25Z) - PyPulse: A Python Library for Biosignal Imputation [58.35269251730328]
We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings. PyPulse provides a modular and extendable framework with high ease of use for a broad user base, including non-machine-learning bioresearchers. We released PyPulse under the MIT License on GitHub and PyPI.
arXiv Detail & Related papers (2024-12-09T11:00:55Z) - GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z) - An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation [3.9762912548964864]
This paper presents a large-scale empirical evaluation on the effectiveness of Large Language Models for automated unit test generation.
We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package.
We find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests.
arXiv Detail & Related papers (2023-02-13T17:13:41Z) - DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons [5.564793925574796]
We present an approach to automated debugging using large, pretrained transformers.
We start by training a bug-creation model on reversed commit data for the purpose of generating synthetic bugs.
Next, we focus on 10K repositories for which we can execute tests, and create buggy versions of all functions that are covered by passing tests.
arXiv Detail & Related papers (2021-05-19T18:40:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.