PyResBugs: A Dataset of Residual Python Bugs for Natural Language-Driven Fault Injection
- URL: http://arxiv.org/abs/2505.05777v1
- Date: Fri, 09 May 2025 04:39:09 GMT
- Title: PyResBugs: A Dataset of Residual Python Bugs for Natural Language-Driven Fault Injection
- Authors: Domenico Cotroneo, Giuseppe De Rosa, Pietro Liguori
- Abstract summary: PyResBugs is a curated dataset of residual bugs from major Python frameworks. Each bug is paired with its corresponding fault-free (fixed) version and annotated with multi-level natural language (NL) descriptions.
- Score: 5.383910843560784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents PyResBugs, a curated dataset of residual bugs, i.e., defects that persist undetected during traditional testing but later surface in production, collected from major Python frameworks. Each bug in the dataset is paired with its corresponding fault-free (fixed) version and annotated with multi-level natural language (NL) descriptions. These NL descriptions enable natural language-driven fault injection, offering a novel approach to simulating real-world faults in software systems. By bridging the gap between software fault injection techniques and real-world representativeness, PyResBugs provides researchers with a high-quality resource for advancing AI-driven automated testing in Python systems.
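The abstract describes each dataset entry as a buggy/fixed code pair with multi-level NL descriptions. A minimal sketch of how such an entry might be modeled follows; the field names and the example bug are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical representation of one PyResBugs-style entry.
# All field names here are assumptions for illustration only.
@dataclass
class ResidualBugEntry:
    framework: str          # Python project the bug was collected from
    buggy_code: str         # code containing the residual bug
    fixed_code: str         # corresponding fault-free (fixed) version
    nl_descriptions: dict   # multi-level natural language descriptions

entry = ResidualBugEntry(
    framework="example-framework",
    buggy_code="def area(r):\n    return 3.14 * r * 2  # bug: doubles r",
    fixed_code="def area(r):\n    return 3.14 * r ** 2",
    nl_descriptions={
        "high-level": "Incorrect area computation.",
        "low-level": "The radius is doubled instead of squared.",
    },
)
```

Pairing each buggy snippet with its fix and layered descriptions is what enables NL-driven fault injection: a model can be prompted with a description at the desired abstraction level to reintroduce a realistic fault.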
Related papers
- LLPut: Investigating Large Language Models for Bug Report-Based Input Generation [0.0]
Failure-inducing inputs play a crucial role in diagnosing and analyzing software bugs. Prior research has leveraged various Natural Language Processing (NLP) techniques for automated input extraction. With the advent of Large Language Models (LLMs), an important research question arises: how effectively can generative LLMs extract failure-inducing inputs from bug reports?
arXiv Detail & Related papers (2025-03-26T14:25:01Z)
- PyPulse: A Python Library for Biosignal Imputation [58.35269251730328]
We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings. PyPulse provides a modular and extendable framework with high ease-of-use for a broad userbase, including non-machine-learning bioresearchers. We released PyPulse under the MIT License on GitHub and PyPI.
arXiv Detail & Related papers (2024-12-09T11:00:55Z)
- Leveraging Large Language Models in Code Question Answering: Baselines and Issues [0.1617522438111378]
This paper presents a work devoted to using large language models for question answering over source code in Python.
The proposed method for implementing a source code question answering system involves fine-tuning a large language model on a unified dataset of questions and answers for Python code.
We report BLEU-4, BERTScore F1, BLEURT, and Exact Match metric values, along with the conclusions from the manual error analysis.
arXiv Detail & Related papers (2024-11-05T11:25:12Z)
- On Leakage of Code Generation Evaluation Datasets [44.4726918027046]
We consider contamination by code generation test sets, in particular in their use in modern large language models.
To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions.
arXiv Detail & Related papers (2024-07-10T11:50:20Z)
- SBFT Tool Competition 2024 -- Python Test Case Generation Track [4.149356993529412]
Test case generation (TCG) for Python poses distinctive challenges due to the language's dynamic nature and the absence of strict type information.
Previous research has successfully explored automated unit TCG for Python, with solutions outperforming random test generation methods.
This paper describes our methodology, the analysis of the results together with the competing tools, and the challenges faced while running the competition experiments.
arXiv Detail & Related papers (2024-01-26T20:21:15Z)
- A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
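The code-to-NL-and-back pipeline described above can be sketched as two chained translation calls. In this sketch, `translate` stands in for a generic LLM translation call and is stubbed with a lookup table purely so the example runs; the function names and strings are assumptions, not the paper's implementation.

```python
# Stub standing in for an LLM translation model; maps (text, target
# language) to a canned output so the round trip is demonstrable.
TRANSLATIONS = {
    ("buggy code", "nl"): "a function that sums a list of numbers",
    ("a function that sums a list of numbers", "python"): "repaired code",
}

def translate(text: str, target: str) -> str:
    """Hypothetical translation call (would be an LLM in practice)."""
    return TRANSLATIONS[(text, target)]

def round_trip_repair(code: str) -> str:
    # Step 1: translate the (possibly buggy) code into natural language,
    # which abstracts away the surface-level defect.
    description = translate(code, "nl")
    # Step 2: translate the description back into code, regenerating a
    # candidate program that ideally no longer contains the bug.
    return translate(description, "python")
```

The key design idea is that the intermediate representation (natural or another programming language) acts as a regularizer: defects that do not survive translation are dropped on the way back.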
arXiv Detail & Related papers (2024-01-15T22:36:31Z)
- An Empirical Study of Fault Localization in Python Programs [4.366130138560774]
This paper is the first multi-family large-scale empirical study of fault localization on real-world Python programs and faults.
We use Zou et al.'s recent large-scale empirical study of fault localization in Java as the basis of our study.
The results replicate for Python several results known about Java, and shed light on whether Python's peculiarities affect the capabilities of fault localization.
arXiv Detail & Related papers (2023-05-31T13:21:30Z)
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the objective LM, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
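The concatenation described above can be illustrated with a small helper that assembles one fine-tuning example from an instruction, a generated program, and textual feedback. The delimiters and labels are illustrative assumptions, not the paper's exact serialization format.

```python
def build_training_example(instruction: str, program: str, feedback: str) -> str:
    """Assemble a LETI-style fine-tuning input by concatenating the
    natural language instruction, the LM-generated program, and the
    textual feedback (labels are hypothetical)."""
    return "\n".join([
        f"Instruction: {instruction}",
        f"Program:\n{program}",
        f"Feedback: {feedback}",
    ])

example = build_training_example(
    "Write a function that returns the square of x.",
    "def square(x):\n    return x * x",
    "All tests passed.",
)
```

Because the feedback is free-form text rather than a binary label, it can pinpoint and explain specific errors, which is what the iterative fine-tuning loop exploits.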
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.