BugsInPy: A Database of Existing Bugs in Python Programs to Enable
Controlled Testing and Debugging Studies
- URL: http://arxiv.org/abs/2401.15481v1
- Date: Sat, 27 Jan 2024 19:07:34 GMT
- Title: BugsInPy: A Database of Existing Bugs in Python Programs to Enable
Controlled Testing and Debugging Studies
- Authors: Ratnadira Widyasari, Sheng Qin Sim, Camellia Lok, Haodi Qi, Jack Phan,
Qijin Tay, Constance Tan, Fiona Wee, Jodie Ethelda Tan, Yuheng Yieh, Brian
Goh, Ferdian Thung, Hong Jin Kang, Thong Hoang, David Lo, Eng Lieh Ouh
- Abstract summary: For the first time, Python outperformed Java in the Stack Overflow developer survey.
Despite this, few testing and debugging tools are designed for Python, in stark contrast with the abundance of such tools for Java.
In this project, we create a benchmark database and tool that contain 493 real bugs from 17 real-world Python programs.
- Score: 8.746971239693066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The 2019 edition of Stack Overflow developer survey highlights that, for the
first time, Python outperformed Java in terms of popularity. The gap between
Python and Java further widened in the 2020 edition of the survey.
Unfortunately, despite the rapid increase in Python's popularity, there are not
many testing and debugging tools that are designed for Python. This is in stark
contrast with the abundance of testing and debugging tools for Java. Thus,
there is a need to push research on tools that can help Python developers. One
factor that contributed to the rapid growth of Java testing and debugging tools
is the availability of benchmarks. A popular benchmark is the Defects4J
benchmark; its initial version contained 357 real bugs from 5 real-world Java
programs. Each bug comes with a test suite that can expose the bug. Defects4J
has been used by hundreds of testing and debugging studies and has helped to
push the frontier of research in these directions. In this project, inspired by
Defects4J, we create another benchmark database and tool that contain 493 real
bugs from 17 real-world Python programs. We hope our benchmark can help
catalyze future work on testing and debugging tools that work on Python
programs.
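To make the intended workflow concrete, the sketch below shows how a bug from such a benchmark is typically consumed: check out the buggy or fixed revision, install the pinned dependencies, and run the test(s) that expose the bug. The command names (bugsinpy-checkout, bugsinpy-compile, bugsinpy-test) follow the BugsInPy repository's command-line interface, but the exact flags, checkout directory layout, exit-code behavior, and the example project and bug id are assumptions for illustration and should be verified against the tool itself.

```python
import subprocess
from pathlib import Path

# Scratch directory for checked-out bugs (hypothetical path).
WORK_DIR = Path("/tmp/bugsinpy-work")


def run_bug(project: str, bug_id: int, buggy: bool = True) -> bool:
    """Check out one benchmark bug and run the test suite that exposes it.

    Returns True if the exposing tests pass (expected on the fixed revision)
    and False if they fail (expected on the buggy revision). Assumes the
    BugsInPy scripts are on PATH and that a non-zero exit code from
    `bugsinpy-test` indicates a failing test.
    """
    version = "0" if buggy else "1"          # 0 = buggy, 1 = fixed
    checkout = WORK_DIR / f"{project}-{bug_id}-{version}"

    # Check out the requested revision of the project for this bug.
    subprocess.run(
        ["bugsinpy-checkout", "-p", project, "-i", str(bug_id),
         "-v", version, "-w", str(checkout)],
        check=True,
    )
    project_dir = checkout / project          # assumed checkout layout
    # Install the bug-specific, pinned dependencies.
    subprocess.run(["bugsinpy-compile"], cwd=project_dir, check=True)
    # Run only the test case(s) recorded as exposing this bug.
    result = subprocess.run(["bugsinpy-test"], cwd=project_dir)
    return result.returncode == 0


if __name__ == "__main__":
    # Illustrative only: the exposing test should fail on the buggy revision
    # and pass once the developer's fix is applied.
    assert run_bug("youtube-dl", 1, buggy=True) is False
    assert run_bug("youtube-dl", 1, buggy=False) is True
```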
Related papers
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- ChatDBG: An AI-Powered Debugging Assistant [0.0]
ChatDBG lets programmers engage in a collaborative dialogue with the debugger.
It can perform root cause analysis for crashes or assertion failures.
ChatDBG has seen rapid uptake; it has already been downloaded roughly 50,000 times.
arXiv Detail & Related papers (2024-03-25T01:12:57Z)
- GitBug-Java: A Reproducible Benchmark of Recent Java Bugs [8.508198765617196]
We present GitBug-Java, a reproducible benchmark of recent Java bugs.
GitBug-Java features 199 bugs extracted from the 2023 commit history of 55 notable open-source repositories.
arXiv Detail & Related papers (2024-02-05T12:40:41Z)
- SBFT Tool Competition 2024 -- Python Test Case Generation Track [4.149356993529412]
Test case generation (TCG) for Python poses distinctive challenges due to the language's dynamic nature and the absence of strict type information.
Previous research has successfully explored automated unit TCG for Python, with solutions outperforming random test generation methods.
This paper describes our methodology, the analysis of the results together with the competing tools, and the challenges faced while running the competition experiments.
arXiv Detail & Related papers (2024-01-26T20:21:15Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- Causal-learn: Causal Discovery in Python [53.17423883919072]
Causal discovery aims at revealing causal relations from observational data.
causal-learn is an open-source Python library for causal discovery (a minimal usage sketch follows this list).
arXiv Detail & Related papers (2023-07-31T05:00:35Z)
- Tests4Py: A Benchmark for System Testing [11.051969638361012]
The Tests4Py benchmark includes 73 bugs from seven real-world Python applications and six bugs from example programs.
Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation.
arXiv Detail & Related papers (2023-07-11T10:04:52Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation [70.96868419971756]
DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries.
First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow.
Second, our automatic evaluation is highly specific (reliable) -- across all Codex-predicted solutions that our evaluation accepts, only 1.8% are incorrect.
arXiv Detail & Related papers (2022-11-18T17:20:27Z)
- PyGOD: A Python Library for Graph Outlier Detection [56.33769221859135]
PyGOD is an open-source library for detecting outliers in graph data.
It supports a wide array of leading graph-based methods for outlier detection.
PyGOD is released under a BSD 2-Clause license at https://pygod.org and at the Python Package Index (PyPI).
arXiv Detail & Related papers (2022-04-26T06:15:21Z)
- DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons [5.564793925574796]
We present an approach to automated debugging using large, pretrained transformers.
We start by training a bug-creation model on reversed commit data for the purpose of generating synthetic bugs.
Next, we focus on 10K repositories for which we can execute tests, and create buggy versions of all functions that are covered by passing tests.
arXiv Detail & Related papers (2021-05-19T18:40:16Z)
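As referenced in the causal-learn entry above, here is a minimal usage sketch of that library. The import path and the `pc` entry point follow causal-learn's documented API, but the toy data, variable names, and reliance on default settings are illustrative assumptions rather than an excerpt from the paper.

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

# Toy linear chain X -> Y -> Z; PC should recover the X - Y - Z skeleton.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 2.0 * x + 0.5 * rng.normal(size=2000)
z = -1.5 * y + 0.5 * rng.normal(size=2000)
data = np.column_stack([x, y, z])   # shape: (n_samples, n_variables)

cg = pc(data)        # constraint-based PC search with default settings
print(cg.G)          # estimated causal graph over the three variables
```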