PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection
- URL: http://arxiv.org/abs/2309.06229v4
- Date: Sat, 9 Mar 2024 22:43:56 GMT
- Title: PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection
- Authors: He Ye, Zimin Chen and Claire Le Goues
- Abstract summary: We introduce PreciseBugCollector, a precise, multi-language bug collection approach.
It is based on two novel components: a bug tracker that maps codebase repositories to external bug repositories to trace bug type information, and a bug injector that generates project-specific bugs.
To date, PreciseBugCollector comprises 1,057,818 bugs extracted from 2,968 open-source projects.
- Score: 8.79879909193717
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bug datasets are vital for enabling deep learning techniques to address
software maintenance tasks related to bugs. However, existing bug datasets
suffer from limitations in precision and scale: they are either small-scale but
precise, thanks to manual validation, or large-scale but imprecise, relying on
simple commit message processing. In this paper, we introduce PreciseBugCollector, a precise,
multi-language bug collection approach that overcomes these two limitations.
PreciseBugCollector is based on two novel components: a) a bug tracker that
maps codebase repositories to external bug repositories to trace bug type
information, and b) a bug injector that generates project-specific bugs by
injecting noise into correct codebases and then executing them against their
test suites to obtain test failure messages.
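The injection loop lends itself to a short sketch. The Python below is a minimal illustration of the idea just described, not the authors' implementation: the comparison-operator rule, the use of pytest as the test runner, and all names are assumptions.

```python
import random
import subprocess
from pathlib import Path

def inject_noise(source: str) -> str:
    """Hypothetical injection rule (an assumption): flip one comparison operator."""
    flips = {"==": "!=", "<=": "<", ">=": ">"}
    candidates = [op for op in flips if op in source]
    if not candidates:
        return source  # nothing to mutate in this file
    op = random.choice(candidates)
    return source.replace(op, flips[op], 1)

def collect_bug(repo: Path, target: Path) -> dict | None:
    """Mutate one file, run the project's tests, keep the mutant only if tests fail."""
    original = target.read_text()
    target.write_text(inject_noise(original))
    try:
        # Execute the project's own test suite against the mutated codebase
        # (pytest here is an assumption; any test runner fits the scheme).
        result = subprocess.run(["pytest", "-x"], cwd=repo,
                                capture_output=True, text=True)
        if result.returncode != 0:
            # The test failure message precisely characterizes the injected bug.
            return {"file": str(target), "failure": result.stdout[-2000:]}
        return None  # mutant passes the tests: it is not an observable bug
    finally:
        target.write_text(original)  # always restore the correct code
```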
We implement PreciseBugCollector against three sources: 1) a bug tracker that
links to the National Vulnerability Database (NVD) to collect general
vulnerabilities, 2) a bug tracker that links to OSS-Fuzz to collect general
bugs, and 3) a bug injector based on 16 injection rules to generate
project-specific bugs. To date, PreciseBugCollector comprises 1,057,818 bugs
extracted from 2,968 open-source projects. Of these, 12,602 bugs are sourced
from bug repositories (NVD and OSS-Fuzz), while the remaining 1,045,216
project-specific bugs are generated by the bug injector. Considering the
challenge objectives, we argue that a bug injection approach is highly valuable
for the industrial setting, since project-specific bugs align with domain
knowledge, share the same codebase, and adhere to the coding style employed in
industrial projects.
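To make "injection rule" concrete, here are a few illustrative mutation-style rules in the same spirit. The paper's actual 16 rules are not reproduced here, so the specific rewrites below are assumptions for illustration only.

```python
import re

# Illustrative stand-ins for injection rules (the paper defines 16; these
# particular rules are assumptions, not taken from the paper). Each rule
# rewrites correct code into a plausible, project-specific bug.
INJECTION_RULES = {
    "flip_equality": lambda s: s.replace("==", "!=", 1),
    "swap_and_or":   lambda s: s.replace(" and ", " or ", 1),
    "drop_negation": lambda s: s.replace("not ", "", 1),
    "off_by_one":    lambda s: re.sub(r"range\((\w+)\)", r"range(\1 - 1)", s, count=1),
}

def apply_rule(name: str, source: str) -> str:
    """Apply one named rule; a rule that changes nothing produced no bug."""
    mutated = INJECTION_RULES[name](source)
    if mutated == source:
        raise ValueError(f"rule {name!r} did not apply to this source")
    return mutated

# Example: apply_rule("flip_equality", "if x == y: pass") yields "if x != y: pass".
```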
Related papers
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- Automated Bug Generation in the era of Large Language Models [6.0770779409377775]
BugFarm transforms arbitrary code into multiple complex bugs.
The authors present a comprehensive evaluation of 435k+ bugs from over 1.9M mutants generated by BugFarm.
arXiv Detail & Related papers (2023-10-03T20:01:51Z)
- Large Language Models of Code Fail at Completing Code with Potential Bugs [30.80172644795715]
We study the buggy-code completion problem inspired by real-time code suggestion.
We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs.
arXiv Detail & Related papers (2023-06-06T06:35:27Z)
- WELL: Applying Bug Detectors to Bug Localization via Weakly Supervised Learning [37.09621161662761]
This paper proposes a WEakly supervised bug LocaLization (WELL) method to train a bug localization model.
With CodeBERT fine-tuned on buggy-or-not binary-labeled data, WELL addresses bug localization in a weakly supervised manner.
arXiv Detail & Related papers (2023-05-27T06:34:26Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Explaining Software Bugs Leveraging Code Structures in Neural Machine Translation [5.079750706023254]
Bugsplainer generates natural language explanations for software bugs by learning from a large corpus of bug-fix commits.
Our evaluation using three performance metrics shows that Bugsplainer can generate understandable and good explanations according to Google's standard.
We also conduct a developer study involving 20 participants where the explanations from Bugsplainer were found to be more accurate, more precise, more concise and more useful than the baselines.
arXiv Detail & Related papers (2022-12-08T22:19:45Z)
- Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
- DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
arXiv Detail & Related papers (2022-01-14T00:16:57Z)
- Advaita: Bug Duplicity Detection System [1.9624064951902522]
The duplicate bug rate (% of duplicate bugs) ranges from single digits (1 to 9%) to double digits (40%), depending on product maturity, codebase size, and the number of engineers working on the project.
Detecting duplicates involves identifying whether any two bug reports convey the same meaning.
The approach considers multiple sets of features, namely basic textual statistical features, semantic features, and contextual features.
arXiv Detail & Related papers (2020-01-24T04:48:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.