GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub
Actions
- URL: http://arxiv.org/abs/2310.15642v3
- Date: Sun, 21 Jan 2024 12:01:33 GMT
- Title: GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub
Actions
- Authors: Nuno Saavedra, André Silva, Martin Monperrus
- Abstract summary: We present GitBug-Actions, a novel tool for building bug-fix benchmarks with modern and fully-reproducible bug-fixes.
GitBug-Actions relies on the most popular CI platform, GitHub Actions, to detect bug-fixes.
To demonstrate our toolchain, we deploy GitBug-Actions to build a proof-of-concept Go bug-fix benchmark.
- Score: 8.508198765617196
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Bug-fix benchmarks are fundamental in advancing various sub-fields of
software engineering such as automatic program repair (APR) and fault
localization (FL). A good benchmark must include recent examples that
accurately reflect technologies and development practices of today. To be
executable in the long term, a benchmark must feature test suites that do not
degrade over time due to, for example, dependencies that are no longer
available. Existing benchmarks fail to meet both criteria. For instance,
Defects4J, one of the foremost Java benchmarks, last received an update in
2020. Moreover, full reproducibility has been neglected by the majority of
existing benchmarks. In this paper, we present GitBug-Actions: a novel tool for
building bug-fix benchmarks with modern and fully-reproducible bug-fixes.
GitBug-Actions relies on the most popular CI platform, GitHub Actions, to
detect bug-fixes and execute the CI pipeline locally in a controlled
and reproducible environment. To the best of our knowledge, we are the first to
rely on GitHub Actions to collect bug-fixes. To demonstrate our toolchain, we
deploy GitBug-Actions to build a proof-of-concept Go bug-fix benchmark
containing executable, fully-reproducible bug-fixes from different
repositories. A video demonstrating GitBug-Actions is available at:
https://youtu.be/aBWwa1sJYBs.
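To make the collection loop concrete, below is a minimal Go sketch of the core idea, assuming local CI execution is delegated to the act runner (https://github.com/nektos/act); the repository path and commit ID are hypothetical placeholders. A commit qualifies as a reproducible bug-fix candidate when the CI pipeline fails on its parent and passes on the commit itself.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runCI checks out a revision and runs the repository's GitHub Actions
// workflows locally, reporting whether the pipeline passed.
func runCI(repo, rev string) bool {
	checkout := exec.Command("git", "checkout", rev)
	checkout.Dir = repo
	if err := checkout.Run(); err != nil {
		return false
	}
	ci := exec.Command("act") // runs the workflows in .github/workflows via Docker
	ci.Dir = repo
	return ci.Run() == nil // a nil error means exit code 0, i.e. CI passed
}

func main() {
	repo, fix := "./some-go-repo", "abc1234" // hypothetical placeholders
	// Keep the commit only if CI fails on the parent and passes on the fix.
	if !runCI(repo, fix+"^") && runCI(repo, fix) {
		fmt.Println("reproducible bug-fix candidate:", fix)
	}
}
```

Per the abstract, the actual toolchain additionally runs these pipelines in a controlled environment so that collected pairs stay executable and reproducible over time.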
Related papers
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content.
Evaluating these attacks presents a number of challenges that the current collection of benchmarks and evaluation techniques do not adequately address.
JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z)
- MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution [47.850418420195304]
Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues.
We propose a novel Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution.
arXiv Detail & Related papers (2024-03-26T17:57:57Z)
- GitBug-Java: A Reproducible Benchmark of Recent Java Bugs [8.508198765617196]
We present GitBug-Java, a reproducible benchmark of recent Java bugs.
GitBug-Java features 199 bugs extracted from the 2023 commit history of 55 notable open-source repositories.
arXiv Detail & Related papers (2024-02-05T12:40:41Z)
- RaceFixer -- An Automated Data Race Fixer [0.0]
RaceFixer automates the process of fixing one common type of bug: single-variable atomicity violations.
It tries to combine the patches for multiple bugs to improve performance and code readability.
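RaceFixer's patch templates are not detailed in this summary; as a hedged illustration of the bug class it targets, the following Go sketch shows a single-variable atomicity violation repaired by guarding the read-modify-write with a mutex (all names are hypothetical).

```go
package main

import (
	"fmt"
	"sync"
)

type Counter struct {
	mu sync.Mutex
	n  int
}

// Add is the patched version: the load-increment-store is now atomic with
// respect to other callers. The buggy version was an unguarded `c.n += delta`,
// letting concurrent goroutines interleave and lose updates.
func (c *Counter) Add(delta int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n += delta
}

func main() {
	var c Counter
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.Add(1) }()
	}
	wg.Wait()
	fmt.Println(c.n) // always 1000 with the lock; may be less without it
}
```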
arXiv Detail & Related papers (2024-01-08T20:25:14Z)
- WRTester: Differential Testing of WebAssembly Runtimes via Semantic-aware Binary Generation [19.78427170624683]
We present WRTester, a novel differential testing framework that can generate complicated Wasm test cases by disassembling and assembling real-world Wasm binaries.
To further pinpoint the root causes of unexpected behaviors, we design a runtime-agnostic root-cause localization method to accurately locate bugs.
We have uncovered 33 unique bugs in popular Wasm runtimes, among which 25 have been confirmed.
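Below is a hedged Go sketch of the differential-testing loop, assuming the runtimes under test expose CLIs such as wasmtime and wasmer, and abstracting away WRTester's semantic-aware test generation; the real harness also compares traps and execution state, not just combined output.

```go
package main

import (
	"fmt"
	"os/exec"
)

// run executes a Wasm file on the given runtime CLI and returns its combined
// output plus whether it exited successfully.
func run(runtime, wasmFile string) (string, bool) {
	out, err := exec.Command(runtime, wasmFile).CombinedOutput()
	return string(out), err == nil
}

func main() {
	file := "testcase.wasm" // produced by disassembling/reassembling real binaries
	outA, okA := run("wasmtime", file)
	outB, okB := run("wasmer", file)
	if outA != outB || okA != okB {
		fmt.Println("divergent behavior: candidate runtime bug")
	}
}
```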
arXiv Detail & Related papers (2023-12-16T14:02:42Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose RAP-Gen, a novel Retrieval-Augmented Patch Generation framework that explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
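The following Go sketch illustrates the retrieval-augmented idea under simplifying assumptions: a token-overlap retriever stands in for RAP-Gen's actual retriever, and the assembled input would be fed to a patch-generation model such as CodeT5.

```go
package main

import (
	"fmt"
	"strings"
)

type FixPair struct{ Buggy, Fixed string }

// similarity scores token-set overlap (Jaccard) between two code snippets;
// an illustrative stand-in for a real lexical/semantic retriever.
func similarity(a, b string) float64 {
	sa := map[string]bool{}
	for _, t := range strings.Fields(a) {
		sa[t] = true
	}
	sb := map[string]bool{}
	inter, nb := 0, 0
	for _, t := range strings.Fields(b) {
		if !sb[t] {
			sb[t] = true
			nb++
			if sa[t] {
				inter++
			}
		}
	}
	union := len(sa) + nb - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// buildInput prepends the best-matching fix pattern to the buggy query,
// forming the model input for patch generation.
func buildInput(query string, store []FixPair) string {
	best, bestScore := store[0], -1.0
	for _, p := range store {
		if s := similarity(query, p.Buggy); s > bestScore {
			best, bestScore = p, s
		}
	}
	return fmt.Sprintf("fix pattern:\n%s\n=>\n%s\nbuggy code:\n%s\n",
		best.Buggy, best.Fixed, query)
}

func main() {
	store := []FixPair{{Buggy: "if x == nil { x.Close() }", Fixed: "if x != nil { x.Close() }"}}
	fmt.Print(buildInput("if f == nil { f.Close() }", store))
}
```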
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
- Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
- DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons [5.564793925574796]
We present an approach to automated debugging using large, pretrained transformers.
We start by training a bug-creation model on reversed commit data for the purpose of generating synthetic bugs.
Next, we focus on 10K repositories for which we can execute tests, and create buggy versions of all functions that are covered by passing tests.
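A hedged sketch of the reversed-commit-data step, with illustrative types: reversing (buggy, fixed) repair pairs yields (fixed, buggy) pairs on which a bug-creation model can be trained to synthesize bugs into functions covered by passing tests.

```go
package main

import "fmt"

// Example is an illustrative source-to-target training pair.
type Example struct{ Source, Target string }

// reverse turns repair examples (buggy -> fixed) into bug-creation examples
// (fixed -> buggy).
func reverse(repair []Example) []Example {
	out := make([]Example, len(repair))
	for i, ex := range repair {
		out[i] = Example{Source: ex.Target, Target: ex.Source}
	}
	return out
}

func main() {
	repair := []Example{{Source: "return a - b", Target: "return a + b"}}
	fmt.Println(reverse(repair)) // bug-creation pair: fixed code in, buggy code out
}
```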
arXiv Detail & Related papers (2021-05-19T18:40:16Z)
- Generating Bug-Fixes Using Pretrained Transformers [11.012132897417592]
We introduce a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories.
We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch.
We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art.
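As a hedged illustration of that metric split, the following Go sketch classifies a candidate fix as deletion-only when every line it keeps already appeared in the buggy code; this is an approximation for illustration, not the paper's exact definition.

```go
package main

import (
	"fmt"
	"strings"
)

// deletionOnly reports whether the candidate fix merely removes lines from
// the buggy code without adding or rewriting any line.
func deletionOnly(buggy, fixed string) bool {
	remaining := map[string]int{}
	for _, l := range strings.Split(buggy, "\n") {
		remaining[strings.TrimSpace(l)]++
	}
	for _, l := range strings.Split(fixed, "\n") {
		t := strings.TrimSpace(l)
		if remaining[t] == 0 {
			return false // the fix adds or rewrites a line
		}
		remaining[t]--
	}
	return true
}

func main() {
	fmt.Println(deletionOnly("a\nb\nc", "a\nc")) // true: pure deletion
	fmt.Println(deletionOnly("a\nb", "a\nd"))    // false: rewrites a line
}
```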
arXiv Detail & Related papers (2021-04-16T05:27:04Z)