Too Few Bug Reports? Exploring Data Augmentation for Improved
Changeset-based Bug Localization
- URL: http://arxiv.org/abs/2305.16430v2
- Date: Thu, 1 Jun 2023 13:24:06 GMT
- Title: Too Few Bug Reports? Exploring Data Augmentation for Improved
Changeset-based Bug Localization
- Authors: Agnieszka Ciborowska and Kostadin Damevski
- Abstract summary: We propose novel data augmentation operators that act on different constituent components of bug reports.
We also describe a data balancing strategy that aims to create a corpus of augmented bug reports.
- Score: 7.884766610628946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern Deep Learning (DL) architectures based on transformers (e.g., BERT,
RoBERTa) are exhibiting performance improvements across a number of natural
language tasks. While such DL models have shown tremendous potential for use in
software engineering applications, they are often hampered by insufficient
training data. Particularly constrained are applications that require
project-specific data, such as bug localization, which aims at recommending
code to fix a newly submitted bug report. Deep learning models for bug
localization require a substantial training set of fixed bug reports, which are
at a limited quantity even in popular and actively developed software projects.
In this paper, we examine the effect of using synthetic training data on
transformer-based DL models that perform a more complex variant of bug
localization, which has the goal of retrieving bug-inducing changesets for each
bug report. To generate high-quality synthetic data, we propose novel data
augmentation operators that act on different constituent components of bug
reports. We also describe a data balancing strategy that aims to create a
corpus of augmented bug reports that better reflects the entire source code
base, because existing bug reports used as training data usually reference a
small part of the code base.
Related papers
- Supporting Cross-language Cross-project Bug Localization Using Pre-trained Language Models [2.5121668584771837]
Existing techniques often struggle with generalizability and deployment due to their reliance on application-specific data.
This paper proposes a novel pre-trained language model (PLM) based technique for bug localization that transcends project and language boundaries.
arXiv Detail & Related papers (2024-07-03T01:09:36Z) - Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization [0.7564784873669823]
Bug localization refers to the identification of source code files which is in a programming language.
Our study evaluated 14 distinct embedding models to gain insights into the effects of various design choices.
Our findings indicate that the pre-training strategies significantly affect the quality of the embedding.
arXiv Detail & Related papers (2024-06-25T15:01:39Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs)
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - On Using GUI Interaction Data to Improve Text Retrieval-based Bug
Localization [10.717184444794505]
We investigate the hypothesis that, for end user-facing applications, connecting information in a bug report with information from the GUI, can improve upon existing techniques for bug localization.
We source the current largest dataset of fully-localized and reproducible real bugs for Android apps, with corresponding bug reports.
arXiv Detail & Related papers (2023-10-12T07:14:22Z) - WELL: Applying Bug Detectors to Bug Localization via Weakly Supervised
Learning [37.09621161662761]
This paper proposes a WEakly supervised bug LocaLization (WELL) method to train a bug localization model.
With CodeBERT finetuned on the buggy-or-not binary labeled data, WELL can address bug localization in a weakly supervised manner.
arXiv Detail & Related papers (2023-05-27T06:34:26Z) - Auto-labelling of Bug Report using Natural Language Processing [0.0]
Rule and Query-based solutions recommend a long list of potential similar bug reports with no clear ranking.
In this paper, we have proposed a solution using a combination of NLP techniques.
It uses a custom data transformer, a deep neural network, and a non-generalizing machine learning method to retrieve existing identical bug reports.
arXiv Detail & Related papers (2022-12-13T02:32:42Z) - Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z) - BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z) - DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
arXiv Detail & Related papers (2022-01-14T00:16:57Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.