Automated Duplicate Bug Report Detection in Large Open Bug Repositories
- URL: http://arxiv.org/abs/2504.14797v1
- Date: Mon, 21 Apr 2025 01:55:54 GMT
- Title: Automated Duplicate Bug Report Detection in Large Open Bug Repositories
- Authors: Clare E. Laney, Andrew Barovic, Armin Moin
- Abstract summary: Many users and contributors of large open-source projects report software defects or enhancement requests (known as bug reports) to the issue-tracking systems.
We propose a novel approach based on machine learning methods that can automatically detect duplicate bug reports in an open bug repository.
- Score: 3.481985817302898
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Many users and contributors of large open-source projects report software defects or enhancement requests (known as bug reports) to the issue-tracking systems. However, they sometimes report issues that have already been reported. First, they may not have time to do sufficient research on existing bug reports. Second, they may not possess the right expertise in that specific area to realize that an existing bug report is essentially elaborating on the same matter, perhaps with different wording. In this paper, we propose a novel approach based on machine learning methods that can automatically detect duplicate bug reports in an open bug repository based on the textual data in the reports. We present six alternative methods: topic modeling, Gaussian Naive Bayes, deep learning, time-based organization, clustering, and summarization using a generative pre-trained transformer large language model. Additionally, we introduce a novel threshold-based approach for duplicate identification, in contrast to the conventional top-k selection method that has been widely used in the literature. Our approach demonstrates promising results across all the proposed methods, achieving accuracy rates ranging from the high 70s to the low 90s percent. We evaluated our methods on a public dataset of issues belonging to an Eclipse open-source project.
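The threshold-based identification idea contrasts usefully with top-k selection: top-k always returns k candidates, even for a genuinely new report, while a threshold can return none. Below is a minimal sketch of that contrast using TF-IDF vectors and cosine similarity; the vectorizer choice, the toy reports, and the 0.7 cutoff are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: threshold-based duplicate identification vs. top-k selection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

existing_reports = [
    "Editor crashes when opening large XML files",
    "Syntax highlighting broken after update",
    "Application crash on opening big XML documents",
]
new_report = "Crash while loading a very large XML file in the editor"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(existing_reports + [new_report])
sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

# Conventional top-k selection: always returns the k most similar reports,
# whether or not any of them is actually a duplicate.
k = 2
top_k = sims.argsort()[::-1][:k]

# Threshold-based selection: flags a duplicate only if similarity exceeds a
# tuned cutoff, so a genuinely new report can yield no candidates at all.
threshold = 0.7  # assumed value; in practice tuned on held-out data
duplicates = [i for i, s in enumerate(sims) if s >= threshold]

print("top-k candidates:", [(existing_reports[i], round(sims[i], 2)) for i in top_k])
print("above-threshold duplicates:", [existing_reports[i] for i in duplicates])
```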
Related papers
- Automated Bug Report Prioritization in Large Open-Source Projects [3.9134031118910264]
We propose a novel approach to automated bug prioritization based on the natural language text of the bug reports.
We conduct topic modeling using a variant of LDA called TopicMiner-MTM and text classification with the BERT large language model.
Experimental results using an existing reference dataset containing 85,156 bug reports of the Eclipse Platform project indicate that we outperform existing approaches in terms of Accuracy, Precision, Recall, and F1-measure of the bug report priority prediction.
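As a rough sketch of the topic-modeling step: TopicMiner-MTM is not a publicly packaged library, so plain LDA from scikit-learn stands in below; the toy corpus and the number of topics are assumptions for illustration only.

```python
# Minimal sketch: LDA topic modeling over bug report text, as a stand-in for
# the TopicMiner-MTM variant named in the summary above.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

bug_reports = [
    "NullPointerException in workspace refresh",
    "UI freezes when resizing the editor pane",
    "Workspace refresh fails with stale file handles",
    "Editor pane rendering glitch on resize",
]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(bug_reports)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-report topic distribution

# The topic distribution can then feed a downstream priority classifier
# (the paper pairs topic modeling with BERT-based text classification).
print(doc_topics.round(2))
```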
arXiv Detail & Related papers (2025-04-22T13:57:48Z)
- Understanding the Impact of Domain Term Explanation on Duplicate Bug Report Detection [2.9312156642007294]
Duplicate bug reports make up 42% of all reports in bug tracking systems (e.g., Bugzilla).
Traditional techniques often focus on detecting textually similar duplicates.
About 78% of bug reports in open-source projects are very short (e.g., less than 100 words), often containing domain-specific terms or jargon.
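A minimal sketch of the underlying idea follows: short, jargon-heavy reports are expanded with explanations of domain terms before similarity matching. The glossary and the expansion strategy here are illustrative assumptions, not the paper's method.

```python
# Minimal sketch: expand jargon-heavy bug reports with domain term
# explanations before running duplicate detection.
GLOSSARY = {  # hypothetical glossary for illustration
    "NPE": "null pointer exception, dereferencing a null object reference",
    "SWT": "standard widget toolkit, the Eclipse user interface library",
}

def expand_report(text: str, glossary: dict[str, str]) -> str:
    """Append an explanation for every known domain term found in the text."""
    explanations = [exp for term, exp in glossary.items() if term in text.split()]
    return text + " " + " ".join(explanations) if explanations else text

print(expand_report("NPE when closing SWT dialog", GLOSSARY))
```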
arXiv Detail & Related papers (2025-03-24T16:09:37Z)
- An Empirical Study on the Classification of Bug Reports with Machine Learning [1.1499574149885023]
We study how different factors (e.g., project language, report content) can influence the performance of models in classifying issue reports.
Using the report title or the description does not significantly affect performance; Support Vector Machine, Logistic Regression, and Random Forest are effective in classifying issue reports.
Models based on heterogeneous projects can classify reports from projects not present during training.
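A minimal sketch of such a comparison follows: the same TF-IDF features fed to the three classifier families the study highlights. The toy titles and labels (bug vs. enhancement) are illustrative assumptions.

```python
# Minimal sketch: compare SVM, logistic regression, and random forest on
# issue report classification with shared TF-IDF features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

titles = [
    "Crash when saving file", "Add dark mode option",
    "NullPointerException on startup", "Support export to CSV",
]
labels = ["bug", "enhancement", "bug", "enhancement"]

for clf in (LinearSVC(), LogisticRegression(), RandomForestClassifier()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(titles, labels)
    print(type(clf).__name__, model.predict(["Exception while saving"]))
```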
arXiv Detail & Related papers (2025-03-01T23:19:56Z)
- A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now in widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Recommending Bug Assignment Approaches for Individual Bug Reports: An Empirical Investigation [8.186068333538893]
Multiple approaches have been proposed to automatically recommend potential developers who can address bug reports.
These approaches are typically designed to work for any bug report submitted to any software project.
We conducted an empirical study to examine whether this one-size-fits-all design holds, applying three bug assignment approaches to 2,249 bug reports from two open source systems.
arXiv Detail & Related papers (2023-05-29T23:02:56Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Auto-labelling of Bug Report using Natural Language Processing [0.0]
Rule- and query-based solutions recommend a long list of potentially similar bug reports with no clear ranking.
In this paper, we have proposed a solution using a combination of NLP techniques.
It uses a custom data transformer, a deep neural network, and a non-generalizing machine learning method to retrieve existing identical bug reports.
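A minimal sketch of the retrieval step follows, reading "non-generalizing machine learning method" as an instance-based nearest-neighbour search; this interpretation, the toy data, and the use of TF-IDF in place of the paper's custom transformer and deep-network features are all assumptions.

```python
# Minimal sketch: instance-based retrieval of near-identical bug reports.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

reports = [
    "Build fails with missing dependency",
    "Dark theme resets after restart",
    "Compilation error: dependency not found",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(reports)

# Non-generalizing: the model memorizes the corpus and matches against it.
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(X)
dist, idx = nn.kneighbors(vec.transform(["Missing dependency breaks the build"]))
print(reports[idx[0][0]], "distance:", round(dist[0][0], 2))
```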
arXiv Detail & Related papers (2022-12-13T02:32:42Z)
- Automatic Classification of Bug Reports Based on Multiple Text Information and Reports' Intention [37.67372105858311]
This paper proposes a new automatic classification method for bug reports.
The key novelty is that, in addition to the report's text, the classification also considers the report's intention.
Our proposed method achieves better performance, with an F-measure ranging from 87.3% to 95.5%.
arXiv Detail & Related papers (2022-08-02T06:44:51Z)
- Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
- DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
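A minimal sketch of a bidirectional RNN with attention over stack-trace tokens follows, in the spirit of the summary above; the vocabulary size, dimensions, and GRU/attention details are assumptions, not the authors' exact architecture.

```python
# Minimal sketch: biRNN + attention encoder that scores candidate assignees
# for a stack trace (PyTorch).
import torch
import torch.nn as nn

class StackTraceEncoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden=64, n_devs=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)      # scores each trace token
        self.out = nn.Linear(2 * hidden, n_devs)  # ranks candidate assignees

    def forward(self, frames):                    # frames: (batch, seq)
        h, _ = self.rnn(self.embed(frames))       # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)
        context = (weights * h).sum(dim=1)        # attention-pooled trace
        return self.out(context)                  # per-developer scores

model = StackTraceEncoder()
logits = model(torch.randint(0, 1000, (2, 12)))   # two traces, 12 tokens each
print(logits.shape)  # torch.Size([2, 10])
```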
arXiv Detail & Related papers (2022-01-14T00:16:57Z)
- S3M: Siamese Stack (Trace) Similarity Measure [55.58269472099399]
We present S3M -- the first approach to computing stack trace similarity based on deep learning.
It is based on a biLSTM encoder and a fully-connected classifier to compute similarity.
Our experiments demonstrate the superiority of our approach over the state-of-the-art on both open-sourced data and a private JetBrains dataset.
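A minimal sketch of a siamese biLSTM with a fully-connected similarity head follows, per the summary above; all dimensions and the feature combination (elementwise product plus absolute difference) are illustrative assumptions.

```python
# Minimal sketch: shared biLSTM encoder + fully-connected head that scores
# the similarity of two stack traces (PyTorch).
import torch
import torch.nn as nn

class SiameseStackTraceModel(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def encode(self, trace):
        _, (h, _) = self.encoder(self.embed(trace))
        return torch.cat([h[0], h[1]], dim=-1)    # final states, both directions

    def forward(self, trace_a, trace_b):
        a, b = self.encode(trace_a), self.encode(trace_b)
        features = torch.cat([a * b, (a - b).abs()], dim=-1)
        return self.classifier(features).squeeze(-1)  # similarity logit

model = SiameseStackTraceModel()
x = torch.randint(0, 1000, (2, 12))
print(model(x, x).shape)  # torch.Size([2])
```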
arXiv Detail & Related papers (2021-03-18T21:10:41Z)