Related papers: Gotta catch 'em all! Towards File Localisation from Issues at Large

URL: http://arxiv.org/abs/2507.18319v1
Date: Thu, 24 Jul 2025 11:42:13 GMT
Title: Gotta catch 'em all! Towards File Localisation from Issues at Large
Authors: Jesse Maarleveld, Jiapan Guo, Daniel Feitosa
Abstract summary: This work provides a data pipeline for the creation of issue file localisation datasets. We provide a baseline performance evaluation for the file localisation problem using traditional information retrieval approaches. We use statistical analysis to investigate the influence of biases known in the bug localisation community on our dataset.
Score: 2.1574657220935602
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Bug localisation, the study of developing methods to localise the files requiring changes to resolve bugs, has been researched for a long time to develop methods capable of saving developers' time. Recently, researchers are starting to consider issues outside of bugs. Nevertheless, most existing research into file localisation from issues focusses on bugs or uses other selection methods to ensure only certain types of issues are considered as part of the focus of the work. Our goal is to work on all issues at large, without any specific selection. In this work, we provide a data pipeline for the creation of issue file localisation datasets, capable of dealing with arbitrary branching and merging practices. We provide a baseline performance evaluation for the file localisation problem using traditional information retrieval approaches. Finally, we use statistical analysis to investigate the influence of biases known in the bug localisation community on our dataset. Our results show that methods designed using bug-specific heuristics perform poorly on general issue types, indicating a need for research into general purpose models. Furthermore, we find that there are small, but statistically significant differences in performance between different issue types. Finally, we find that the presence of identifiers has a small effect on performance for most issue types. Many results are project-dependent, encouraging the development of methods which can be tuned to project-specific characteristics.
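
Illustrative sketch (not from the paper): the abstract mentions a baseline built on traditional information retrieval approaches but does not say which ones. As an assumption, the Python example below uses plain TF-IDF with cosine similarity to rank files against an issue's text, and scores one ranking with reciprocal rank. The function names (tokenize, rank_files, reciprocal_rank) and the toy file contents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def rank_files(issue_text, files):
    """Rank file paths by TF-IDF cosine similarity to the issue text.

    `files` maps a file path to its textual content (hypothetical input shape).
    Ties are broken alphabetically by path so the output is deterministic.
    """
    docs = {path: Counter(tokenize(content)) for path, content in files.items()}
    n_docs = len(docs)

    # Document frequency: in how many files does each term occur?
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())

    def idf(term):
        # Smoothed inverse document frequency; unseen terms get df = 0.
        return math.log((1 + n_docs) / (1 + df[term])) + 1.0

    def vectorize(counts):
        return {term: count * idf(term) for term, count in counts.items()}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)

    query = vectorize(Counter(tokenize(issue_text)))
    scores = {path: cosine(query, vectorize(counts)) for path, counts in docs.items()}
    return sorted(scores, key=lambda path: (-scores[path], path))


def reciprocal_rank(ranking, changed_files):
    """Return 1 / rank of the first truly changed file, or 0.0 if none is ranked."""
    for position, path in enumerate(ranking, start=1):
        if path in changed_files:
            return 1.0 / position
    return 0.0
```

Averaging reciprocal_rank over many issues would give a Mean Reciprocal Rank (MRR) figure, one of the metrics commonly reported for bug localisation. A real pipeline would also need to choose which repository snapshot to index for each issue, which this sketch leaves out.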

Related papers

BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
High quality bugs are key to training the next generation of language model based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs.
arXiv Detail & Related papers (2025-10-22T17:58:56Z)
BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning [1.9854146581797698]
BLAZE is an approach that employs dynamic chunking and hard example learning. It fine-tunes a GPT-based model using challenging bug cases to enhance cross-project and cross-language bug localization. BLAZE achieves up to an increase of 120% in Top 1 accuracy, 144% in Mean Average Precision (MAP), and 100% in Mean Reciprocal Rank (MRR).
arXiv Detail & Related papers (2024-07-24T20:44:36Z)
Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization [0.7564784873669823]
Bug localization refers to the identification of source code files, which are written in a programming language. Our study evaluated 14 distinct embedding models to gain insights into the effects of various design choices. Our findings indicate that the pre-training strategies significantly affect the quality of the embedding.
arXiv Detail & Related papers (2024-06-25T15:01:39Z)
A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper proposes a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
On Using GUI Interaction Data to Improve Text Retrieval-based Bug Localization [10.717184444794505]
We investigate the hypothesis that, for end user-facing applications, connecting information in a bug report with information from the GUI, can improve upon existing techniques for bug localization. We source the current largest dataset of fully-localized and reproducible real bugs for Android apps, with corresponding bug reports.
arXiv Detail & Related papers (2023-10-12T07:14:22Z)
WELL: Applying Bug Detectors to Bug Localization via Weakly Supervised Learning [37.09621161662761]
This paper proposes a WEakly supervised bug LocaLization (WELL) method to train a bug localization model. With CodeBERT finetuned on the buggy-or-not binary labeled data, WELL can address bug localization in a weakly supervised manner.
arXiv Detail & Related papers (2023-05-27T06:34:26Z)
BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization. We provide a general benchmark with a diversity of real and synthetic Java bugs. We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets. We define a uniform evaluation setup including a new formalization of the annotation error detection task. We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features. Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem. The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network. To improve the quality of ranking, we propose using additional information from version control system annotations.
arXiv Detail & Related papers (2022-01-14T00:16:57Z)
Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
A Fault Localization and Debugging Support Framework driven by Bug Tracking Data [0.11915976684257382]
This thesis aims to provide a fault localization framework by combining data from various sources. To achieve this, a bug classification schema is introduced, benchmarks are created, and a novel fault localization method based on historical data is proposed.
arXiv Detail & Related papers (2021-03-03T13:23:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.