Related papers: BugMentor: Generating Answers to Follow-up Questions from Software Bug Reports using Structured Information Retrieval and Neural Text Generation

URL: http://arxiv.org/abs/2304.12494v4
Date: Fri, 12 Sep 2025 21:03:03 GMT
Title: BugMentor: Generating Answers to Follow-up Questions from Software Bug Reports using Structured Information Retrieval and Neural Text Generation
Authors: Usmi Mukherjee, Mohammad Masudur Rahman
Abstract summary: We propose BugMentor, a novel approach that combines structured information retrieval and neural text generation to generate appropriate answers to follow-up questions. Our technique identifies the past relevant bug reports to a given bug report, captures contextual information, and then leverages it to generate the answers. We achieve a BLEU Score of up to 72 and a Semantic Similarity of up to 92, indicating that our technique can generate understandable and good answers to the follow-up questions.
Score: 0.9298382208776371
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Software bug reports often lack crucial information (e.g., steps to reproduce), which makes bug resolution challenging. Developers thus ask follow-up questions to capture additional information. However, according to existing evidence, bug reporters often face difficulties answering them, which leads to the premature closing of bug reports without any resolution. Recent studies suggest follow-up questions to support the developers, but answering the follow-up questions still remains a major challenge. In this paper, we propose BugMentor, a novel approach that combines structured information retrieval and neural text generation (e.g., Mistral) to generate appropriate answers to the follow-up questions. Our technique identifies the past relevant bug reports to a given bug report, captures contextual information, and then leverages it to generate the answers. We evaluate our generated answers against the ground truth answers using four appropriate metrics, including the BLEU Score and the Semantic Similarity. We achieve a BLEU Score of up to 72 and a Semantic Similarity of up to 92, indicating that our technique can generate understandable and good answers to the follow-up questions according to Google's AutoML Translation documentation. Our technique also outperforms four existing baselines with a statistically significant margin. We also conduct a developer study involving 23 participants where the answers from our technique were found to be more accurate, more precise, more concise and more useful.
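The abstract describes a retrieve-then-generate flow: find past bug reports relevant to the current one, collect their context, and pass it to a text generator. The following is only a minimal sketch of that general idea, not the authors' implementation. It assumes a plain bag-of-words cosine similarity in place of the paper's structured information retrieval, and it stops at building a prompt string rather than calling any model. All names (`retrieve`, `build_prompt`, the `id` and `text` fields) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, past_reports, k=2):
    """Return up to k past reports ranked by similarity to the query.

    Assumption: each report is a dict with hypothetical 'id' and 'text' keys.
    Reports with zero similarity are dropped.
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(r["text"]))), r) for r in past_reports]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for score, r in scored[:k] if score > 0]


def build_prompt(bug_report, question, context_reports):
    """Assemble a prompt from the bug report, the question, and retrieved context."""
    context = "\n".join(f"- {r['text']}" for r in context_reports)
    return (
        f"Bug report: {bug_report}\n"
        f"Related past reports:\n{context}\n"
        f"Question: {question}\n"
        f"Answer:"
    )


# Illustrative data only; not taken from the paper's dataset.
past = [
    {"id": 1, "text": "Crash on opening settings page in Android app"},
    {"id": 2, "text": "Typo in documentation README"},
    {"id": 3, "text": "Settings page crashes after update"},
]
report = "App crashes when opening settings page on Android"
hits = retrieve(report, past, k=2)
print(build_prompt(report, "Which Android version are you using?", hits))
```

The resulting prompt would then be handed to a text generation model; which model to use and how to prompt it are left open here.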

Related papers

No Stupid Questions: An Analysis of Question Query Generation for Citation Recommendation [29.419731388642393]
GPT-4o-mini asks questions which, when answered, could expose new insights about an excerpt from a scientific article. We evaluate the utility of these questions as retrieval queries, measuring their effectiveness in retrieving and ranking masked target documents.
arXiv Detail & Related papers (2025-06-09T20:13:32Z)
Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources. We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z)
Towards Detecting Prompt Knowledge Gaps for Improved LLM-guided Issue Resolution [3.768737590492549]
We analyze 433 developer-ChatGPT conversations within GitHub issue threads to examine the impact of prompt knowledge gaps and conversation styles on issue resolution. We find that ineffective conversations contain knowledge gaps in 44.6% of prompts, compared to only 12.6% in effective ones.
arXiv Detail & Related papers (2025-01-20T19:41:42Z)
Improved IR-based Bug Localization with Intelligent Relevance Feedback [2.9312156642007294]
Software bugs pose a significant challenge during development and maintenance, and practitioners spend nearly 50% of their time dealing with bugs. Many existing techniques adopt Information Retrieval (IR) to localize a reported bug using textual and semantic relevance between bug reports and source code. We present a novel technique for bug localization - BRaIn - that addresses the contextual gaps by assessing the relevance between bug reports and code.
arXiv Detail & Related papers (2025-01-17T20:29:38Z)
Open Domain Question Answering with Conflicting Contexts [55.739842087655774]
We find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search. We ask our annotators to provide explanations for their selections of correct answers.
arXiv Detail & Related papers (2024-10-16T07:24:28Z)
I Could've Asked That: Reformulating Unanswerable Questions [89.93173151422636]
We evaluate open-source and proprietary models for reformulating unanswerable questions. GPT-4 and Llama2-7B successfully reformulate questions only 26% and 12% of the time, respectively. We publicly release the benchmark and the code to reproduce the experiments.
arXiv Detail & Related papers (2024-07-24T17:59:07Z)
Localizing and Mitigating Errors in Long-form Question Answering [79.63372684264921]
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers.
arXiv Detail & Related papers (2024-07-16T17:23:16Z)
Alexpaca: Learning Factual Clarification Question Generation Without Examples [19.663171923249283]
We present a new task that focuses on the ability to elicit missing information in multi-hop reasoning tasks. Humans outperform GPT-4 by a large margin, while Llama 3 8B Instruct does not even beat the dummy baseline in some metrics.
arXiv Detail & Related papers (2023-10-17T20:40:59Z)
Answering Ambiguous Questions with a Database of Questions, Answers, and Revisions [95.92276099234344]
We present a new state-of-the-art for answering ambiguous questions that exploits a database of unambiguous questions generated from Wikipedia. Our method improves performance by 15% on recall measures and 10% on measures which evaluate disambiguating questions from predicted outputs.
arXiv Detail & Related papers (2023-08-16T20:23:16Z)
Prompting Is All You Need: Automated Android Bug Replay with Large Language Models [28.69675481931385]
We propose AdbGPT, a new lightweight approach to automatically reproduce the bugs from bug reports through prompt engineering. AdbGPT leverages few-shot learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning from LLMs. Our evaluations demonstrate the effectiveness and efficiency of our AdbGPT to reproduce 81.3% of bug reports in 253.6 seconds.
arXiv Detail & Related papers (2023-06-03T03:03:52Z)
Auto-labelling of Bug Report using Natural Language Processing [0.0]
Rule and Query-based solutions recommend a long list of potential similar bug reports with no clear ranking. In this paper, we have proposed a solution using a combination of NLP techniques. It uses a custom data transformer, a deep neural network, and a non-generalizing machine learning method to retrieve existing identical bug reports.
arXiv Detail & Related papers (2022-12-13T02:32:42Z)
Explaining Software Bugs Leveraging Code Structures in Neural Machine Translation [5.079750706023254]
Bugsplainer generates natural language explanations for software bugs by learning from a large corpus of bug-fix commits. Our evaluation using three performance metrics shows that Bugsplainer can generate understandable and good explanations according to Google's standard. We also conduct a developer study involving 20 participants where the explanations from Bugsplainer were found to be more accurate, more precise, more concise and more useful than the baselines.
arXiv Detail & Related papers (2022-12-08T22:19:45Z)
Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers. We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
Automatic Classification of Bug Reports Based on Multiple Text Information and Reports' Intention [37.67372105858311]
This paper proposes a new automatic classification method for bug reports. The innovation is that, when categorizing bug reports, the method considers the intention of the report in addition to its text. Our proposed method achieves better performance, with an F-Measure ranging from 87.3% to 95.5%.
arXiv Detail & Related papers (2022-08-02T06:44:51Z)
A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers [66.11048565324468]
We present a dataset of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers.
arXiv Detail & Related papers (2021-05-07T00:12:34Z)
Attention-based model for predicting question relatedness on Stack Overflow [0.0]
We propose an Attention-based Sentence pair Interaction Model (ASIM) to predict the relatedness between questions on Stack Overflow automatically. ASIM has made significant improvement over the baseline approaches in Precision, Recall, and Micro-F1 evaluation metrics. Our model also performs well in the duplicate question detection task of Ask Ubuntu.
arXiv Detail & Related papers (2021-03-19T12:18:03Z)
Challenges in Information-Seeking QA: Unanswerable Questions and Paragraph Retrieval [46.3246135936476]
We analyze why answering information-seeking queries is more challenging and where their prevalent unanswerabilities arise. Our controlled experiments suggest two headrooms -- paragraph selection and answerability prediction. We manually annotate 800 unanswerable examples across six languages on what makes them challenging to answer.
arXiv Detail & Related papers (2020-10-22T17:48:17Z)
Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document. We show that readers engage in a series of pragmatic strategies to seek information. We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.