Related papers: A Comparison of Conversational Models and Humans in Answering Technical Questions: the Firefox Case

URL: http://arxiv.org/abs/2510.21933v1
Date: Fri, 24 Oct 2025 18:05:01 GMT
Title: A Comparison of Conversational Models and Humans in Answering Technical Questions: the Firefox Case
Authors: Joao Correia, Daniel Coutinho, Marco Castelluccio, Caio Barbosa, Rafael de Mello, Anita Sarma, Alessandro Garcia, Marco Gerosa, Igor Steinmacher
Abstract summary: This study evaluates the effectiveness of Retrieval-Augmented Generation in assisting developers within the Mozilla Firefox project. We conducted an empirical analysis comparing responses from human developers, a standard GPT model, and a GPT model enhanced with RAG. The results show the potential to apply RAG-based tools to Open Source Software to minimize the load to core maintainers without losing answer quality.
Score: 41.39414744243529
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The use of Large Language Models (LLMs) to support tasks in software development has steadily increased over recent years, from assisting developers in coding activities to providing conversational agents that answer newcomers' questions. In collaboration with the Mozilla Foundation, this study evaluates the effectiveness of Retrieval-Augmented Generation (RAG) in assisting developers within the Mozilla Firefox project. We conducted an empirical analysis comparing responses from human developers, a standard GPT model, and a GPT model enhanced with RAG, using real queries from Mozilla's developer chat rooms. To ensure a rigorous evaluation, Mozilla experts assessed the responses based on helpfulness, comprehensiveness, and conciseness. The results show that RAG-assisted responses were more comprehensive than those of human developers (62.50% to 54.17%) and almost as helpful (75.00% to 79.17%), suggesting RAG's potential to enhance developer assistance. However, the RAG responses were not as concise and were often verbose. The results show the potential to apply RAG-based tools to Open Source Software (OSS) to minimize the load to core maintainers without losing answer quality. Toning down retrieval mechanisms and making responses even shorter in the future would enhance developer assistance in massive projects like Mozilla Firefox.

Related papers

Human-Aligned Enhancement of Programming Answers with LLMs Guided by User Feedback [3.1358838725251683]
Large Language Models (LLMs) are widely used to support software developers in tasks such as code generation, optimization, and documentation. Yet their ability to improve existing programming answers in a human-like manner remains underexplored. This study investigates whether LLMs can enhance programming answers by interpreting and incorporating comment-based feedback.
arXiv Detail & Related papers (2026-01-24T21:50:36Z)
EvolveSearch: An Iterative Self-Evolving Search Agent [98.18686493123785]
Large language models (LLMs) have transformed agentic information seeking capabilities through the integration of tools such as search engines and web browsers. We propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data.
arXiv Detail & Related papers (2025-05-28T15:50:48Z)
Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation [52.3707788779464]
We introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD). ARC-JSD enables efficient and accurate identification of essential context sentences without additional fine-tuning, gradient-calculation or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements.
arXiv Detail & Related papers (2025-05-22T09:04:03Z)
Unanswerability Evaluation for Retrieval Augmented Generation [74.3022365715597]
UAEval4RAG is a framework designed to evaluate whether RAG systems can handle unanswerable queries effectively. We define a taxonomy with six unanswerable categories, and UAEval4RAG automatically synthesizes diverse and challenging queries.
arXiv Detail & Related papers (2024-12-16T19:11:55Z)
Assessing the Answerability of Queries in Retrieval-Augmented Code Generation [7.68409881755304]
This study proposes a task for evaluating answerability, which assesses whether valid answers can be generated. We build a benchmark dataset called Retrieval-augmented Code Generability Evaluation (RaCGEval) to evaluate the performance of models performing this task.
arXiv Detail & Related papers (2024-11-08T13:09:14Z)
ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions [52.33835101586687]
We study out-of-scope questions, where the retrieved document appears semantically similar to the question but lacks the necessary information to answer it. We propose a guided hallucination-based approach ELOQ to automatically generate a diverse set of out-of-scope questions from post-cutoff documents.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study [45.69867169347836]
Retrieval-augmented generation (RAG) is an effective approach to mitigating the hallucination of large language models (LLMs) through the integration of external knowledge. In this paper, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse domains. We also develop a plug-and-play RAG framework, PruningRAG, whose main characteristic is the use of multi-granularity pruning strategies.
arXiv Detail & Related papers (2024-09-03T03:31:37Z)
LLM Agents Improve Semantic Code Search [6.047454623201181]
We introduce the approach of using Retrieval Augmented Generation powered agents to inject information into user prompts. By utilizing RAG, agents enhance user queries with relevant details from GitHub repositories, making them more informative and contextually aligned. Experimental results on the CodeSearchNet dataset demonstrate that RepoRift significantly outperforms existing methods.
arXiv Detail & Related papers (2024-08-05T00:43:56Z)
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering [61.19126689470398]
Long-form RobustQA (LFRQA) is a new dataset covering 26K queries and large corpora across seven different domains. We show via experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
arXiv Detail & Related papers (2024-07-19T03:02:51Z)
Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human in the Loop [44.51779041553597]
Large Language Models have found application in mundane and repetitive tasks including Human Resource (HR) support. We developed an HR support chatbot as an efficient and effective tool for addressing employee inquiries. Our experiments and evaluation conclude that GPT-4 outperforms other models and can overcome inconsistencies in data. Through expert analysis, we infer that reference-free evaluation metrics such as G-Eval demonstrate reliability closely aligned with that of human evaluation.
arXiv Detail & Related papers (2024-07-08T13:32:14Z)
StackRAG Agent: Improving Developer Answers with Retrieval-Augmented Generation [2.225268436173329]
StackRAG is a retrieval-augmented multi-agent generation tool based on Large Language Models. It combines the two worlds: aggregating the knowledge from Stack Overflow (SO) to enhance the reliability of the generated answers. Initial evaluations show that the generated answers are correct, accurate, relevant, and useful.
arXiv Detail & Related papers (2024-06-19T21:07:35Z)
FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [70.6584488911715]
Retrieval-augmented generation (RAG) has attracted considerable research attention. Existing RAG toolkits are often heavy and inflexible, failing to meet the customization needs of researchers. Our toolkit has implemented 16 advanced RAG methods and gathered and organized 38 benchmark datasets.
arXiv Detail & Related papers (2024-05-22T12:12:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.