Reproducibility of Issues Reported in Stack Overflow Questions: Challenges, Impact & Estimation
- URL: http://arxiv.org/abs/2407.10023v1
- Date: Sat, 13 Jul 2024 22:55:35 GMT
- Title: Reproducibility of Issues Reported in Stack Overflow Questions: Challenges, Impact & Estimation
- Authors: Saikat Mondal, Banani Roy
- Abstract summary: Software developers often submit questions to technical Q&A sites like Stack Overflow (SO) to resolve code-level problems.
In practice, they include example code snippets with questions to explain the programming issues.
Unfortunately, such code snippets cannot always reproduce the reported issues due to several unmet challenges.
- Score: 2.2160604288512324
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Software developers often submit questions to technical Q&A sites like Stack Overflow (SO) to resolve code-level problems. In practice, they include example code snippets with questions to explain the programming issues. Existing research suggests that users attempt to reproduce the reported issues using the given code snippets when answering questions. Unfortunately, such code snippets cannot always reproduce the issues due to several unmet challenges, which prevents questions from receiving appropriate and prompt solutions. One previous study investigated reproducibility challenges and produced a catalog. However, how practitioners perceive this challenge catalog is unknown. Practitioners' perspectives are essential for validating these challenges and estimating their severity. This study first surveyed 53 practitioners to understand their perspectives on reproducibility challenges. We attempt to (a) see whether they agree with these challenges, (b) determine the impact of each challenge on answering questions, and (c) identify the need for tools to promote reproducibility. Survey results show that (a) about 90% of the participants agree with the challenges, (b) "missing an important part of code" hurts reproducibility most severely, and (c) participants strongly recommend introducing automated tool support to promote reproducibility. Second, we extract nine code-based features (e.g., LOC, compilability) and build five Machine Learning (ML) models to predict issue reproducibility. Early detection might help users improve code snippets and their reproducibility. Our models achieve 84.5% precision, 83.0% recall, 82.8% F1-score, and 82.8% overall accuracy, which are highly promising. Third, we systematically interpret the ML model and explain how code snippets with reproducible issues differ from those with irreproducible issues.
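To make the feature-extraction and prediction step concrete, below is a minimal sketch of how code-based features (e.g., LOC, compilability) might feed a classifier that predicts issue reproducibility. The features beyond LOC and compilability, the use of Python's built-in compile() as a compilability proxy, the toy dataset, and the RandomForestClassifier are illustrative assumptions, not the authors' exact nine-feature pipeline or their five models.

```python
# Illustrative sketch only: a few code-based features (LOC, a crude compilability
# check, import usage, average line length) feeding a classifier that predicts
# whether a snippet's reported issue is reproducible. Feature choices beyond LOC
# and compilability, the toy labels, and the RandomForest choice are assumptions.
from typing import List

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def extract_features(snippet: str) -> List[float]:
    """Compute simple code-based features from a (Python) code snippet."""
    lines = [ln for ln in snippet.splitlines() if ln.strip()]
    loc = len(lines)
    try:
        compile(snippet, "<snippet>", "exec")  # syntactic compilability proxy
        compilable = 1.0
    except SyntaxError:
        compilable = 0.0
    has_import = float(any(ln.lstrip().startswith(("import ", "from ")) for ln in lines))
    avg_line_len = sum(len(ln) for ln in lines) / max(loc, 1)
    return [float(loc), compilable, has_import, avg_line_len]


# Toy dataset: (snippet, label) pairs; label 1 means the reported issue was
# reproducible from the snippet alone. Real data would come from labeled SO questions.
toy_data = [
    ("import math\nprint(math.sqrt(-1))", 1),         # complete snippet, reproduces the error
    ("def f(x):\n    return x +", 0),                  # incomplete code, cannot reproduce
    ("for i in range(3):\n    print(undefined)", 1),
    ("result = helper(", 0),
] * 10  # repeated so the train/test split has enough rows for the demo

X = [extract_features(code) for code, _ in toy_data]
y = [label for _, label in toy_data]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

print(f"precision: {precision_score(y_test, pred):.3f}")
print(f"recall:    {recall_score(y_test, pred):.3f}")
print(f"f1-score:  {f1_score(y_test, pred):.3f}")
print(f"accuracy:  {accuracy_score(y_test, pred):.3f}")
```

With such a classifier in place, a question-submission tool could warn askers when their snippet looks unlikely to reproduce the reported issue, echoing the automated tool support that the surveyed practitioners recommended.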
Related papers
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [102.31558123570437]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z) - SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 subproblems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z) - I Could've Asked That: Reformulating Unanswerable Questions [89.93173151422636]
We evaluate open-source and proprietary models for reformulating unanswerable questions.
GPT-4 and Llama2-7B successfully reformulate questions only 26% and 12% of the time, respectively.
We publicly release the benchmark and the code to reproduce the experiments.
arXiv Detail & Related papers (2024-07-24T17:59:07Z) - Localizing and Mitigating Errors in Long-form Question Answering [79.63372684264921]
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension.
This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers.
arXiv Detail & Related papers (2024-07-16T17:23:16Z) - PECC: Problem Extraction and Coding Challenges [3.287942619833188]
We introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler.
Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate code.
Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset.
arXiv Detail & Related papers (2024-04-29T15:02:14Z) - Can We Identify Stack Overflow Questions Requiring Code Snippets? Investigating the Cause & Effect of Missing Code Snippets [8.107650447105998]
On the Stack Overflow (SO) Q&A site, users often request solutions to their code-related problems.
However, they often omit required code snippets when submitting their questions.
This study investigates the cause & effect of missing code snippets in SO questions that require them.
arXiv Detail & Related papers (2024-02-07T04:25:31Z) - Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and problem types.
arXiv Detail & Related papers (2023-12-04T18:58:57Z) - Alexpaca: Learning Factual Clarification Question Generation Without Examples [19.663171923249283]
We present a new task that focuses on the ability to elicit missing information in multi-hop reasoning tasks.
Humans outperform GPT-4 by a large margin, while Llama 3 8B Instruct does not even beat the dummy baseline in some metrics.
arXiv Detail & Related papers (2023-10-17T20:40:59Z) - Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions about Code [0.0]
We analyzed the effectiveness of three generative pre-trained transformer (GPT) models in answering multiple-choice question (MCQ) assessments.
These findings can be leveraged by educators to adapt their instructional practices and assessments in programming courses.
arXiv Detail & Related papers (2023-03-09T16:52:12Z) - ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback [54.142719510638614]
In this paper, we frame the problem of providing feedback as few-shot classification.
A meta-learner adapts to give feedback on student code for a new programming question from just a few instructor-provided examples.
Our approach was successfully deployed to deliver feedback on 16,000 student exam solutions in a programming course offered by a tier-1 university.
arXiv Detail & Related papers (2021-07-23T22:41:28Z) - Hurdles to Progress in Long-form Question Answering [34.805039943215284]
We show that the task formulation raises fundamental challenges regarding evaluation and dataset creation.
We first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-03-10T20:32:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.