Related papers: Human-Aligned Enhancement of Programming Answers with LLMs Guided by User Feedback

Human-Aligned Enhancement of Programming Answers with LLMs Guided by User Feedback

URL: http://arxiv.org/abs/2601.17604v1
Date: Sat, 24 Jan 2026 21:50:36 GMT
Title: Human-Aligned Enhancement of Programming Answers with LLMs Guided by User Feedback
Authors: Suborno Deb Bappon, Saikat Mondal, Chanchal K. Roy, Kevin Schneider,
Abstract summary: Large Language Models (LLMs) are widely used to support software developers in tasks such as code generation, optimization, and documentation.<n>Yet their ability to improve existing programming answers in a human-like manner remains underexplored.<n>This study investigates whether LLMs can enhance programming answers by interpreting and incorporating comment-based feedback.
Score: 3.1358838725251683
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are widely used to support software developers in tasks such as code generation, optimization, and documentation. However, their ability to improve existing programming answers in a human-like manner remains underexplored. On technical question-and-answer platforms such as Stack Overflow (SO), contributors often revise answers based on user comments that identify errors, inefficiencies, or missing explanations. Yet roughly one-third of this feedback is never addressed due to limited time, expertise, or visibility, leaving many answers incomplete or outdated. This study investigates whether LLMs can enhance programming answers by interpreting and incorporating comment-based feedback. We make four main contributions. First, we introduce ReSOlve, a benchmark consisting of 790 SO answers with associated comment threads, annotated for improvement-related and general feedback. Second, we evaluate four state-of-the-art LLMs on their ability to identify actionable concerns, finding that DeepSeek achieves the best balance between precision and recall. Third, we present AUTOCOMBAT, an LLM-powered tool that improves programming answers by jointly leveraging user comments and question context. Compared to human revised references, AUTOCOMBAT produces near-human quality improvements while preserving the original intent and significantly outperforming the baseline. Finally, a user study with 58 practitioners shows strong practical value, with 84.5 percent indicating they would adopt or recommend the tool. Overall, AUTOCOMBAT demonstrates the potential of scalable, feedback-driven answer refinement to improve the reliability and trustworthiness of technical knowledge platforms.

Related papers

EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits [72.23150343093447]
We introduce EDIT-Bench, a benchmark for evaluatingstructed code editing capabilities grounded in real-world usage.<n>EDIT-Bench comprises of 545 problems, multiple natural and programming languages, and a diverse set of real-world use cases.<n>We find that model performance varies across different categories of user instructions.
arXiv Detail & Related papers (2025-11-06T16:05:28Z)
Do AI models help produce verified bug fixes? [62.985237003585674]
Large Language Models are used to produce corrections to software bugs.<n>This paper investigates how programmers use Large Language Models to complement their own skills.<n>The results are a first step towards a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
arXiv Detail & Related papers (2025-07-21T17:30:16Z)
FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks [28.849481030601666]
We introduce FeedbackEval, a benchmark for evaluating large language models' feedback comprehension and performance.<n>We conduct a comprehensive empirical study on five state-of-the-art LLMs, including GPT-4o, Claude-3.5, Gemini-1.5, GLM-4, and Qwen2.5.<n>Our results show that structured feedback, particularly in the form of test feedback, leads to the highest repair success rates, while unstructured feedback proves significantly less effective.
arXiv Detail & Related papers (2025-04-09T14:43:08Z)
ACE-RLHF: Automated Code Evaluation and Socratic Feedback Generation Tool using Large Language Models and Reinforcement Learning with Human Feedback [4.503215272392276]
Large Language Models (LLMs) for code-feedback generation is crucial.<n>LLMs generate more comprehensible feedback than compiler-generated error messages.<n>Reinforcement Learning with Human Feedback (RLHF) helps novice students to lean programming from scratch interactively.
arXiv Detail & Related papers (2025-04-07T01:11:22Z)
Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation [5.6001617185032595]
Large language models pretrained on both programming and natural language data tend to perform well in code-oriented tasks. We fine-tune open-source Large language models (LLM) in parameter-efficient, quantized low-rank fashion on consumer-grade hardware to improve review comment generation.
arXiv Detail & Related papers (2024-11-15T12:01:38Z)
Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub. 83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
Large Language Models as Evaluators for Recommendation Explanations [23.938202791437337]
We investigate whether LLMs can serve as evaluators of recommendation explanations. We design and apply a 3-level meta evaluation strategy to measure the correlation between evaluator labels and the ground truth provided by users. Our study verifies that utilizing LLMs as evaluators can be an accurate, reproducible and cost-effective solution for evaluating recommendation explanation texts.
arXiv Detail & Related papers (2024-06-05T13:23:23Z)
KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions [63.307317584926146]
Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. We construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain.
arXiv Detail & Related papers (2024-03-06T17:16:44Z)
Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [46.667783153759636]
We propose a framework for feedback generation that optimize both correctness and alignment using reinforcement learning (RL)<n>Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO)
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
Self-Knowledge Guided Retrieval Augmentation for Large Language Models [59.771098292611846]
Large language models (LLMs) have shown superior performance without task-specific fine-tuning. Retrieval-based methods can offer non-parametric world knowledge and improve the performance on tasks such as question answering. Self-Knowledge guided Retrieval augmentation (SKR) is a simple yet effective method which can let LLMs refer to the questions they have previously encountered.
arXiv Detail & Related papers (2023-10-08T04:22:33Z)
UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present textscUltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset. Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.