Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving
- URL: http://arxiv.org/abs/2503.14630v1
- Date: Tue, 18 Mar 2025 18:31:36 GMT
- Title: Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving
- Authors: Priscylla Silva, Evandro Costa
- Abstract summary: Large Language Models (LLMs) have emerged as potential tools to automate feedback generation. This study evaluates the performance of four LLMs on a benchmark dataset of 45 student solutions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Providing effective feedback is important for student learning in programming problem-solving. In this context, Large Language Models (LLMs) have emerged as potential tools to automate feedback generation. However, their reliability and ability to identify reasoning errors in student code remain poorly understood. This study evaluates the performance of four LLMs (GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro) on a benchmark dataset of 45 student solutions. We assessed the models' capacity to provide accurate and insightful feedback, particularly in identifying reasoning mistakes. Our analysis reveals that 63% of feedback hints were accurate and complete, while 37% contained mistakes, including incorrect line identification, flawed explanations, or hallucinated issues. These findings highlight the potential and limitations of LLMs in programming education and underscore the need for improvements to enhance reliability and minimize risks in educational applications.
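The paper does not include its prompting code, but the evaluation it describes amounts to sending each model the task specification together with a student solution and then rating the returned hints. The sketch below is a minimal illustration of such a pass, assuming the OpenAI Python client; the prompt wording and the hint categories are assumptions for illustration, not the authors' exact protocol.

```python
# Minimal sketch of an LLM feedback-generation pass over student solutions.
# Assumes the OpenAI Python client and an API key in the environment; the
# prompt wording and the rating categories below are illustrative, not the
# authors' exact setup.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()

@dataclass
class Submission:
    problem_statement: str   # task description shown to the student
    student_code: str        # the (possibly incorrect) solution to review

def generate_feedback(sub: Submission, model: str = "gpt-4o") -> str:
    """Ask the model for feedback hints without revealing a full solution."""
    prompt = (
        "You are a programming tutor. Review the student's solution to the "
        "task below. Point out reasoning errors, cite the relevant lines, "
        "and explain each issue briefly. Do not write the corrected code.\n\n"
        f"Task:\n{sub.problem_statement}\n\n"
        f"Student solution:\n{sub.student_code}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# The returned hints would then be labeled by human raters, e.g. as
# "accurate and complete" versus "incorrect line", "flawed explanation",
# or "hallucinated issue", and the proportions aggregated per model
# over the benchmark of 45 solutions.
```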
Related papers
- DeepCritic: Deliberate Critique with Large Language Models [77.5516314477878]
We focus on studying and enhancing the math critique ability of Large Language Models (LLMs).
Our developed critique model built on Qwen2.5-7B-Instruct significantly outperforms existing LLM critics on various error identification benchmarks.
arXiv Detail & Related papers (2025-05-01T17:03:17Z)
- Generating Planning Feedback for Open-Ended Programming Exercises with LLMs [1.2499537119440245]
Large language models (LLMs) may be able to generate feedback by detecting the overall code structure even for submissions with syntax errors.
We show that both the full GPT-4o model and a small variant (GPT-4o-mini) can detect these plans with remarkable accuracy.
LLMs may be useful in providing feedback for problems in other domains where students start with a set of high-level solution steps.
arXiv Detail & Related papers (2025-04-11T20:26:49Z)
- LLM-based Cognitive Models of Students with Misconceptions [55.29525439159345]
This paper investigates whether Large Language Models (LLMs) can be instruction-tuned to meet this dual requirement.
We introduce MalAlgoPy, a novel Python library that generates datasets reflecting authentic student solution patterns.
Our insights enhance our understanding of AI-based student models and pave the way for effective adaptive learning systems.
arXiv Detail & Related papers (2024-10-16T06:51:09Z)
- Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors [78.53699244846285]
Large language models (LLMs) present an opportunity to scale high-quality personalized education to all.
LLMs struggle to precisely detect students' errors and tailor their feedback to these errors.
Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions.
arXiv Detail & Related papers (2024-07-12T10:11:40Z)
- Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction [35.01097297297534]
Existing evaluations of Large Language Models (LLMs) focus on problem-solving from the examinee perspective.
We define four evaluation tasks for error identification and correction along with a new dataset with annotated error types and steps.
Our principal findings indicate that GPT-4 outperforms all models, while the open-source model LLaMA-2-7B demonstrates abilities comparable to the closed-source models GPT-3.5 and Gemini Pro.
arXiv Detail & Related papers (2024-06-02T14:16:24Z)
- Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge [4.981275578987307]
Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts.
However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models.
This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied.
arXiv Detail & Related papers (2024-05-08T17:57:39Z)
- Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs).
This work explores whether small (≤ 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z)
- Feedback-Generation for Programming Exercises With GPT-4 [0.0]
This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
arXiv Detail & Related papers (2024-03-07T12:37:52Z)
- Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [46.667783153759636]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
- Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z)
- Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey [21.01561950216472]
Modern language models (LMs) have been successfully employed in source code generation and understanding.
Despite their great potential, language models for code intelligence (LM4Code) are susceptible to potential pitfalls.
arXiv Detail & Related papers (2023-10-27T05:32:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.