Feedback-Generation for Programming Exercises With GPT-4
- URL: http://arxiv.org/abs/2403.04449v2
- Date: Thu, 4 Jul 2024 07:30:22 GMT
- Title: Feedback-Generation for Programming Exercises With GPT-4
- Authors: Imen Azaiz, Natalie Kiesler, Sven Strickroth
- Abstract summary: This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ever since Large Language Models (LLMs) and related applications have become broadly available, several studies have investigated their potential for assisting educators and supporting students in higher education. LLMs such as Codex, GPT-3.5, and GPT-4 have shown promising results in the context of large programming courses, where students can benefit from feedback and hints if they are provided in a timely manner and at scale. This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input. Two assignments from an introductory programming course were selected, and GPT-4 was asked to generate feedback for 55 randomly chosen, authentic student programming submissions. The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material. Compared to prior work and analyses of GPT-3.5, GPT-4 Turbo shows notable improvements. For example, the output is more structured and consistent. GPT-4 Turbo can also accurately identify invalid casing in student programs' output. In some cases, the feedback also includes the output of the student program. At the same time, inconsistent feedback was noted, such as stating that the submission is correct but an error needs to be fixed. The present work increases our understanding of LLMs' potential, limitations, and how to integrate them into e-assessment systems, pedagogical scenarios, and instructing students who are using applications based on GPT-4.
Related papers
- See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses [51.975495361024606]
We propose a Self-Challenge evaluation framework with human-in-the-loop.
Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances.
We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
arXiv Detail & Related papers (2024-08-16T19:01:52Z) - Leveraging Lecture Content for Improved Feedback: Explorations with GPT-4 and Retrieval Augmented Generation [0.0]
This paper presents the use of Retrieval Augmented Generation to improve the feedback generated by Large Language Models for programming tasks.
Corresponding lecture recordings were transcribed and made available to the Large Language Model GPT-4 as an external knowledge source.
The purpose of this is to prevent hallucinations and to enforce the use of the technical terms and phrases from the lecture.
arXiv Detail & Related papers (2024-05-05T18:32:06Z) - Evaluating the Application of Large Language Models to Generate Feedback in Programming Education [0.0]
This study investigates the application of large language models, specifically GPT-4, to enhance programming education.
The research outlines the design of a web application that uses GPT-4 to provide feedback on programming tasks, without giving away the solution.
arXiv Detail & Related papers (2024-03-13T23:14:35Z) - Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL).
Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
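The DPO training mentioned above optimizes a model directly on preference pairs rather than via a separate reward model. The following is a minimal toy sketch of the per-pair DPO loss; all names, inputs, and the beta value are illustrative assumptions, not details taken from the paper:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (toy scalar version).

    logp_* are the policy's log-probabilities of the preferred
    (chosen) and dispreferred (rejected) feedback texts; ref_logp_*
    are the same quantities under the frozen reference model.
    """
    # How much the policy has shifted relative to the reference
    # model on each response.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # -log(sigmoid(z)) written as log1p(exp(-z)) for stability.
    z = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-z))
```

When the policy matches the reference model, the loss equals log(2); it decreases as the policy places relatively more probability on the preferred feedback. Real implementations compute these log-probabilities over token sequences and average the loss over a batch.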
arXiv Detail & Related papers (2024-03-02T20:25:50Z) - Real Customization or Just Marketing: Are Customized Versions of ChatGPT Useful? [0.0]
OpenAI has made it possible to fine-tune its model through a natural-language web interface.
This research assesses the potential of the customized GPTs that have recently been launched by OpenAI.
arXiv Detail & Related papers (2023-11-27T15:46:15Z) - GPT-4 as an interface between researchers and computational software: improving usability and reproducibility [44.99833362998488]
We focus on a widely used software for molecular dynamics simulations.
We quantify the usefulness of input files generated by GPT-4 from task descriptions in English.
We find that GPT-4 can generate correct and ready-to-use input files for relatively simple tasks.
In addition, GPT-4's description of computational tasks from input files can be tuned from a detailed set of step-by-step instructions to a summary description appropriate for publications.
arXiv Detail & Related papers (2023-10-04T14:25:39Z) - Large Language Models (GPT) for automating feedback on programming assignments [0.0]
We employ OpenAI's GPT-3.5 model to generate personalized hints for students solving programming assignments.
Students rated the usefulness of GPT-generated hints positively.
arXiv Detail & Related papers (2023-06-30T21:57:40Z) - Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses [0.0]
GPT models evolved from completely failing the typical programming class's assessments to confidently passing the courses with no human involvement.
This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use technology that can be utilized by learners to collect passing scores.
arXiv Detail & Related papers (2023-06-15T22:12:34Z) - Generalized Planning in PDDL Domains with Pretrained Large Language
Models [82.24479434984426]
We consider PDDL domains and use GPT-4 to synthesize Python programs.
We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines.
arXiv Detail & Related papers (2023-05-18T14:48:20Z) - Instruction Tuning with GPT-4 [107.55078894215798]
We present the first attempt to use GPT-4 to generate instruction-following data for finetuning large language models.
Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks.
arXiv Detail & Related papers (2023-04-06T17:58:09Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.