Evaluating the Application of Large Language Models to Generate Feedback in Programming Education
- URL: http://arxiv.org/abs/2403.09744v1
- Date: Wed, 13 Mar 2024 23:14:35 GMT
- Title: Evaluating the Application of Large Language Models to Generate Feedback in Programming Education
- Authors: Sven Jacobs, Steffen Jaschke
- Abstract summary: This study investigates the application of large language models, specifically GPT-4, to enhance programming education.
The research outlines the design of a web application that uses GPT-4 to provide feedback on programming tasks without giving away the solution.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study investigates the application of large language models, specifically GPT-4, to enhance programming education. The research outlines the design of a web application that uses GPT-4 to provide feedback on programming tasks, without giving away the solution. A web application for working on programming tasks was developed for the study and evaluated with 51 students over the course of one semester. The results show that most of the feedback generated by GPT-4 effectively addressed code errors. However, challenges with incorrect suggestions and hallucinated issues indicate the need for further improvements.
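The paper does not reproduce its implementation here, but the idea of soliciting solution-free feedback is easy to sketch. The snippet below is a minimal illustration using the OpenAI Python client; the model name, prompt wording, and generate_feedback helper are assumptions, not the authors' code.

```python
# Hypothetical sketch: ask GPT-4 for feedback on a student's submission
# without revealing a solution. The prompt wording is an assumption,
# not the prompt used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a programming tutor. Point out errors in the student's code "
    "and give hints, but never reveal a complete or corrected solution."
)

def generate_feedback(task_description: str, student_code: str) -> str:
    """Return solution-free feedback for one submission."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Task:\n{task_description}\n\n"
                                        f"Student code:\n{student_code}"},
        ],
    )
    return response.choices[0].message.content
```

Placing the no-solution constraint in the system message rather than the user message tends to make it harder for a submission to override it.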
Related papers
- Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders across several prompting techniques; a hypothetical sketch of such a grading call follows this entry.
arXiv Detail & Related papers (2024-11-07T22:51:47Z)
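As a rough illustration of the grading setup above, the sketch below sends the exam question plus an image of the handwritten response to GPT-4o; the rubric wording and the grade_handwritten helper are hypothetical, not the paper's prompts.

```python
# Hypothetical sketch: grade a handwritten response with a multimodal model.
# Rubric wording and image handling are assumptions, not the paper's setup.
import base64
from openai import OpenAI

client = OpenAI()

def grade_handwritten(question: str, image_path: str) -> str:
    """Ask GPT-4o for a rubric-based score of a handwritten answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Question: {question}\n"
                         "Grade the handwritten answer in the image from "
                         "0 to 10 and justify the score."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```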
- Feedback-Generation for Programming Exercises With GPT-4 [0.0]
This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
arXiv Detail & Related papers (2024-03-07T12:37:52Z)
- Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL).
Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO); a generic sketch of the DPO loss follows this entry.
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
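The DPO objective mentioned in this entry has a compact closed form. The PyTorch sketch below is a generic rendering of the standard DPO loss, not the authors' training code; it assumes summed token log-probabilities for the preferred and rejected feedback under the trained policy and a frozen reference model.

```python
# Generic sketch of the direct preference optimization (DPO) loss,
# not the authors' training code. Inputs are summed token log-probs of
# the chosen (preferred) and rejected feedback under the policy being
# trained and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: push the policy to prefer chosen over rejected."""
    # Implicit rewards: log-ratio of policy to reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```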
- Comparing large language models and human programmers for generating programming code [0.0]
GPT-4 substantially outperforms other large language models, including Gemini Ultra and Claude 2.
In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 employing the optimal prompt strategy outperforms 85 percent of human participants.
arXiv Detail & Related papers (2024-03-01T14:43:06Z)
- A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translating code from one programming language to another programming or natural language and back; a hypothetical sketch of the idea follows this entry.
arXiv Detail & Related papers (2024-01-15T22:36:31Z)
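The round-trip idea from this entry can be illustrated in a few lines: translate buggy code into a natural-language description and then back into code, letting the model's prior smooth out the defect. The round_trip_repair helper and prompts below are hypothetical, using the OpenAI Python client rather than the paper's pipeline.

```python
# Hypothetical sketch of round-trip translation (RTT) for program repair:
# code -> natural-language description -> code. Prompts are assumptions,
# not the paper's pipeline.
from openai import OpenAI

client = OpenAI()

def _ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def round_trip_repair(buggy_code: str) -> str:
    """Translate code to English and back, which can drop small defects."""
    description = _ask(
        "Describe precisely, in English, what this code is supposed to do:\n"
        f"{buggy_code}"
    )
    return _ask(
        "Write a Python function implementing this description. "
        f"Return only code:\n{description}"
    )
```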
- GPT-4 as an interface between researchers and computational software: improving usability and reproducibility [44.99833362998488]
We focus on a widely used software package for molecular dynamics simulations.
We quantify the usefulness of input files generated by GPT-4 from task descriptions in English.
We find that GPT-4 can generate correct and ready-to-use input files for relatively simple tasks.
In addition, GPT-4's description of computational tasks from input files can be tuned from a detailed set of step-by-step instructions to a summary description appropriate for publications.
arXiv Detail & Related papers (2023-10-04T14:25:39Z)
- Evaluating ChatGPT and GPT-4 for Visual Programming [20.64766977405438]
We evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, in visual programming domains for various scenarios.
Our results show that these models perform poorly and struggle to combine spatial, logical, and programming skills crucial for visual programming.
arXiv Detail & Related papers (2023-07-30T22:13:20Z)
- Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors [21.227955181065948]
We systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios.
Our results show that GPT-4 drastically outperforms ChatGPT and comes close to human tutors' performance in several scenarios.
arXiv Detail & Related papers (2023-06-29T17:57:40Z)
- Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses [0.0]
GPT models evolved from completely failing a typical programming class's assessments to confidently passing the courses with no human involvement.
This study provides evidence that programming instructors need to prepare for a world in which there is easy-to-use technology that learners can use to collect passing scores.
arXiv Detail & Related papers (2023-06-15T22:12:34Z)
- Generalized Planning in PDDL Domains with Pretrained Large Language Models [82.24479434984426]
We consider PDDL domains and use GPT-4 to synthesize Python programs.
We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines.
arXiv Detail & Related papers (2023-05-18T14:48:20Z)
- Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z)