AI-enhanced Auto-correction of Programming Exercises: How Effective is
GPT-3.5?
- URL: http://arxiv.org/abs/2311.10737v1
- Date: Tue, 24 Oct 2023 10:35:36 GMT
- Title: AI-enhanced Auto-correction of Programming Exercises: How Effective is
GPT-3.5?
- Authors: Imen Azaiz, Oliver Deckarm, Sven Strickroth
- Abstract summary: This paper investigates the potential of AI in providing personalized code correction and generating feedback.
GPT-3.5 exhibited weaknesses in its evaluation, including localization of errors that were not the actual errors, or even hallucinated errors.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Timely formative feedback is considered one of the most important drivers
for effective learning. Delivering timely and individualized feedback is
particularly challenging in large classes in higher education. Recently, Large
Language Models such as GPT-3 became publicly available and showed
promising results on various tasks such as code generation and code
explanation. This paper investigates the potential of AI in providing
personalized code correction and generating feedback. Based on existing student
submissions of two different real-world assignments, the correctness of the
AI-aided e-assessment as well as the characteristics such as fault
localization, correctness of hints, and code style suggestions of the generated
feedback are investigated. The results show that 73 % of the submissions were
correctly identified as either correct or incorrect. In 59 % of these cases,
GPT-3.5 also successfully generated effective and high-quality feedback.
Additionally, GPT-3.5 exhibited weaknesses in its evaluation, including
localization of errors that were not the actual errors, or even hallucinated
errors. Implications and potential new usage scenarios are discussed.
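
As a rough illustration of how the two reported percentages relate, the sketch below computes a classification accuracy and a conditional feedback rate from graded submissions, then combines the abstract's figures. It assumes that "these cases" refers to the correctly classified submissions, so that about 0.73 × 0.59 ≈ 43 % of all submissions received both a right verdict and effective feedback. The `Judgement` type, its fields, and the helper functions are hypothetical names chosen for illustration; this is not the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """One graded submission (hypothetical illustration data, not from the paper)."""
    truly_correct: bool       # ground truth from human graders
    model_says_correct: bool  # verdict produced by the language model
    feedback_effective: bool  # human rating of the generated feedback


def classification_accuracy(items):
    """Share of submissions whose correct/incorrect verdict matches ground truth."""
    hits = [j for j in items if j.truly_correct == j.model_says_correct]
    return len(hits) / len(items)


def effective_feedback_rate(items):
    """Share of correctly classified submissions that also got effective feedback."""
    hits = [j for j in items if j.truly_correct == j.model_says_correct]
    return sum(j.feedback_effective for j in hits) / len(hits)


def joint_rate(accuracy, conditional_rate):
    """Share of all submissions with a right verdict and effective feedback."""
    return accuracy * conditional_rate


# Percentages reported in the abstract: 73 % and 59 %.
overall = joint_rate(0.73, 0.59)
print(f"{overall:.1%}")  # 43.1%
```

Under this reading, fewer than half of all submissions ended up with both a correct verdict and effective feedback, which is a useful figure to keep in mind when interpreting the headline numbers.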
Related papers
- LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z)
- How Can I Improve? Using GPT to Highlight the Desired and Undesired Parts of Open-ended Responses [11.809647985607935]
We explore a sequence labeling approach focused on identifying components of desired and less desired praise for providing explanatory feedback.
To quantify the quality of highlighted praise components identified by GPT models, we introduced a Modified Intersection over Union (M-IoU) score.
Our findings demonstrate that: (1) the M-IoU score effectively correlates with human judgment in evaluating sequence quality; (2) using two-shot prompting on GPT-3.5 resulted in decent performance in recognizing effort-based and outcome-based praise; and (3) our optimally fine-tuned GPT-3.5 model achieved M-IoU scores of 0.6
arXiv Detail & Related papers (2024-05-01T02:59:10Z)
- Feedback-Generation for Programming Exercises With GPT-4 [0.0]
This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
arXiv Detail & Related papers (2024-03-07T12:37:52Z)
- Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL).
Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance large language models alignment.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z)
- Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation [25.317788211120362]
We investigate the role of generative AI models in providing human tutor-style programming hints.
Recent works have benchmarked state-of-the-art models for various feedback generation scenarios.
We develop a novel technique, GPT4Hints-GPT3.5Val, to push the limits of generative AI models.
arXiv Detail & Related papers (2023-10-05T17:02:59Z)
- Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias [57.42417061979399]
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically.
In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs.
Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families.
arXiv Detail & Related papers (2023-08-01T01:39:25Z)
- Large Language Models (GPT) for automating feedback on programming assignments [0.0]
We employ OpenAI's GPT-3.5 model to generate personalized hints for students solving programming assignments.
Students rated the usefulness of GPT-generated hints positively.
arXiv Detail & Related papers (2023-06-30T21:57:40Z)
- The Unreliability of Explanations in Few-Shot In-Context Learning [50.77996380021221]
We focus on two NLP tasks that involve reasoning over text, namely question answering and natural language inference.
We show that explanations judged as good by humans--those that are logically consistent with the input--usually indicate more accurate predictions.
We present a framework for calibrating model predictions based on the reliability of the explanations.
arXiv Detail & Related papers (2022-05-06T17:57:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.