Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher
Education Programming Courses?
- URL: http://arxiv.org/abs/2303.09325v1
- Date: Thu, 16 Mar 2023 13:58:45 GMT
- Title: Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher
Education Programming Courses?
- Authors: Jaromir Savelka, Arav Agarwal, Christopher Bogart, Yifan Song, Majd
Sakr
- Abstract summary: We evaluated the capability of generative pre-trained transformers (GPT) to pass assessments in Python programming courses at the postsecondary level.
We studied if and how successfully GPT models leverage feedback provided by an auto-grader.
It is clear that a straightforward application of these easily accessible models could enable a learner to obtain a non-trivial portion of the overall available score.
- Score: 6.2122699483618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We evaluated the capability of generative pre-trained transformers (GPT) to
pass assessments in introductory and intermediate Python programming courses at
the postsecondary level. Discussions of potential uses (e.g., exercise
generation, code explanation) and misuses (e.g., cheating) of this emerging
technology in programming education have intensified, but to date there has not
been a rigorous analysis of the models' capabilities in the realistic context
of a full-fledged programming course with a diverse set of assessment
instruments. We evaluated GPT on three Python courses that employ assessments
ranging from simple multiple-choice questions (no code involved) to complex
programming projects with code bases distributed into multiple files (599
exercises overall). Further, we studied if and how successfully GPT models
leverage feedback provided by an auto-grader. We found that the current models
are not capable of passing the full spectrum of assessments typically involved
in a Python programming course (<70% on even entry-level modules). Yet, it is
clear that a straightforward application of these easily accessible models
could enable a learner to obtain a non-trivial portion of the overall available
score (>55%) in introductory and intermediate courses alike. While the models
exhibit remarkable capabilities, including correcting solutions based on
the auto-grader's feedback, some limitations exist (e.g., poor handling of
exercises requiring complex chains of reasoning steps). These findings can be
leveraged by instructors wishing to adapt their assessments so that GPT becomes
a valuable assistant for a learner as opposed to an end-to-end solution.
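As a concrete illustration of the auto-grader feedback loop studied in the abstract, the sketch below wires a chat model to a toy grading function. It assumes an OpenAI-style chat-completions client; the exercise text, the model name, and the run_autograder() helper are hypothetical placeholders, not the authors' actual evaluation pipeline.
```python
# Minimal sketch of a GPT + auto-grader feedback loop (assumed setup,
# not the paper's code): the model proposes a solution, a toy grader
# runs it, and any failure message is fed back for another attempt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXERCISE = "Write a function `mean(xs)` that returns the average of a list of numbers."

def run_autograder(code: str) -> tuple[bool, str]:
    """Hypothetical stand-in for a course auto-grader: executes the
    submission against a few unit tests and returns (passed, feedback)."""
    env: dict = {}
    try:
        exec(code, env)  # assumes the reply is plain code, no markdown fences
        assert env["mean"]([1, 2, 3]) == 2
        assert env["mean"]([10]) == 10
        return True, "All tests passed."
    except Exception as exc:  # collect the error as textual feedback
        return False, f"Auto-grader output: {exc!r}"

messages = [{"role": "user",
             "content": f"Solve this Python exercise. Reply with code only.\n\n{EXERCISE}"}]
for attempt in range(3):  # give the model a few chances to use the feedback
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    code = reply.choices[0].message.content
    passed, feedback = run_autograder(code)
    if passed:
        break
    # Feed the auto-grader output back so the model can revise its solution.
    messages += [
        {"role": "assistant", "content": code},
        {"role": "user",
         "content": f"Your solution failed. {feedback} Fix it and reply with code only."},
    ]
```
In practice a response would also need its markdown fences stripped and would be executed in a sandbox rather than with exec(); the loop structure is the point here.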
Related papers
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [175.9723801486487]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions.
GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer across at least one prompting strategy for 85.1% of questions.
Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z) - From GPT-3 to GPT-4: On the Evolving Efficacy of LLMs to Answer
Multiple-choice Questions for Programming Classes in Higher Education [2.6626950367610402]
We explore the evolving efficacy of three generative pre-trained transformer (GPT) models in generating answers for multiple-choice questions (MCQs).
We focus on the differences in the models' capabilities prior to the release of ChatGPT (Nov '22), at the time of the release, and today (i.e., Aug '23).
arXiv Detail & Related papers (2023-11-16T02:46:15Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - Thrilled by Your Progress! Large Language Models (GPT-4) No Longer
Struggle to Pass Assessments in Higher Education Programming Courses [0.0]
GPT models evolved from completely failing the typical programming class's assessments to confidently passing the courses with no human involvement.
This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use technology that learners can use to collect passing scores.
arXiv Detail & Related papers (2023-06-15T22:12:34Z) - Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions
about Code [0.0]
We analyzed the effectiveness of three generative pre-trained transformer (GPT) models in answering multiple-choice question (MCQ) assessments.
These findings can be leveraged by educators to adapt their instructional practices and assessments in programming courses.
arXiv Detail & Related papers (2023-03-09T16:52:12Z) - Learning Label Modular Prompts for Text Classification in the Wild [56.66187728534808]
We propose text classification in-the-wild, which introduces different non-stationary training/testing stages.
Decomposing a complex task into modular components can enable robust generalisation in such a non-stationary environment.
We propose MODULARPROMPT, a label-modular prompt tuning framework for text classification tasks.
arXiv Detail & Related papers (2022-11-30T16:26:38Z) - problexity -- an open-source Python library for binary classification
problem complexity assessment [0.0]
Assessing a classification problem's complexity is an essential element of many topics in the supervised learning domain.
The tools currently available to the academic community for calculating problem complexity measures exist only as C++ and R libraries.
This paper describes a software module that allows the estimation of 22 complexity measures in the Python language.
arXiv Detail & Related papers (2022-07-14T07:32:15Z) - CodeRL: Mastering Code Generation through Pretrained Models and Deep
Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z) - ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback [54.142719510638614]
In this paper, we frame the problem of providing feedback as few-shot classification.
A meta-learner adapts to give feedback on student code for a new programming question from just a few examples provided by instructors.
Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university.
arXiv Detail & Related papers (2021-07-23T22:41:28Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
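The APPS entry above reports results as the fraction of test cases passed; a minimal sketch of that style of test-case grading follows. The candidate program and the input/output pairs are made-up examples, not APPS data.
```python
# Sketch of an APPS-style metric: run a candidate program on each input
# and count the fraction of test cases whose output matches the expected
# output. The candidate code and test cases below are illustrative only.
import subprocess
import sys

CANDIDATE = "print(sum(int(x) for x in input().split()))"

TEST_CASES = [  # (stdin, expected stdout) pairs
    ("1 2 3\n", "6"),
    ("10 -4\n", "6"),
    ("5\n", "5"),
]

def test_case_pass_rate(code: str, cases) -> float:
    passed = 0
    for stdin, expected in cases:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin, capture_output=True, text=True, timeout=5,
        )
        if result.stdout.strip() == expected.strip():
            passed += 1
    return passed / len(cases)

print(f"pass rate: {test_case_pass_rate(CANDIDATE, TEST_CASES):.0%}")
```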
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.