Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher
Education Programming Courses?
- URL: http://arxiv.org/abs/2303.09325v1
- Date: Thu, 16 Mar 2023 13:58:45 GMT
- Title: Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher
Education Programming Courses?
- Authors: Jaromir Savelka, Arav Agarwal, Christopher Bogart, Yifan Song, Majd
Sakr
- Abstract summary: We evaluated the capability of generative pre-trained transformers (GPT) to pass assessments in Python programming courses at the postsecondary level.
We studied if and how successfully GPT models leverage feedback provided by an auto-grader.
It is clear that a straightforward application of these easily accessible models could enable a learner to obtain a non-trivial portion of the overall available score.
- Score: 6.2122699483618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We evaluated the capability of generative pre-trained transformers (GPT) to
pass assessments in introductory and intermediate Python programming courses at
the postsecondary level. Discussions of potential uses (e.g., exercise
generation, code explanation) and misuses (e.g., cheating) of this emerging
technology in programming education have intensified, but to date there has not
been a rigorous analysis of the models' capabilities in the realistic context
of a full-fledged programming course with a diverse set of assessment
instruments. We evaluated GPT on three Python courses that employ assessments
ranging from simple multiple-choice questions (no code involved) to complex
programming projects with code bases distributed into multiple files (599
exercises overall). Further, we studied if and how successfully GPT models
leverage feedback provided by an auto-grader. We found that the current models
are not capable of passing the full spectrum of assessments typically involved
in a Python programming course (<70% on even entry-level modules). Yet, it is
clear that a straightforward application of these easily accessible models
could enable a learner to obtain a non-trivial portion of the overall available
score (>55%) in introductory and intermediate courses alike. While the models
exhibit remarkable capabilities, including correcting solutions based on
the auto-grader's feedback, some limitations exist (e.g., poor handling of
exercises requiring complex chains of reasoning steps). These findings can be
leveraged by instructors wishing to adapt their assessments so that GPT becomes
a valuable assistant for a learner as opposed to an end-to-end solution.
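As a concrete illustration of the auto-grader feedback loop studied in the abstract, the sketch below wires a chat model to a toy grading function. It assumes an OpenAI-style chat-completions client; the exercise text, the model name, and the run_autograder() helper are hypothetical placeholders, not the authors' actual evaluation pipeline.
```python
# Minimal sketch of a GPT + auto-grader feedback loop (assumed setup,
# not the paper's code): the model proposes a solution, a toy grader
# runs it, and any failure message is fed back for another attempt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXERCISE = "Write a function `mean(xs)` that returns the average of a list of numbers."

def run_autograder(code: str) -> tuple[bool, str]:
    """Hypothetical stand-in for a course auto-grader: executes the
    submission against a few unit tests and returns (passed, feedback)."""
    env: dict = {}
    try:
        exec(code, env)  # assumes the reply is plain code, no markdown fences
        assert env["mean"]([1, 2, 3]) == 2
        assert env["mean"]([10]) == 10
        return True, "All tests passed."
    except Exception as exc:  # collect the error as textual feedback
        return False, f"Auto-grader output: {exc!r}"

messages = [{"role": "user",
             "content": f"Solve this Python exercise. Reply with code only.\n\n{EXERCISE}"}]
for attempt in range(3):  # give the model a few chances to use the feedback
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    code = reply.choices[0].message.content
    passed, feedback = run_autograder(code)
    if passed:
        break
    # Feed the auto-grader output back so the model can revise its solution.
    messages += [
        {"role": "assistant", "content": code},
        {"role": "user",
         "content": f"Your solution failed. {feedback} Fix it and reply with code only."},
    ]
```
In practice a response would also need its markdown fences stripped and would be executed in a sandbox rather than with exec(); the loop structure is the point here.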
Related papers
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [175.9723801486487]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions.
GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer across at least one prompting strategy for 85.1% of questions.
Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z) - From GPT-3 to GPT-4: On the Evolving Efficacy of LLMs to Answer
Multiple-choice Questions for Programming Classes in Higher Education [2.6626950367610402]
We explore the evolving efficacy of three generative pre-trained transformer (GPT) models in generating answers for multiple-choice questions (MCQs).
We focus on the differences in the models' capabilities prior to the release of ChatGPT (Nov '22), at the time of the release, and today (i.e., Aug '23).
arXiv Detail & Related papers (2023-11-16T02:46:15Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - Thrilled by Your Progress! Large Language Models (GPT-4) No Longer
Struggle to Pass Assessments in Higher Education Programming Courses [0.0]
GPT models evolved from completely failing the typical programming class's assessments to confidently passing the courses with no human involvement.
This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use technology that learners can use to collect passing scores.
arXiv Detail & Related papers (2023-06-15T22:12:34Z) - Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions
about Code [0.0]
We analyzed the effectiveness of three generative pre-trained transformer (GPT) models in answering multiple-choice question (MCQ) assessments.
These findings can be leveraged by educators to adapt their instructional practices and assessments in programming courses.
arXiv Detail & Related papers (2023-03-09T16:52:12Z) - Learning Label Modular Prompts for Text Classification in the Wild [56.66187728534808]
We propose text classification in-the-wild, which introduces different non-stationary training/testing stages.
Decomposing a complex task into modular components can enable robust generalisation in such a non-stationary environment.
We propose MODULARPROMPT, a label-modular prompt tuning framework for text classification tasks.
arXiv Detail & Related papers (2022-11-30T16:26:38Z) - problexity -- an open-source Python library for binary classification
problem complexity assessment [0.0]
Assessing a classification problem's complexity is an essential element of many topics in the supervised learning domain.
The tools currently available to the academic community for calculating problem complexity measures exist only as C++ and R libraries.
This paper describes a software module that allows the estimation of 22 complexity measures in the Python language.
arXiv Detail & Related papers (2022-07-14T07:32:15Z) - CodeRL: Mastering Code Generation through Pretrained Models and Deep
Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z) - ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback [54.142719510638614]
In this paper, we frame the problem of providing feedback as few-shot classification.
A meta-learner adapts to give feedback on student code for a new programming question from just a few examples provided by instructors.
Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university.
arXiv Detail & Related papers (2021-07-23T22:41:28Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
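The APPS entry above reports results as the fraction of test cases passed; a minimal sketch of that style of test-case grading follows. The candidate program and the input/output pairs are made-up examples, not APPS data.
```python
# Sketch of an APPS-style metric: run a candidate program on each input
# and count the fraction of test cases whose output matches the expected
# output. The candidate code and test cases below are illustrative only.
import subprocess
import sys

CANDIDATE = "print(sum(int(x) for x in input().split()))"

TEST_CASES = [  # (stdin, expected stdout) pairs
    ("1 2 3\n", "6"),
    ("10 -4\n", "6"),
    ("5\n", "5"),
]

def test_case_pass_rate(code: str, cases) -> float:
    passed = 0
    for stdin, expected in cases:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin, capture_output=True, text=True, timeout=5,
        )
        if result.stdout.strip() == expected.strip():
            passed += 1
    return passed / len(cases)

print(f"pass rate: {test_case_pass_rate(CANDIDATE, TEST_CASES):.0%}")
```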
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.