From GPT-3 to GPT-4: On the Evolving Efficacy of LLMs to Answer Multiple-choice Questions for Programming Classes in Higher Education
- URL: http://arxiv.org/abs/2311.09518v1
- Date: Thu, 16 Nov 2023 02:46:15 GMT
- Title: From GPT-3 to GPT-4: On the Evolving Efficacy of LLMs to Answer Multiple-choice Questions for Programming Classes in Higher Education
- Authors: Jaromir Savelka, Arav Agarwal, Christopher Bogart, Majd Sakr
- Abstract summary: We explore the evolving efficacy of three generative pre-trained transformer (GPT) models in generating answers for multiple-choice questions (MCQs).
We focus on the differences in the capabilities of the models prior to the release of ChatGPT (Nov '22), at the time of the release, and today (i.e., Aug '23).
- Score: 2.6626950367610402
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore the evolving efficacy of three generative pre-trained transformer (GPT) models in generating answers for multiple-choice questions (MCQs) from introductory and intermediate Python programming courses in higher education. We focus on the differences in the models' capabilities prior to the release of ChatGPT (Nov '22), at the time of the release, and today (i.e., Aug '23). Recent studies have established that the ability of OpenAI's GPT models to handle assessments originally designed for humans keeps increasing as newer, more capable models are released. However, the qualitative differences in the capabilities and limitations of these models to reason about and/or analyze programming MCQs have been under-explored. We evaluated three of OpenAI's GPT models on formative and summative MCQ assessments from three Python courses (530 questions), focusing on the qualitative differences in the evolving efficacy of the successive models. This study provides further evidence of, and insight into, the trajectory of current developments: there already exists technology that students can use to collect passing scores, with no effort whatsoever, on what today count as viable assessments of programming knowledge and skills. Educators and institutions could leverage this study to better understand recent technological developments, to adapt the design of programming assessments, and to fuel the necessary discussions of how assessments in future programming classes should be updated.
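The abstract does not reproduce the authors' evaluation harness; purely as an illustration, a minimal sketch of such an MCQ evaluation with the OpenAI Python client might look as follows. The model name, sample question, prompt wording, and answer-extraction rule are all assumptions, not the paper's actual setup.

```python
# Hypothetical MCQ evaluation sketch; model name, question data,
# prompt, and answer extraction are assumptions, not the authors'
# actual harness.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

questions = [
    {
        "stem": "What is the value of len('abc') in Python?",
        "choices": {"A": "2", "B": "3", "C": "4", "D": "It raises a TypeError"},
        "correct": "B",
    },
]

def ask(question: dict, model: str = "gpt-4") -> str:
    """Present one MCQ and return the option letter the model picks."""
    options = "\n".join(f"{k}) {v}" for k, v in question["choices"].items())
    prompt = (
        "Answer the following multiple-choice question about Python.\n"
        f"{question['stem']}\n{options}\n"
        "Reply with the letter of the correct option only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Take the first character so replies like "B) 3" still score.
    return response.choices[0].message.content.strip()[:1].upper()

correct = sum(ask(q) == q["correct"] for q in questions)
print(f"Accuracy: {correct}/{len(questions)}")
```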
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [175.9723801486487]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions.
GPT-4 answers an average of 65.8% of questions correctly, and produces the correct answer under at least one prompting strategy for 85.1% of questions.
Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation [25.317788211120362]
We investigate the role of generative AI models in providing human tutor-style programming hints.
Recent works have benchmarked state-of-the-art models for various feedback generation scenarios.
We develop a novel technique, GPT4Hints-GPT3.5Val, to push the limits of generative AI models.
arXiv Detail & Related papers (2023-10-05T17:02:59Z)
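As a rough illustration of the tutor/student split named in the entry above, a sketch of the generate-then-validate pattern could look like this. The prompts, model identifiers, and acceptance check are assumptions, not the authors' GPT4Hints-GPT3.5Val technique.

```python
# Hypothetical generate-then-validate sketch: the stronger model
# drafts a tutor-style hint, the weaker model plays the student to
# check whether the hint actually leads to a fix. All prompts and
# the acceptance check are assumptions.
from openai import OpenAI

client = OpenAI()

def chat(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def generate_hint(problem: str, buggy_code: str) -> str:
    """GPT-4 as the tutor: produce a hint without giving away the fix."""
    return chat(
        "gpt-4",
        f"A student solving this task:\n{problem}\n"
        f"wrote this buggy code:\n{buggy_code}\n"
        "Give one tutor-style hint without revealing the full solution.",
    )

def hint_is_useful(problem: str, buggy_code: str, hint: str) -> bool:
    """GPT-3.5 as the simulated student: does the hint lead to working code?"""
    repaired = chat(
        "gpt-3.5-turbo",
        f"Task:\n{problem}\nYour buggy code:\n{buggy_code}\n"
        f"A tutor hints: {hint}\nApply the hint and output corrected code only.",
    )
    return compiles(repaired)

def compiles(code: str) -> bool:
    """Weak placeholder check; a real pipeline would run the task's tests."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False
```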
- PILOT: A Pre-Trained Model-Based Continual Learning Toolbox [71.63186089279218]
This paper introduces a pre-trained model-based continual learning toolbox known as PILOT.
On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt.
On the other hand, PILOT fits typical class-incremental learning algorithms within the context of pre-trained models to evaluate their effectiveness.
arXiv Detail & Related papers (2023-09-13T17:55:11Z)
- Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors [21.227955181065948]
We systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios.
Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios.
arXiv Detail & Related papers (2023-06-29T17:57:40Z)
- Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses [0.0]
GPT models evolved from completely failing the typical programming class's assessments to confidently passing the courses with no human involvement.
This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use technology that can be utilized by learners to collect passing scores.
arXiv Detail & Related papers (2023-06-15T22:12:34Z)
- Practical and Ethical Challenges of Large Language Models in Education: A Systematic Scoping Review [5.329514340780243]
Large language models (LLMs) have the potential to automate the laborious process of generating and analysing textual content.
There are concerns regarding the practicality and ethicality of these innovations.
We conducted a systematic scoping review of 118 peer-reviewed papers published since 2017 to pinpoint the current state of research.
arXiv Detail & Related papers (2023-03-17T18:14:46Z)
- Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses? [6.2122699483618]
We evaluated the capability of generative pre-trained transformers (GPT) to pass assessments in Python programming courses at the postsecondary level.
We studied whether, and how successfully, GPT models leverage feedback provided by an auto-grader.
It is clear that a straightforward application of these easily accessible models could enable a learner to obtain a non-trivial portion of the overall available score.
arXiv Detail & Related papers (2023-03-16T13:58:45Z)
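The entry above describes models iterating against an auto-grader; purely as an illustration, such a generate-run-refine loop might be sketched as follows. The model name, prompts, and pytest-based grader are assumptions, not the paper's pipeline.

```python
# Hypothetical generate-run-refine sketch against an auto-grader;
# the model name, prompts, and pytest-based grader are placeholders,
# not the authors' actual pipeline.
import subprocess
import sys
from openai import OpenAI

client = OpenAI()

def run_autograder() -> tuple[bool, str]:
    """Run the course's test suite; return (passed, grader feedback)."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "tests/", "-q"],
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

def attempt(task: str, max_rounds: int = 3) -> bool:
    prompt = f"Write a Python solution for this assignment:\n{task}"
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        code = response.choices[0].message.content
        with open("solution.py", "w") as f:
            f.write(code)  # a real harness would strip markdown fences first
        passed, feedback = run_autograder()
        if passed:
            return True
        # Feed the grader's output back to the model and try again.
        prompt = (
            f"Your previous solution failed the auto-grader with:\n{feedback}\n"
            f"Fix your solution to this assignment:\n{task}"
        )
    return False
```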
- Image Quality Assessment in the Modern Age [53.19271326110551]
This tutorial provides the audience with the basic theories, methodologies, and current progress of image quality assessment (IQA).
We will first revisit several subjective quality assessment methodologies, with emphasis on how to properly select visual stimuli.
Both hand-engineered and (deep) learning-based methods will be covered.
arXiv Detail & Related papers (2021-10-19T02:38:46Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
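To make the over-stability test above concrete, a minimal sketch might pad an essay with off-topic sentences and compare the scores before and after; the scorer below is a stub, not the toolkit's actual models.

```python
# Hypothetical over-stability probe: pad an essay with off-topic
# sentences (up to ~25% of its length) and compare scores.
# `score_essay` is a stub standing in for a trained AES model.
import random

OFF_TOPIC = [
    "The mitochondria is the powerhouse of the cell.",
    "Stock prices fluctuated sharply last quarter.",
    "Rainfall in the region peaks during early summer.",
]

def score_essay(text: str) -> float:
    """Stub scorer; plug a real AES model's predict function in here."""
    return 7.5

def perturb(essay: str, fraction: float = 0.25) -> str:
    """Append unrelated sentences totalling ~`fraction` of the essay's words."""
    target = int(len(essay.split()) * fraction)
    filler: list[str] = []
    while sum(len(s.split()) for s in filler) < target:
        filler.append(random.choice(OFF_TOPIC))
    return essay + " " + " ".join(filler)

essay = (
    "Community gardens strengthen neighborhoods by giving residents a "
    "shared project, improving access to fresh food, and turning unused "
    "lots into welcoming public spaces."
)
before, after = score_essay(essay), score_essay(perturb(essay))
print(f"Score before: {before:.2f}, after off-topic padding: {after:.2f}")
```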