Efficient Classification of Student Help Requests in Programming Courses Using Large Language Models
- URL: http://arxiv.org/abs/2310.20105v1
- Date: Tue, 31 Oct 2023 00:56:33 GMT
- Title: Efficient Classification of Student Help Requests in Programming Courses Using Large Language Models
- Authors: Jaromir Savelka, Paul Denny, Mark Liffiton, Brad Sheese
- Abstract summary: This study evaluates the performance of the GPT-3.5 and GPT-4 models for classifying help requests from students in an introductory programming class.
Fine-tuning the GPT-3.5 model improved its performance to such an extent that it approximated the accuracy and consistency across categories observed between two human raters.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The accurate classification of student help requests with respect to the type of help being sought can enable the tailoring of effective responses. Automatically classifying such requests is non-trivial, but large language models (LLMs) appear to offer an accessible, cost-effective solution. This study evaluates the performance of the GPT-3.5 and GPT-4 models for classifying help requests from students in an introductory programming class. In zero-shot trials, GPT-3.5 and GPT-4 exhibited comparable performance on most categories, while GPT-4 outperformed GPT-3.5 in classifying sub-categories for requests related to debugging. Fine-tuning the GPT-3.5 model improved its performance to such an extent that it approximated the accuracy and consistency across categories observed between two human raters. Overall, this study demonstrates the feasibility of using LLMs to enhance educational systems through the automated classification of student needs.
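As an illustration of the zero-shot setup evaluated in the paper, the sketch below classifies a single help request with the OpenAI chat completions API (v1.x Python SDK). The category labels and prompt wording are placeholders, not the paper's actual taxonomy or prompts.

```python
# Minimal zero-shot classification sketch; CATEGORIES is a hypothetical
# label set, not the taxonomy used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIES = ["debugging", "implementation", "understanding", "other"]

def classify_help_request(request_text: str) -> str:
    """Map a student help request onto exactly one category label."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # near-deterministic output suits classification
        messages=[
            {"role": "system",
             "content": "Classify the student's help request into exactly one of: "
                        + ", ".join(CATEGORIES) + ". Reply with the label only."},
            {"role": "user", "content": request_text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(classify_help_request("My loop never stops and I don't know why."))
```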
Related papers
- Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions [2.0411082897313984]
This study investigates how LLMs, specifically GPT-3.5 and GPT-4, can develop tailored questions for Grade 9 math.
By utilizing an iterative method, these models adjust questions based on difficulty and content, responding to feedback from a simulated 'student' model.
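The iterative loop described above might look something like this sketch, in which both the question writer and the simulated 'student' are chat-model calls; the prompts and the fixed three-round stopping rule are assumptions, not the study's protocol.

```python
# Hypothetical generate -> simulate -> revise loop; all prompt text is invented.
from openai import OpenAI

client = OpenAI()

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

question = ask("You write Grade 9 math questions.",
               "Write one question on linear equations.")
for _ in range(3):  # number of refinement rounds chosen arbitrarily here
    attempt = ask("You are a typical Grade 9 student. Attempt the question.", question)
    question = ask("You write Grade 9 math questions.",
                   f"Question: {question}\nStudent attempt: {attempt}\n"
                   "Revise the question so its difficulty better fits this student.")
print(question)
```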
arXiv Detail & Related papers (2024-06-20T00:25:43Z)
- How Can I Improve? Using GPT to Highlight the Desired and Undesired Parts of Open-ended Responses [11.809647985607935]
We explore a sequence labeling approach focused on identifying components of desired and less desired praise for providing explanatory feedback.
To quantify the quality of highlighted praise components identified by GPT models, we introduced a Modified Intersection over Union (M-IoU) score.
Our findings demonstrate that: (1) the M-IoU score effectively correlates with human judgment in evaluating sequence quality; (2) using two-shot prompting on GPT-3.5 resulted in decent performance in recognizing effort-based and outcome-based praise; and (3) our optimally fine-tuned GPT-3.5 model achieved M-IoU scores of 0.6
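The modified score itself is defined in the paper; as background, the plain intersection-over-union between a predicted and a gold set of highlighted token positions is computed as below (the paper's modification is not reproduced here).

```python
def span_iou(pred: set[int], gold: set[int]) -> float:
    """Plain IoU between two sets of token indices; M-IoU modifies this base quantity."""
    if not pred and not gold:
        return 1.0  # both empty: treat as perfect agreement
    return len(pred & gold) / len(pred | gold)

# Predicted highlight covers tokens 3-7, gold highlight covers tokens 4-8:
print(span_iou(set(range(3, 8)), set(range(4, 9))))  # 4 shared / 6 in union = 0.667
```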
arXiv Detail & Related papers (2024-05-01T02:59:10Z)
- Exploiting GPT-4 Vision for Zero-shot Point Cloud Understanding [114.4754255143887]
We tackle the challenge of classifying the object category in point clouds.
We employ GPT-4 Vision (GPT-4V) to overcome these challenges.
We set a new benchmark in zero-shot point cloud classification.
arXiv Detail & Related papers (2024-01-15T10:16:44Z)
- Which AI Technique Is Better to Classify Requirements? An Experiment with SVM, LSTM, and ChatGPT [0.4588028371034408]
This paper reports an empirical evaluation of two ChatGPT models for requirements classification.
Our results show that there is no single best technique for all types of requirement classes.
The few-shot setting was found to be beneficial primarily in scenarios where zero-shot results are particularly low.
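As a sketch of what the few-shot setting means in practice, the snippet below assembles a prompt from a couple of labeled exemplars before appending the requirement to classify; the requirements, labels, and prompt format are invented for illustration.

```python
# Few-shot prompt construction; exemplars and label set are hypothetical.
SHOTS = [
    ("The system shall respond to queries within 2 seconds.", "performance"),
    ("Users shall be able to reset their password by email.", "functional"),
]

def few_shot_prompt(requirement: str) -> str:
    lines = ["Classify each requirement as functional, performance, or security."]
    for text, label in SHOTS:
        lines.append(f"Requirement: {text}\nLabel: {label}")
    lines.append(f"Requirement: {requirement}\nLabel:")
    return "\n\n".join(lines)

print(few_shot_prompt("All stored data shall be encrypted at rest."))
```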
arXiv Detail & Related papers (2023-11-20T05:55:05Z)
- Investigating the Limitation of CLIP Models: The Worst-Performing Categories [53.360239882501325]
Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts.
It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts.
However, we found that their performance in the worst categories is significantly inferior to the overall performance.
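The gap in question is between overall accuracy and the accuracy of the weakest class. Given per-example labels and zero-shot predictions (toy data below, not actual CLIP output), surfacing the worst-performing category is straightforward:

```python
from collections import defaultdict

def per_class_accuracy(labels, preds):
    """Accuracy per ground-truth class; the minimum exposes the worst category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, preds):
        totals[y] += 1
        hits[y] += (y == p)
    return {c: hits[c] / totals[c] for c in totals}

# Toy data: overall accuracy is 0.75, yet class "lynx" sits at 0.0.
labels = ["cat", "cat", "dog", "lynx"]
preds  = ["cat", "cat", "dog", "cat"]
acc = per_class_accuracy(labels, preds)
print(acc, "worst:", min(acc, key=acc.get))
```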
arXiv Detail & Related papers (2023-10-05T05:37:33Z)
- Performance of the Pre-Trained Large Language Model GPT-4 on Automated Short Answer Grading [0.0]
We studied the performance of GPT-4 on the standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle.
We found that the performance of the pre-trained general-purpose GPT-4 LLM is comparable to hand-engineered models, but worse than pre-trained LLMs that had specialized training.
arXiv Detail & Related papers (2023-09-17T18:04:34Z)
- Stay on topic with Classifier-Free Guidance [57.28934343207042]
We show that CFG can be used broadly as an inference-time technique in pure language modeling.
We show that CFG improves the performance of Pythia, GPT-2 and LLaMA-family models across an array of tasks.
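The core update in classifier-free guidance blends conditional and unconditional next-token logits; the sketch below shows the standard formulation on toy values (details such as where renormalization happens vary by implementation):

```python
import numpy as np

def cfg_logits(cond: np.ndarray, uncond: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """Classifier-free guidance over next-token logits.
    gamma = 1 recovers the conditional logits; gamma > 1 amplifies the prompt's effect."""
    return uncond + gamma * (cond - uncond)

cond = np.array([2.0, 0.5, -1.0])   # logits given the prompt
uncond = np.array([1.0, 1.0, 0.0])  # logits with an empty (null) prompt
print(cfg_logits(cond, uncond))     # [2.5, 0.25, -1.5]
```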
arXiv Detail & Related papers (2023-06-30T17:07:02Z)
- Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z)
- Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification [59.698811329287174]
We leverage GPT-2 for generating artificial training instances in order to improve classification performance.
Our results show that fine-tuning GPT-2 on a handful of labeled instances leads to consistent classification improvements.
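The generation half of that pipeline might look like the sketch below, which assumes GPT-2 was first fine-tuned on label-prefixed examples; the "label: text" format is an assumption, not necessarily the paper's scheme.

```python
# Generating synthetic labeled examples with Hugging Face transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # swap in the fine-tuned checkpoint

def synthesize(label: str, n: int = 3) -> list[str]:
    prompt = f"{label}: "
    outputs = generator(prompt, max_new_tokens=30, num_return_sequences=n,
                        do_sample=True, temperature=0.9)
    return [o["generated_text"][len(prompt):].strip() for o in outputs]

for text in synthesize("positive"):
    print("positive ->", text)
```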
arXiv Detail & Related papers (2021-11-17T12:10:03Z)
- Reframing Instructional Prompts to GPTk's Language [72.69833640335519]
We propose reframing techniques for model designers to create effective prompts for language models.
Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity.
The performance gains are particularly important for large language models such as GPT-3, where tuning models or prompts on large datasets is not feasible.
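As a rough illustration of what reframing can mean (the wording below is invented, not taken from the paper), one long abstract instruction is decomposed into short, low-level steps:

```python
# Invented before/after pair illustrating instruction decomposition.
ORIGINAL = ("Generate a question that tests comprehension of the passage and "
            "whose answer is a span of the passage, avoiding yes/no questions.")

REFRAMED = ("Read the passage.\n"
            "Step 1: Pick a short span from the passage.\n"
            "Step 2: Write a question whose answer is exactly that span.\n"
            "Step 3: Do not write a yes/no question.")

print(REFRAMED)
```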
arXiv Detail & Related papers (2021-09-16T09:44:43Z)