Fine-tuning ChatGPT for Automatic Scoring
- URL: http://arxiv.org/abs/2310.10072v3
- Date: Tue, 26 Dec 2023 01:13:11 GMT
- Title: Fine-tuning ChatGPT for Automatic Scoring
- Authors: Ehsan Latif and Xiaoming Zhai
- Abstract summary: This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring student-written constructed responses.
We compare the performance of fine-tuned GPT-3.5 with a fine-tuned state-of-the-art language model from Google, BERT.
- Score: 1.4833692070415454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for
automatically scoring student-written constructed responses, using example
assessment tasks in science education. Recent studies of OpenAI's generative
model GPT-3.5 have demonstrated that it predicts natural language with high
accuracy and produces human-like responses. GPT-3.5 was trained on enormous
bodies of online text, such as journals and Wikipedia; direct use of the
pre-trained model is therefore insufficient for automatic scoring, because
students write in language that differs from the training material. This
implies that a domain-specific model, fine-tuned on task-specific data, can
enhance performance. In this study, we fine-tuned GPT-3.5 on six assessment
tasks using a diverse dataset of middle-school and high-school student
responses with expert scores. The six tasks comprise two multi-label and four
multi-class assessment tasks. We compare the performance of fine-tuned GPT-3.5
with that of a fine-tuned state-of-the-art language model from Google, BERT.
The results show that BERT, trained on an in-domain corpus constructed from
science questions and responses, achieved an average accuracy of 0.838
(SD = 0.069). GPT-3.5 shows a remarkable average increase of 9.1% in automatic
scoring accuracy (mean = 0.915, SD = 0.042) across the six tasks,
p = 0.001 < 0.05. Specifically, for the multi-label tasks (item 1 with 5
labels; item 2 with 10 labels), GPT-3.5 achieved significantly higher scoring
accuracy than BERT across all labels, with the second item showing a 7.1%
increase. Averaged over the four multi-class items, GPT-3.5's scoring accuracy
exceeded BERT's by 10.6%. Our study confirms the effectiveness of fine-tuned
GPT-3.5 for automatically scoring student responses on domain-specific
educational data with high accuracy. We have released the fine-tuned models
for public use and community engagement.
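To make the workflow the abstract describes concrete, here is a minimal sketch of how such a fine-tune could be set up with OpenAI's fine-tuning API. The file name, system prompt, and rubric label are illustrative assumptions, not the authors' released configuration.

```python
# A minimal sketch (not the authors' released pipeline) of fine-tuning
# GPT-3.5 on expert-scored responses via OpenAI's fine-tuning API.
import json

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# 1. Write expert-scored student responses in the chat fine-tuning format.
#    (The real API requires at least 10 examples; one is shown for brevity.)
examples = [
    {"response": "The gas particles spread out to fill the whole container.",
     "score": "Proficient"},  # hypothetical rubric label
]
with open("scoring_train.jsonl", "w") as f:
    for ex in examples:
        record = {"messages": [
            {"role": "system",
             "content": "Score this student science response against the rubric."},
            {"role": "user", "content": ex["response"]},
            {"role": "assistant", "content": ex["score"]},
        ]}
        f.write(json.dumps(record) + "\n")

# 2. Upload the training file and start a fine-tuning job on GPT-3.5.
train_file = client.files.create(file=open("scoring_train.jsonl", "rb"),
                                 purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=train_file.id,
                                     model="gpt-3.5-turbo")

# 3. After the job succeeds, job.fine_tuned_model names the tuned model;
#    new responses can then be scored with client.chat.completions.create.
```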
Related papers
- GPT-3.5 for Grammatical Error Correction [0.4757470449749875]
This paper investigates the application of GPT-3.5 for Grammatical Error Correction (GEC) in multiple languages.
We conduct automatic evaluations of the corrections proposed by GPT-3.5 using several methods.
For English, GPT-3.5 demonstrates high recall, generates fluent corrections, and generally preserves sentence semantics.
However, human evaluation for both English and Russian reveals that, despite strong error-detection capabilities, GPT-3.5 struggles with several error types.
arXiv Detail & Related papers (2024-05-14T09:51:09Z)
- Applying Large Language Models and Chain-of-Thought for Automatic Scoring [23.076596289069506]
This study investigates the application of large language models (LLMs) in the automatic scoring of student-written responses to science assessments.
We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools.
arXiv Detail & Related papers (2023-11-30T21:22:43Z)
- Using GPT-4 to Augment Unbalanced Data for Automatic Scoring [0.5586073503694489]
We introduce a novel text data augmentation framework leveraging GPT-4, a generative large language model.
We crafted prompts for GPT-4 to generate responses, especially for minority scoring classes.
We fine-tuned DistilBERT for automatic scoring on both the augmented and original datasets (a hedged sketch of this augmentation step appears after the list below).
arXiv Detail & Related papers (2023-10-25T01:07:50Z)
- ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275]
E-commerce platforms require structured product data in the form of attribute-value pairs.
BERT-based extraction methods require large amounts of task-specific training data.
This paper explores using large language models (LLMs) as a more training-data efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z)
- Exploring Small Language Models with Prompt-Learning Paradigm for Efficient Domain-Specific Text Classification [2.410463233396231]
Small language models (SLMs) offer significant customizability, adaptability, and cost-effectiveness for domain-specific tasks.
In few-shot settings, when prompt-based model fine-tuning is possible, T5-base, a typical SLM with 220M parameters, achieves approximately 75% accuracy with limited labeled data.
In zero-shot settings with a fixed model, GPT-3.5-turbo, with around 154B parameters, garners an accuracy of only 55.16%, underscoring that well-designed prompts are pivotal.
arXiv Detail & Related papers (2023-09-26T09:24:46Z)
- InheritSumm: A General, Versatile and Compact Summarizer by Distilling from GPT [75.29359361404073]
InheritSumm is a versatile and compact summarization model derived from GPT-3.5 through distillation.
It achieves similar or superior performance to GPT-3.5 in zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-05-22T14:52:32Z)
- GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
- Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that humans overwhelmingly prefer GPT-3 summaries prompted with only a task description, and that these summaries do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- Improving Short Text Classification With Augmented Data Using GPT-3 [0.0]
GPT-3 is a large-scale natural language model developed by OpenAI.
This study teaches GPT-3 to classify whether a question is related to data science by augmenting a small training set with additional examples.
We find that while the augmented Completion endpoint achieves upwards of 80 percent validation accuracy, the augmented Classification endpoint yields more consistent accuracy on unseen examples.
arXiv Detail & Related papers (2022-05-23T01:10:38Z)
- Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
arXiv Detail & Related papers (2020-05-28T17:29:03Z)
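As noted in the entry on "Using GPT-4 to Augment Unbalanced Data for Automatic Scoring" above, here is a hedged sketch of that paper's minority-class augmentation idea: prompt GPT-4 for synthetic responses in an underrepresented score band, then add them to the training pool before fine-tuning. The item stem, label name, and prompt wording are assumptions for illustration, not the paper's actual prompts.

```python
# A hedged sketch of minority-class data augmentation with GPT-4.
# The item stem, score label, and prompt text below are illustrative
# assumptions, not taken from the cited paper.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def augment_minority_class(item_stem: str, score_label: str, n: int) -> list[str]:
    """Ask GPT-4 for n synthetic student responses at a given score level."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Assessment item: {item_stem}\n"
                f"Write {n} distinct responses, in a middle-school voice, "
                f"that a rubric would score as '{score_label}'. One per line."
            ),
        }],
    )
    return completion.choices[0].message.content.splitlines()

# Example: pad a rare top score band before fine-tuning DistilBERT.
# synthetic = augment_minority_class("Explain why ice floats.", "Proficient", 20)
```

Generating several responses per call and filtering near-duplicates keeps the augmented classes diverse at reasonable API cost; the paper's actual prompting and filtering choices may differ.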