Using GPT-4 to Augment Unbalanced Data for Automatic Scoring
- URL: http://arxiv.org/abs/2310.18365v2
- Date: Sat, 18 Nov 2023 02:05:27 GMT
- Title: Using GPT-4 to Augment Unbalanced Data for Automatic Scoring
- Authors: Luyang Fang, Gyeong-Geon Lee and Xiaoming Zhai
- Abstract summary: We introduce a novel text data augmentation framework using GPT-4, a generative large language model.
We crafted prompts for GPT-4 to generate responses resembling student-written answers, particularly for minority scoring classes.
We fine-tuned DistilBERT for automatic scoring based on the augmented and original datasets.
- Score: 0.6278186810520364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning-based automatic scoring can be challenging if students'
responses are unbalanced across scoring categories, as it introduces
uncertainty in the machine training process. To meet this challenge, we
introduce a novel text data augmentation framework using GPT-4, a generative
large language model, specifically tailored for unbalanced datasets in
automatic scoring. Our experimental dataset comprised student-written responses
to two science items. We crafted prompts for GPT-4 to generate responses
resembling student-written answers, particularly for the minority scoring
classes, to augment the data. We then fine-tuned DistilBERT for automatic
scoring based on the augmented and original datasets. Model performance was
assessed using accuracy, precision, recall, and F1 score. We incorporated
varied amounts of augmented data to examine scoring performance, and our
findings revealed remarkably improved model performance. The average maximum
increase observed across the two items is 3.5% for accuracy, 30.6% for
precision, 21.1% for recall, and 24.2% for F1 score. Notably, using just 5% of
the augmented data led to substantial improvements of 2.6%, 29.2%, 15.1%, and
19.6% on the same four metrics, respectively. Interestingly,
the extent of improvement varied depending on specific datasets. Moreover, we
found that a varying amount of augmented data (5%-40%) was needed to obtain a
stable improvement. We also compared models trained with GPT-4-augmented data
against those trained with additional student-written responses. The findings
indicate that the former match or even exceed the performance of the latter,
with average differences of 1.7%, 1.9%, 11.0%, and 7.8% on the four metrics,
respectively. This research underscores the potential and
effectiveness of data augmentation techniques utilizing GPT-4 in addressing
unbalanced datasets within automated assessment.
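To make the pipeline concrete, here is a minimal sketch of the augmentation stage, assuming the openai Python client (>=1.0) and an illustrative prompt; the paper's actual prompts, label set, and generation settings are not given in the abstract.

```python
# Stage 1 sketch: ask GPT-4 for synthetic minority-class responses.
# Prompt wording and temperature are illustrative guesses, not the authors' settings.
from openai import OpenAI
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def augment_minority_class(item_stem: str, examples: list[str], n: int) -> list[str]:
    """Ask GPT-4 for n new responses resembling real minority-class answers."""
    prompt = (
        f"Science item: {item_stem}\n"
        "Real student responses from an under-represented score level:\n"
        + "\n".join(f"- {e}" for e in examples)
        + f"\nWrite {n} new responses in the same style, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # sample broadly so synthetic answers vary
    )
    text = resp.choices[0].message.content
    return [ln.lstrip("- ").strip() for ln in text.splitlines() if ln.strip()]

def report(y_true, y_pred):
    # The paper's four metrics; macro averaging here is an assumption.
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": p, "recall": r, "f1": f1}
```

And a matching sketch of the scoring stage, fine-tuning DistilBERT with Hugging Face Transformers; the tiny inline dataset, num_labels, and training arguments are placeholders.

```python
# Stage 2 sketch: fine-tune DistilBERT on original + augmented responses.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)  # e.g. score levels 0-2

# Stand-in for the real original + GPT-4-augmented responses.
train_ds = Dataset.from_dict({
    "text": ["the ice melts because heat flows into it", "it just gets cold"],
    "label": [2, 0],
})

def tokenize(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scorer", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds.map(tokenize, batched=True),
)
trainer.train()
```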
Related papers
- Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback [110.16220825629749]
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models.
In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts.
Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements.
arXiv Detail & Related papers (2024-06-13T16:17:21Z)
- Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
This text quality metric provides a framework for identifying and eliminating low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
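The summary does not spell out the metric itself, so the following is only a generic sketch of quality-based pruning with a crude stand-in heuristic: score every document, then drop the lowest-scoring fraction before training.

```python
# Generic quality-based pruning sketch. The scoring heuristic (word repetition
# and average word length) is NOT the paper's metric, which the abstract does
# not specify; it only illustrates the score-then-filter pattern.
import numpy as np

def quality_score(doc: str) -> float:
    words = doc.split()
    if not words:
        return 0.0
    uniq_ratio = len(set(words)) / len(words)    # penalize repetitive text
    avg_len = sum(map(len, words)) / len(words)  # penalize fragment noise
    return uniq_ratio * min(avg_len / 5.0, 1.0)

def prune(corpus: list[str], drop_frac: float = 0.2) -> list[str]:
    scores = np.array([quality_score(d) for d in corpus])
    cutoff = np.quantile(scores, drop_frac)
    return [d for d, s in zip(corpus, scores) if s >= cutoff]
```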
arXiv Detail & Related papers (2024-04-26T18:01:25Z)
- CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? [72.19502317793133]
We study the effectiveness of data balancing for mitigating biases in contrastive language-image pretraining (CLIP).
We present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases.
arXiv Detail & Related papers (2024-03-07T14:43:17Z)
- Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4 [23.856839017006386]
Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud services.
The GPT-4 model's immense size presents challenges when trying to fine-tune it on user data.
We propose an in-context learning approach for automated root causing, which eliminates the need for fine-tuning.
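A minimal sketch of that in-context approach, assuming TF-IDF retrieval of similar past incidents and an illustrative prompt layout (the paper's actual retriever and template are not described in the summary):

```python
# In-context RCA sketch: retrieve a few similar historical incidents and place
# them in the prompt as worked examples instead of fine-tuning the model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_rca_prompt(new_incident: str, history: list[tuple[str, str]], k: int = 3) -> str:
    """history holds (incident description, known root cause) pairs."""
    docs = [d for d, _ in history]
    vec = TfidfVectorizer().fit(docs + [new_incident])
    sims = cosine_similarity(vec.transform([new_incident]), vec.transform(docs))[0]
    top = sims.argsort()[::-1][:k]  # k most similar past incidents
    shots = "\n\n".join(
        f"Incident: {history[i][0]}\nRoot cause: {history[i][1]}" for i in top)
    return f"{shots}\n\nIncident: {new_incident}\nRoot cause:"
```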
arXiv Detail & Related papers (2024-01-24T21:02:07Z)
- Applying Large Language Models and Chain-of-Thought for Automatic Scoring [23.076596289069506]
This study investigates the application of large language models (LLMs) in the automatic scoring of student-written responses to science assessments.
We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools.
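As a hedged illustration of how chain-of-thought prompting can be applied to scoring, the sketch below asks the model to reason against a rubric before emitting a score; the rubric text, output format, and model choice are placeholders, not the study's setup.

```python
# Chain-of-thought scoring sketch: the model reasons through a rubric before
# giving a score, which also yields an explanation for the grader.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score 2: names the mechanism and gives evidence.
Score 1: names the mechanism only.
Score 0: neither."""  # illustrative rubric, not the study's

def score_response(item: str, response: str) -> str:
    prompt = (
        f"Item: {item}\nRubric:\n{RUBRIC}\nStudent response: {response}\n"
        "Think through the rubric step by step, then end with 'Score: <0-2>'."
    )
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep scoring as deterministic as possible
    )
    return out.choices[0].message.content
```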
arXiv Detail & Related papers (2023-11-30T21:22:43Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Fine-tuning ChatGPT for Automatic Scoring [1.4833692070415454]
This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring student-written constructed responses.
We compare the performance of fine-tuned GPT-3.5 with that of fine-tuned BERT, Google's state-of-the-art language model.
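For reference, a fine-tune of GPT-3.5 via the OpenAI API would look roughly like the sketch below; the training file name and JSONL layout are placeholders, not the authors' data.

```python
# Fine-tuning sketch with the OpenAI client (>=1.0): upload a JSONL file of
# (response, score) chat examples, then start a fine-tune job.
from openai import OpenAI

client = OpenAI()

# Each JSONL line: {"messages": [{"role": "user", "content": "<student response>"},
#                                {"role": "assistant", "content": "<score>"}]}
training_file = client.files.create(file=open("scoring_train.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-3.5-turbo")
print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) until it finishes
```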
arXiv Detail & Related papers (2023-10-16T05:09:16Z)
- NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks [0.0]
Fine-tuning large language models inflates the costs of NLU applications.
Recent works in computer vision use data pruning to reduce training time.
We propose a curriculum which periodically scores and discards unimportant examples during finetuning.
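A minimal sketch of such a curriculum, using current per-example loss as the importance proxy (an assumption; the paper's scoring function may differ):

```python
# Dynamic pruning sketch: periodically rank training examples by an importance
# proxy and keep only the top fraction for the next epochs.
import numpy as np

def select_subset(example_ids: list[int], losses: np.ndarray, keep_frac: float) -> list[int]:
    """Keep the keep_frac highest-loss (hardest) examples."""
    k = max(1, int(len(example_ids) * keep_frac))
    order = np.argsort(losses)[::-1]  # descending loss
    return [example_ids[i] for i in order[:k]]

# In the training loop: every few epochs, recompute per-example losses and call
# select_subset(ids, losses, keep_frac=0.7) to rebuild the training sampler.
```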
arXiv Detail & Related papers (2023-06-05T19:30:41Z)
- Augmentation-Aware Self-Supervision for Data-Efficient GAN Training [68.81471633374393]
Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting.
We propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of the augmented data.
We compare our method with state-of-the-art (SOTA) methods using the class-conditional BigGAN and unconditional StyleGAN2 architectures.
arXiv Detail & Related papers (2022-05-31T10:35:55Z)
- Improving Auto-Augment via Augmentation-Wise Weight Sharing [123.71986174280741]
A key component of automatic augmentation search is the evaluation process for a particular augmentation policy.
In this paper, we dive into the dynamics of augmented training of the model.
We design a powerful and efficient proxy task based on the Augmentation-Wise Weight Sharing (AWS) to form a fast yet accurate evaluation process.
arXiv Detail & Related papers (2020-09-30T15:23:12Z)
- DARE: Data Augmented Relation Extraction with GPT-2 [0.26651200086513094]
We present Data Augmented Relation Extraction (DARE), a simple method to augment training data by properly fine-tuning GPT-2.
DARE achieves a new state of the art on three widely used biomedical RE datasets, surpassing the previous best results by 4.7 F1 points on average.
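A simplified sketch of the DARE idea follows; note that DARE proper fine-tunes GPT-2 on each relation class's data first, whereas this off-the-shelf GPT-2 variant only illustrates the generate-then-label loop.

```python
# DARE-style augmentation sketch: seed GPT-2 with sentences from one relation
# class and label each generated sentence with that same relation.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def augment_relation(seed_sentences: list[str], n: int = 3) -> list[str]:
    prompt = "\n".join(seed_sentences) + "\n"
    outs = generator(prompt, max_new_tokens=40, num_return_sequences=n,
                     do_sample=True, temperature=0.9)
    # Keep only the first newly generated line of each continuation.
    return [o["generated_text"][len(prompt):].split("\n")[0] for o in outs]
```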
arXiv Detail & Related papers (2020-04-06T14:38:36Z)