Using GPT-4 to Augment Unbalanced Data for Automatic Scoring
- URL: http://arxiv.org/abs/2310.18365v2
- Date: Sat, 18 Nov 2023 02:05:27 GMT
- Title: Using GPT-4 to Augment Unbalanced Data for Automatic Scoring
- Authors: Luyang Fang, Gyeong-Geon Lee and Xiaoming Zhai
- Abstract summary: We introduce a novel text data augmentation framework using GPT-4, a generative large language model.
We crafted prompts for GPT-4 to generate responses resembling student-written answers, particularly for minority scoring classes.
We fine-tuned DistilBERT for automatic scoring based on the augmented and original datasets.
- Score: 0.6278186810520364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning-based automatic scoring can be challenging if students'
responses are unbalanced across scoring categories, as it introduces
uncertainty in the machine training process. To meet this challenge, we
introduce a novel text data augmentation framework using GPT-4, a generative
large language model, specifically tailored for unbalanced datasets in
automatic scoring. Our experimental dataset comprised student-written responses
to two science items. We crafted prompts for GPT-4 to generate responses
resembling student-written answers, particularly for the minority scoring
classes, to augment the data. We then fine-tuned DistilBERT for automatic
scoring based on the augmented and original datasets. Model performance was
assessed using accuracy, precision, recall, and F1 score. We incorporated
varied amounts of augmented data to examine scoring performance, and our
findings revealed remarkably improved model performance. The average maximum
increase observed across the two items is 3.5% for accuracy, 30.6% for
precision, 21.1% for recall, and 24.2% for F1 score. Notably, using just 5% of
the augmented data led to substantial improvements of 2.6%, 29.2%, 15.1%, and
19.6% on the same four metrics, respectively. Interestingly,
the extent of improvement varied depending on specific datasets. Moreover, we
found that a varying amount of augmented data (5%-40%) was needed to obtain a
stable improvement. We also compared models trained with GPT-4-augmented data
against those trained with additional student-written responses. The findings
indicate that the former match or even exceed the performance of the latter,
with average differences of 1.7%, 1.9%, 11.0%, and 7.8% on the four metrics,
respectively. This research underscores the potential and
effectiveness of data augmentation techniques utilizing GPT-4 in addressing
unbalanced datasets within automated assessment.
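To make the pipeline concrete, here is a minimal sketch of the augmentation stage, assuming the openai Python client (>=1.0) and an illustrative prompt; the paper's actual prompts, label set, and generation settings are not given in the abstract.

```python
# Stage 1 sketch: ask GPT-4 for synthetic minority-class responses.
# Prompt wording and temperature are illustrative guesses, not the authors' settings.
from openai import OpenAI
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def augment_minority_class(item_stem: str, examples: list[str], n: int) -> list[str]:
    """Ask GPT-4 for n new responses resembling real minority-class answers."""
    prompt = (
        f"Science item: {item_stem}\n"
        "Real student responses from an under-represented score level:\n"
        + "\n".join(f"- {e}" for e in examples)
        + f"\nWrite {n} new responses in the same style, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # sample broadly so synthetic answers vary
    )
    text = resp.choices[0].message.content
    return [ln.lstrip("- ").strip() for ln in text.splitlines() if ln.strip()]

def report(y_true, y_pred):
    # The paper's four metrics; macro averaging here is an assumption.
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": p, "recall": r, "f1": f1}
```

And a matching sketch of the scoring stage, fine-tuning DistilBERT with Hugging Face Transformers; the tiny inline dataset, num_labels, and training arguments are placeholders.

```python
# Stage 2 sketch: fine-tune DistilBERT on original + augmented responses.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)  # e.g. score levels 0-2

# Stand-in for the real original + GPT-4-augmented responses.
train_ds = Dataset.from_dict({
    "text": ["the ice melts because heat flows into it", "it just gets cold"],
    "label": [2, 0],
})

def tokenize(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scorer", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds.map(tokenize, batched=True),
)
trainer.train()
```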
Related papers
- Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback [110.16220825629749]
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models.
In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts.
Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements.
arXiv Detail & Related papers (2024-06-13T16:17:21Z)
- Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
This text quality metric provides a framework for identifying and eliminating low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
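The summary does not spell out the metric itself, so the following is only a generic sketch of quality-based pruning with a crude stand-in heuristic: score every document, then drop the lowest-scoring fraction before training.

```python
# Generic quality-based pruning sketch. The scoring heuristic (word repetition
# and average word length) is NOT the paper's metric, which the abstract does
# not specify; it only illustrates the score-then-filter pattern.
import numpy as np

def quality_score(doc: str) -> float:
    words = doc.split()
    if not words:
        return 0.0
    uniq_ratio = len(set(words)) / len(words)    # penalize repetitive text
    avg_len = sum(map(len, words)) / len(words)  # penalize fragment noise
    return uniq_ratio * min(avg_len / 5.0, 1.0)

def prune(corpus: list[str], drop_frac: float = 0.2) -> list[str]:
    scores = np.array([quality_score(d) for d in corpus])
    cutoff = np.quantile(scores, drop_frac)
    return [d for d, s in zip(corpus, scores) if s >= cutoff]
```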
arXiv Detail & Related papers (2024-04-26T18:01:25Z)
- CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? [72.19502317793133]
We study the effectiveness of data balancing for mitigating biases in contrastive language-image pretraining (CLIP).
We present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases.
arXiv Detail & Related papers (2024-03-07T14:43:17Z)
- Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4 [23.856839017006386]
Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud services.
The GPT-4 model's immense size presents challenges when trying to fine-tune it on user data.
We propose an in-context learning approach for automated root causing, which eliminates the need for fine-tuning.
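A minimal sketch of that in-context approach, assuming TF-IDF retrieval of similar past incidents and an illustrative prompt layout (the paper's actual retriever and template are not described in the summary):

```python
# In-context RCA sketch: retrieve a few similar historical incidents and place
# them in the prompt as worked examples instead of fine-tuning the model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_rca_prompt(new_incident: str, history: list[tuple[str, str]], k: int = 3) -> str:
    """history holds (incident description, known root cause) pairs."""
    docs = [d for d, _ in history]
    vec = TfidfVectorizer().fit(docs + [new_incident])
    sims = cosine_similarity(vec.transform([new_incident]), vec.transform(docs))[0]
    top = sims.argsort()[::-1][:k]  # k most similar past incidents
    shots = "\n\n".join(
        f"Incident: {history[i][0]}\nRoot cause: {history[i][1]}" for i in top)
    return f"{shots}\n\nIncident: {new_incident}\nRoot cause:"
```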
arXiv Detail & Related papers (2024-01-24T21:02:07Z)
- Applying Large Language Models and Chain-of-Thought for Automatic Scoring [23.076596289069506]
This study investigates the application of large language models (LLMs) in the automatic scoring of student-written responses to science assessments.
We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools.
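As a hedged illustration of how chain-of-thought prompting can be applied to scoring, the sketch below asks the model to reason against a rubric before emitting a score; the rubric text, output format, and model choice are placeholders, not the study's setup.

```python
# Chain-of-thought scoring sketch: the model reasons through a rubric before
# giving a score, which also yields an explanation for the grader.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score 2: names the mechanism and gives evidence.
Score 1: names the mechanism only.
Score 0: neither."""  # illustrative rubric, not the study's

def score_response(item: str, response: str) -> str:
    prompt = (
        f"Item: {item}\nRubric:\n{RUBRIC}\nStudent response: {response}\n"
        "Think through the rubric step by step, then end with 'Score: <0-2>'."
    )
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep scoring as deterministic as possible
    )
    return out.choices[0].message.content
```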
arXiv Detail & Related papers (2023-11-30T21:22:43Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Fine-tuning ChatGPT for Automatic Scoring [1.4833692070415454]
This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring student-written constructed responses.
We compare the performance of fine-tuned GPT-3.5 with that of fine-tuned BERT, Google's state-of-the-art language model.
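For reference, a fine-tune of GPT-3.5 via the OpenAI API would look roughly like the sketch below; the training file name and JSONL layout are placeholders, not the authors' data.

```python
# Fine-tuning sketch with the OpenAI client (>=1.0): upload a JSONL file of
# (response, score) chat examples, then start a fine-tune job.
from openai import OpenAI

client = OpenAI()

# Each JSONL line: {"messages": [{"role": "user", "content": "<student response>"},
#                                {"role": "assistant", "content": "<score>"}]}
training_file = client.files.create(file=open("scoring_train.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-3.5-turbo")
print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) until it finishes
```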
arXiv Detail & Related papers (2023-10-16T05:09:16Z)
- NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks [0.0]
Fine-tuning large language models inflates the costs of NLU applications.
Recent works in computer vision use data pruning to reduce training time.
We propose a curriculum which periodically scores and discards unimportant examples during finetuning.
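A minimal sketch of such a curriculum, using current per-example loss as the importance proxy (an assumption; the paper's scoring function may differ):

```python
# Dynamic pruning sketch: periodically rank training examples by an importance
# proxy and keep only the top fraction for the next epochs.
import numpy as np

def select_subset(example_ids: list[int], losses: np.ndarray, keep_frac: float) -> list[int]:
    """Keep the keep_frac highest-loss (hardest) examples."""
    k = max(1, int(len(example_ids) * keep_frac))
    order = np.argsort(losses)[::-1]  # descending loss
    return [example_ids[i] for i in order[:k]]

# In the training loop: every few epochs, recompute per-example losses and call
# select_subset(ids, losses, keep_frac=0.7) to rebuild the training sampler.
```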
arXiv Detail & Related papers (2023-06-05T19:30:41Z)
- Augmentation-Aware Self-Supervision for Data-Efficient GAN Training [68.81471633374393]
Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting.
We propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of the augmented data.
We compare our method with state-of-the-art (SOTA) methods using the class-conditional BigGAN and unconditional StyleGAN2 architectures.
arXiv Detail & Related papers (2022-05-31T10:35:55Z)
- Improving Auto-Augment via Augmentation-Wise Weight Sharing [123.71986174280741]
A key component of automatic augmentation search is the evaluation process for a particular augmentation policy.
In this paper, we dive into the dynamics of augmented training of the model.
We design a powerful and efficient proxy task based on the Augmentation-Wise Weight Sharing (AWS) to form a fast yet accurate evaluation process.
arXiv Detail & Related papers (2020-09-30T15:23:12Z)
- DARE: Data Augmented Relation Extraction with GPT-2 [0.26651200086513094]
We present Data Augmented Relation Extraction (DARE), a simple method to augment training data by properly fine-tuning GPT-2.
DARE achieves a new state of the art on three widely used biomedical RE datasets, surpassing the previous best results by 4.7 F1 points on average.
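A simplified sketch of the DARE idea follows; note that DARE proper fine-tunes GPT-2 on each relation class's data first, whereas this off-the-shelf GPT-2 variant only illustrates the generate-then-label loop.

```python
# DARE-style augmentation sketch: seed GPT-2 with sentences from one relation
# class and label each generated sentence with that same relation.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def augment_relation(seed_sentences: list[str], n: int = 3) -> list[str]:
    prompt = "\n".join(seed_sentences) + "\n"
    outs = generator(prompt, max_new_tokens=40, num_return_sequences=n,
                     do_sample=True, temperature=0.9)
    # Keep only the first newly generated line of each continuation.
    return [o["generated_text"][len(prompt):].split("\n")[0] for o in outs]
```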
arXiv Detail & Related papers (2020-04-06T14:38:36Z)