NERIF: GPT-4V for Automatic Scoring of Drawn Models
- URL: http://arxiv.org/abs/2311.12990v2
- Date: Sun, 24 Dec 2023 04:23:29 GMT
- Title: NERIF: GPT-4V for Automatic Scoring of Drawn Models
- Authors: Gyeong-Geon Lee and Xiaoming Zhai
- Abstract summary: Recently released GPT-4V provides a unique opportunity to advance scientific modeling practices.
We developed a method employing instructional notes and rubrics to prompt GPT-4V to score students' drawn models.
GPT-4V scores were compared with human experts' scores to calculate scoring accuracy.
- Score: 0.6278186810520364
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Scoring student-drawn models is time-consuming. Recently released GPT-4V
provides a unique opportunity to advance scientific modeling practices by
leveraging its powerful image processing capability. To test this ability
specifically for automatic scoring, we developed NERIF (Notation-Enhanced
Rubric Instruction for Few-shot Learning), a method that employs
instructional notes and rubrics to prompt GPT-4V to score students' drawn models
for science phenomena. We randomly selected a set of balanced data (N = 900)
that includes student-drawn models for six modeling assessment tasks. Each
model received a score from GPT-4V at one of three levels, 'Beginning,'
'Developing,' or 'Proficient,' according to the scoring rubrics. GPT-4V scores were
compared with human experts' scores to calculate scoring accuracy. Results show
that GPT-4V's average scoring accuracy was mean = .51, SD = .037. Specifically,
average scoring accuracy was .64 for the 'Beginning' class, .62 for the
'Developing' class, and .26 for the 'Proficient' class, indicating that more
proficient models are more challenging to score. Further qualitative study
reveals how GPT-4V retrieves information from image input, including problem
context, example evaluations provided by human coders, and students' drawing
models. We also uncovered how GPT-4V captures the characteristics of
student-drawn models and describes them in natural language. Finally, we
demonstrated how GPT-4V assigns scores to student-drawn models according to the
given scoring rubric and instructional notes. Our findings suggest that
NERIF is an effective approach for employing GPT-4V to score drawn models. Even
though there is room to improve GPT-4V's scoring accuracy, some mis-assigned
scores seemed interpretable to experts. The results of this study show that
utilizing GPT-4V for automatic scoring of student-drawn models is promising.
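As a rough illustration of the kind of call NERIF implies (a minimal sketch, not the authors' released implementation; the rubric text, instructional note, model name, and function names below are placeholder assumptions), a scoring request could bundle the rubric, the instructional note, few-shot example drawings, and the student's drawn model into a single GPT-4V prompt:

```python
# Sketch of a NERIF-style GPT-4V scoring call (assumed prompt wording and names).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Score the drawn model at one of three levels:
- Proficient: the drawing models the relevant entities, their interactions, and the causal mechanism.
- Developing: some relevant entities are drawn, but the mechanism is incomplete.
- Beginning: the drawing does not model the phenomenon at the required level."""

INSTRUCTIONAL_NOTE = (
    "Read the notations on the example images, compare the student's drawing "
    "against the rubric, and answer with exactly one label: "
    "Beginning, Developing, or Proficient."
)

def encode_image(path: str) -> str:
    """Return a base64 data URL for a local PNG so it can be sent inline."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def score_drawn_model(student_png: str, example_pngs: list[str]) -> str:
    # Text part: rubric plus instructional note.
    content = [{"type": "text", "text": RUBRIC + "\n\n" + INSTRUCTIONAL_NOTE}]
    # Few-shot part: human-scored example drawings with notations.
    for path in example_pngs:
        content.append({"type": "image_url", "image_url": {"url": encode_image(path)}})
    # The student's drawn model to be scored.
    content.append({"type": "image_url", "image_url": {"url": encode_image(student_png)}})

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed GPT-4V endpoint name
        messages=[{"role": "user", "content": content}],
        max_tokens=20,
    )
    return response.choices[0].message.content.strip()

# Example usage (hypothetical file names):
# label = score_drawn_model("student_042.png", ["example_beginning.png", "example_proficient.png"])
```

The returned label can then be compared against the human expert's score to compute the per-class accuracies reported above.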
Related papers
- Self-Judge: Selective Instruction Following with Alignment Self-Evaluation [27.69410513313001]
We study selective instruction following, whereby the system declines to execute instructions when the anticipated response quality is low.
We introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores.
arXiv Detail & Related papers (2024-09-02T04:14:13Z) - An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4 [29.93673872618022]
Fine-tuned judge models are claimed to achieve evaluation capability comparable to GPT-4.
We introduce a method that leverages GPT-4 to compensate for these limitations and improve the fine-tuned judges.
Experiment results show our method achieves accuracy on par with GPT-4 with only 50% of the API expense.
arXiv Detail & Related papers (2024-03-05T10:20:52Z) - A Critical Evaluation of AI Feedback for Aligning Large Language Models [60.42291111149438]
We show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines.
More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models.
arXiv Detail & Related papers (2024-02-19T18:53:54Z) - Gemini Pro Defeated by GPT-4V: Evidence from Education [1.0226894006814744]
GPT-4V significantly outperforms Gemini Pro in terms of scoring accuracy and Quadratic Weighted Kappa.
Findings suggest GPT-4V's superior capability in handling complex educational tasks.
arXiv Detail & Related papers (2023-12-27T02:56:41Z) - GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.
We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds.
Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z) - Using GPT-4 to Augment Unbalanced Data for Automatic Scoring [0.5586073503694489]
We introduce a novel text data augmentation framework leveraging GPT-4, a generative large language model.
We crafted prompts for GPT-4 to generate responses, especially for minority scoring classes.
We fine-tuned DistilBERT for automatic scoring on the augmented and original datasets.
arXiv Detail & Related papers (2023-10-25T01:07:50Z) - Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z) - Instruction Tuning with GPT-4 [107.55078894215798]
We present the first attempt to use GPT-4 to generate instruction-following data for finetuning large language models.
Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks.
arXiv Detail & Related papers (2023-04-06T17:58:09Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)