Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise
Given to Students in Synthetic Dialogues
- URL: http://arxiv.org/abs/2307.02018v1
- Date: Wed, 5 Jul 2023 04:14:01 GMT
- Title: Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise
Given to Students in Synthetic Dialogues
- Authors: Dollaya Hirunyasiri, Danielle R. Thomas, Jionghao Lin, Kenneth R.
Koedinger, Vincent Aleven
- Abstract summary: Large language models, such as the AI-chatbot ChatGPT, hold potential for offering constructive feedback to tutors in practical settings.
The accuracy of AI-generated feedback remains uncertain, with scant research investigating the ability of models like ChatGPT to deliver effective feedback.
- Score: 2.3361634876233817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research suggests that providing specific and timely feedback to human tutors
enhances their performance. However, providing such feedback is challenging
because human evaluation of tutor performance is time-consuming. Large
language models, such as the AI-chatbot ChatGPT, hold potential for offering
constructive feedback to tutors in practical settings. Nevertheless, the
accuracy of AI-generated feedback remains uncertain, with scant research
investigating the ability of models like ChatGPT to deliver effective feedback.
In this work-in-progress, we evaluate 30 dialogues generated by GPT-4 in a
tutor-student setting. We use two different prompting approaches, the zero-shot
chain of thought and the few-shot chain of thought, to identify specific
components of effective praise based on five criteria. These approaches are
then compared to the results of human graders for accuracy. Our goal is to
assess the extent to which GPT-4 can accurately identify each praise criterion.
We found that both zero-shot and few-shot chain of thought approaches yield
comparable results. GPT-4 performs moderately well in identifying instances
when the tutor offers specific and immediate praise. However, GPT-4
underperforms in identifying the tutor's ability to deliver sincere praise,
particularly in the zero-shot prompting scenario where examples of sincere
tutor praise statements were not provided. Future work will focus on enhancing
prompt engineering, developing a more general tutoring rubric, and evaluating
our method using real-life tutoring dialogues.
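As a concrete illustration of the two prompting approaches mentioned in the abstract, the sketch below assembles a zero-shot versus few-shot chain-of-thought grading prompt for a single praise criterion. The criterion definitions, rubric wording, and example dialogue are illustrative assumptions (only specific, immediate, and sincere praise are named in the abstract); they are not the authors' actual prompts or rubric.

```python
# Minimal sketch (not the authors' prompts): building zero-shot vs. few-shot
# chain-of-thought prompts for grading one praise criterion in a tutor-student
# dialogue. Criterion wording and examples are illustrative assumptions.

# Three of the five criteria are mentioned in the abstract; the remaining two
# are not named there, so only these placeholders appear here.
CRITERIA = {
    "specific": "The praise names the exact behavior or work the student did well.",
    "immediate": "The praise follows right after the student action it refers to.",
    "sincere": "The praise sounds genuine rather than formulaic or exaggerated.",
}

# A hypothetical worked example pair (invented for illustration only).
FEW_SHOT_EXAMPLES = """\
Example dialogue: Tutor: "Nice job setting up the equation before solving it."
Criterion: specific
Reasoning: The tutor names the exact step (setting up the equation) being praised.
Label: yes
"""

def build_prompt(dialogue: str, criterion: str, few_shot: bool = False) -> str:
    """Assemble a chain-of-thought grading prompt for one praise criterion."""
    parts = [
        "You are grading tutor praise against this criterion:",
        f"{criterion}: {CRITERIA[criterion]}",
    ]
    if few_shot:
        parts.append("Here is a worked example:\n" + FEW_SHOT_EXAMPLES)
    parts += [
        "Dialogue:\n" + dialogue,
        "Think step by step, then answer 'yes' or 'no' on the final line.",
    ]
    return "\n\n".join(parts)

if __name__ == "__main__":
    dialogue = 'Tutor: "Great effort! You checked your work before answering."'
    print(build_prompt(dialogue, "sincere", few_shot=False))  # zero-shot CoT
    print(build_prompt(dialogue, "sincere", few_shot=True))   # few-shot CoT
```
In the few-shot variant, the worked example supplies a model of the criterion being graded, which matters most for sincerity, the criterion the abstract reports GPT-4 handled worst when no examples were provided.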
Related papers
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models [92.66784679667441]
Prometheus 2 is a more powerful evaluator LM that closely mirrors human and GPT-4 judgements.
It is capable of processing both direct assessment and pairwise ranking formats grouped with user-defined evaluation criteria.
On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges.
arXiv Detail & Related papers (2024-05-02T17:59:35Z)
- How Can I Get It Right? Using GPT to Rephrase Incorrect Trainee Responses [2.2077346768771653]
One-on-one tutoring is widely acknowledged as an effective instructional method, provided that tutors are qualified.
The GPT-4 model was employed to build an explanatory feedback system.
This system identifies trainees' responses in binary form (i.e., correct/incorrect) and automatically provides template-based feedback with responses appropriately rephrased by the GPT-4 model.
arXiv Detail & Related papers (2024-05-02T03:18:03Z)
- How Can I Improve? Using GPT to Highlight the Desired and Undesired Parts of Open-ended Responses [11.809647985607935]
We explore a sequence labeling approach focused on identifying components of desired and less desired praise for providing explanatory feedback.
To quantify the quality of highlighted praise components identified by GPT models, we introduced a Modified Intersection over Union (M-IoU) score.
Our findings demonstrate that: (1) the M-IoU score effectively correlates with human judgment in evaluating sequence quality; (2) using two-shot prompting on GPT-3.5 resulted in decent performance in recognizing effort-based and outcome-based praise; and (3) our optimally fine-tuned GPT-3.5 model achieved M-IoU scores of 0.6
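As a rough sketch of how a span-overlap score of this kind can be computed, the snippet below implements plain intersection over union between model-highlighted and human-highlighted token positions; the specific modification behind M-IoU is not described in this summary, so this is an approximation, not the paper's exact metric.

```python
# Sketch only: token-level intersection over union between a model's highlighted
# praise span and a human annotator's span. The paper's "Modified" IoU likely
# differs in details not given in this summary.

def span_iou(predicted: set[int], gold: set[int]) -> float:
    """IoU = |A ∩ B| / |A ∪ B| over highlighted token positions."""
    if not predicted and not gold:
        return 1.0  # both spans empty: treat as perfect agreement (assumption)
    return len(predicted & gold) / len(predicted | gold)

# Example: tokens 3-7 predicted as praise, tokens 4-8 annotated by a human.
print(span_iou(set(range(3, 8)), set(range(4, 9))))  # 4/6 ≈ 0.667
```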
arXiv Detail & Related papers (2024-05-01T02:59:10Z)
- Automated Assessment of Encouragement and Warmth in Classrooms Leveraging Multimodal Emotional Features and ChatGPT [7.273857543125784]
Our work explores a multimodal approach to automatically estimating encouragement and warmth in classrooms.
We employed facial and speech emotion recognition with sentiment analysis to extract interpretable features from video, audio, and transcript data.
We demonstrated our approach on the GTI dataset, comprising 367 16-minute video segments from 92 authentic lesson recordings.
arXiv Detail & Related papers (2024-04-01T16:58:09Z)
- Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL).
Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
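For orientation, the snippet below computes the standard DPO loss for a single preference pair from sequence log-probabilities under the trained policy and a frozen reference model; it reflects the published DPO formulation, not this paper's specific feedback-training setup, and the numbers are placeholders.

```python
import math

# Standard DPO loss for one preference pair (chosen vs. rejected feedback),
# given total sequence log-probabilities under the trained policy and a frozen
# reference model. All values below are made-up placeholders.

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The chosen feedback is favored more by the policy than by the reference,
# so the loss comes out below log(2) ≈ 0.693.
print(dpo_loss(-12.0, -15.0, -13.0, -14.5))
```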
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z)
- ARB: Advanced Reasoning Benchmark for Large Language Models [94.37521840642141]
We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
arXiv Detail & Related papers (2023-07-25T17:55:19Z)
- Distilling ChatGPT for Explainable Automated Student Answer Assessment [19.604476650824516]
We introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation.
Our experiments show that the proposed method improves the overall QWK score by 11% compared to ChatGPT.
arXiv Detail & Related papers (2023-05-22T12:11:39Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study will prompt the emergence of a general-purpose, reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.