Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings
- URL: http://arxiv.org/abs/2308.02575v1
- Date: Thu, 3 Aug 2023 12:47:17 GMT
- Title: Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings
- Authors: Veronika Hackl, Alexandra Elena Müller, Michael Granitzer, Maximilian Sailer
- Abstract summary: This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
- Score: 63.35165397320137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study investigates the consistency of feedback ratings generated by
OpenAI's GPT-4, a state-of-the-art artificial intelligence language model,
across multiple iterations, time spans and stylistic variations. The model
rated responses to tasks within the Higher Education (HE) subject domain of
macroeconomics in terms of their content and style. Statistical analysis was conducted to assess interrater reliability, the consistency of ratings across iterations, and the correlation between content and style ratings. The results revealed high interrater reliability, with
ICC scores ranging between 0.94 and 0.99 for different timespans, suggesting
that GPT-4 is capable of generating consistent ratings across repetitions with
a clear prompt. Style and content ratings show a high correlation of 0.87. When an inadequate style was applied, the average content ratings remained constant while the style ratings decreased, indicating that the large language model (LLM) effectively distinguishes between these two criteria during evaluation.
The prompt used in this study is also presented and explained. Further
research is necessary to assess the robustness and reliability of AI models in
various use cases.
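To make the reported analysis concrete, the sketch below shows one way such figures could be computed from repeated GPT-4 rating runs: an intraclass correlation coefficient (ICC) across iterations and the Pearson correlation between content and style ratings. This is an illustrative sketch only; the column names and toy values are assumptions, not the study's data or code.

```python
import pandas as pd
import pingouin as pg
from scipy.stats import pearsonr

# Hypothetical long-format data: one row per (student answer, rating iteration).
# Column names and values are illustrative, not the study's actual data.
ratings = pd.DataFrame({
    "answer_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "iteration": ["run1", "run2", "run3"] * 4,
    "content":   [4, 4, 5, 2, 2, 2, 5, 5, 4, 3, 3, 3],
    "style":     [4, 5, 5, 2, 1, 2, 5, 4, 4, 3, 2, 3],
})

# Interrater reliability across iterations (each iteration treated as a "rater").
icc = pg.intraclass_corr(data=ratings, targets="answer_id",
                         raters="iteration", ratings="content")
print(icc[["Type", "ICC", "CI95%"]])

# Correlation between content and style ratings (the paper reports r = 0.87).
r, p = pearsonr(ratings["content"], ratings["style"])
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```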
Related papers
- Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models [1.6874375111244329]
We explore the collaborative dynamics of an innovative language model interaction system involving advanced models.
These models generate and answer complex, PhD-level statistical questions without exact ground-truth answers.
Our study investigates how inter-model consensus enhances the reliability and precision of responses.
arXiv Detail & Related papers (2024-11-25T10:18:17Z) - Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering [5.160473221022088]
This study explores the feasibility of using large language models (LLMs) for automated grading of conceptual questions.
We compared the grading performance of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from the MEEN 361 course at Texas A&M University.
Our analysis reveals that GPT-4o performs well when grading criteria are straightforward but struggles with nuanced answers.
arXiv Detail & Related papers (2024-11-06T04:41:13Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Llamas Know What GPTs Don't Show: Surrogate Models for Confidence
Estimation [70.27452774899189]
Large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user.
As of November 2023, state-of-the-art LLMs do not give access to their internal token probabilities.
Our best method, which composes linguistic confidences with surrogate model probabilities, gives state-of-the-art confidence estimates on all 12 datasets.
arXiv Detail & Related papers (2023-11-15T11:27:44Z) - Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large
Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z) - Investigating Crowdsourcing Protocols for Evaluating the Factual
Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design (a sketch of the standard Best-Worst Scaling scoring rule follows this list).
arXiv Detail & Related papers (2021-09-19T19:05:00Z)
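To illustrate the ranking-based Best-Worst Scaling protocol mentioned in the entry above, the sketch below applies the standard counting rule, score = (#best - #worst) / #appearances, to made-up judgments. It is a minimal illustration and not the cited paper's implementation.

```python
from collections import Counter

def bws_scores(judgments):
    """Standard Best-Worst Scaling counting: for each item,
    score = (#times chosen best - #times chosen worst) / #appearances."""
    best, worst, seen = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in judgments:
        seen.update(items)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Hypothetical annotations: each tuple is (shown summaries, best pick, worst pick).
judgments = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "A", "C"),
    (("A", "B", "C", "D"), "B", "D"),
]
print(bws_scores(judgments))  # A ~ 0.67, B ~ 0.33, C ~ -0.33, D ~ -0.67
```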