Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory
- URL: http://arxiv.org/abs/2507.19980v2
- Date: Tue, 29 Jul 2025 15:46:47 GMT
- Title: Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory
- Authors: Dan Song, Won-Chan Lee, Hong Jiao
- Abstract summary: This study investigates the estimation of reliability for large language models (LLMs) in scoring writing tasks from the AP Chinese Language and Culture Exam. Using generalizability theory, the research evaluates and compares score consistency between human and AI raters. Composite scoring that incorporates both human and AI raters improved reliability, suggesting that hybrid scoring models may offer benefits for large-scale writing assessments.
- Score: 2.5163150839708948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study investigates the estimation of reliability for large language models (LLMs) in scoring writing tasks from the AP Chinese Language and Culture Exam. Using generalizability theory, the research evaluates and compares score consistency between human and AI raters across two types of AP Chinese free-response writing tasks: story narration and email response. These essays were independently scored by two trained human raters and seven AI raters. Each essay received four scores: one holistic score and three analytic scores corresponding to the domains of task completion, delivery, and language use. Results indicate that although human raters produced more reliable scores overall, LLMs demonstrated reasonable consistency under certain conditions, particularly for story narration tasks. Composite scoring that incorporates both human and AI raters improved reliability, suggesting that hybrid scoring models may offer benefits for large-scale writing assessments.
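The reliability estimates above come from generalizability theory. As a rough illustration only (the paper's actual G-study almost certainly includes additional facets such as task type and scoring domain, and the numbers below are invented), the sketch estimates variance components for a fully crossed persons x raters design and computes the generalizability coefficient Eρ² = σ²_p / (σ²_p + σ²_pr,e / n_r) and the dependability coefficient Φ.

```python
import numpy as np

def g_study(scores: np.ndarray):
    """Estimate variance components for a fully crossed persons x raters
    (p x r) design and return the generalizability (E-rho^2) and
    dependability (Phi) coefficients. `scores` has shape (n_persons, n_raters)."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Two-way ANOVA sums of squares
    ss_p = n_r * np.sum((person_means - grand) ** 2)
    ss_r = n_p * np.sum((rater_means - grand) ** 2)
    ss_total = np.sum((scores - grand) ** 2)
    ss_pr = ss_total - ss_p - ss_r  # person x rater interaction confounded with error

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square estimates of the variance components
    var_pr_e = ms_pr
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)

    # D-study coefficients for n_r raters (relative vs. absolute error)
    e_rho2 = var_p / (var_p + var_pr_e / n_r)         # generalizability
    phi = var_p / (var_p + (var_r + var_pr_e) / n_r)  # dependability
    return var_p, var_r, var_pr_e, e_rho2, phi


# Hypothetical example: 5 essays, each scored by 2 raters on a 0-6 scale
scores = np.array([[4, 5], [3, 3], [6, 5], [2, 3], [5, 5]], dtype=float)
print(g_study(scores))
```

In this framing, a composite of human and AI raters raises reliability because averaging over more raters shrinks the error terms divided by n_r.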
Related papers
- Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education [0.30158609733245967]
Five advanced Large Language Models (LLMs), Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B, were investigated for automated essay scoring in a higher education context. A total of 67 Italian-language student essays were evaluated using a four-criterion rubric. Human-LLM agreement was consistently low and non-significant, and within-model reliability across replications was similarly weak.
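The abstract snippet does not say which agreement statistics were used; quadratically weighted kappa for human-LLM agreement and a run-to-run correlation for within-model reliability are common choices. A minimal sketch with invented rubric scores:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical integer rubric scores on a single 1-5 criterion
human = [4, 3, 5, 2, 4, 3, 5, 1]
llm_run1 = [3, 3, 4, 2, 5, 3, 4, 2]
llm_run2 = [4, 2, 4, 3, 5, 3, 4, 1]

# Human-LLM agreement: quadratic weights penalize large disagreements more
qwk = cohen_kappa_score(human, llm_run1, weights="quadratic")

# Within-model reliability: consistency across replications of the same model
run_to_run_r = np.corrcoef(llm_run1, llm_run2)[0, 1]

print(f"human-LLM QWK: {qwk:.2f}, run-to-run r: {run_to_run_r:.2f}")
```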
arXiv Detail & Related papers (2025-08-04T14:02:12Z)
- Machine-assisted writing evaluation: Exploring pre-trained language models in analyzing argumentative moves [28.01557438111706]
The study investigates the efficacy of pre-trained language models (PLMs) in analyzing argumentative moves in a longitudinal learner corpus. A longitudinal corpus of 1643 argumentative texts from 235 English learners in China is collected and annotated into six move types. The results indicate a robust reliability of PLMs in analyzing argumentative moves, with an overall F1 score of 0.743, surpassing existing models in the field.
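An "overall F1 score" over six move types is typically a macro or weighted average across classes. A small sketch with invented move labels (the paper's actual label set and data are not shown in the snippet):

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical gold and predicted move labels
MOVES = ["claim", "data", "counterclaim", "rebuttal", "warrant", "backing"]
y_true = ["claim", "data", "claim", "rebuttal", "warrant", "data", "backing"]
y_pred = ["claim", "data", "claim", "counterclaim", "warrant", "data", "claim"]

print("macro F1:", f1_score(y_true, y_pred, labels=MOVES, average="macro", zero_division=0))
print("weighted F1:", f1_score(y_true, y_pred, labels=MOVES, average="weighted", zero_division=0))
print(classification_report(y_true, y_pred, labels=MOVES, zero_division=0))
```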
arXiv Detail & Related papers (2025-03-25T02:21:12Z)
- Evaluating AI-Generated Essays with GRE Analytical Writing Assessment [15.993966092824335]
This study examines essays generated by ten leading LLMs for the analytical writing assessment of the Graduate Record Exam (GRE).
We assessed these essays using both human raters and the e-rater automated scoring engine as used in the GRE scoring pipeline.
The top-performing models, Gemini and GPT-4o, received average scores of 4.78 and 4.67, respectively.
arXiv Detail & Related papers (2024-10-22T21:30:58Z)
- Are Large Language Models Good Essay Graders? [4.134395287621344]
We evaluate Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading.
We compare the numeric grades provided by the LLMs to human-rater scores using the ASAP dataset.
ChatGPT tends to be harsher and less aligned with human evaluations than Llama.
arXiv Detail & Related papers (2024-09-19T23:20:49Z)
- Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers [25.268709339109893]
We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories.
We work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models).
We compare GPT-4, Claude-2.1, and LLama-2-70B and find that all three models make faithfulness mistakes in over 50% of summaries.
arXiv Detail & Related papers (2024-03-02T01:52:14Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
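A minimal sketch of a multi-agent debate loop in the spirit of ChatEval, not its actual implementation: each agent persona sees the task plus the other referees' latest remarks and then comments. `query_llm` is a hypothetical text-in/text-out model call supplied by the caller.

```python
from typing import Callable, List

def multi_agent_debate(question: str,
                       response: str,
                       personas: List[str],
                       query_llm: Callable[[str], str],
                       rounds: int = 2) -> List[str]:
    """One-by-one debate: each persona speaks in turn for a fixed number of
    rounds, conditioning on the most recent remarks from the other agents."""
    transcript: List[str] = []
    for r in range(rounds):
        for persona in personas:
            history = "\n".join(transcript[-len(personas):])  # latest remarks
            prompt = (
                f"You are {persona}. Round {r + 1}.\n"
                f"Question: {question}\n"
                f"Candidate response: {response}\n"
                f"Other referees said:\n{history or '(nothing yet)'}\n"
                "Discuss the quality of the response and give a 1-10 score."
            )
            transcript.append(f"{persona}: {query_llm(prompt)}")
    return transcript
```

The final verdict can then be aggregated from the last round, for example by averaging the scores the agents report.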
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs). We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
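A sketch of what that rescaling step might look like, under the stated assumptions that `query_llm` is a hypothetical text-in/text-out model call and that the prompt wording below is illustrative rather than the paper's:

```python
import re
from typing import Callable

def rescale_judgment(rubric: str, likert: int, explanation: str,
                     query_llm: Callable[[str], str]) -> float:
    """Prompt an LLM to map a Likert rating plus its free-text explanation
    onto a 0-100 score anchored in a shared rubric, then parse the number."""
    prompt = (
        f"Scoring rubric:\n{rubric}\n\n"
        f"An annotator gave a Likert rating of {likert}/5 and explained:\n"
        f"\"{explanation}\"\n\n"
        "Following the rubric, output a single number from 0 to 100."
    )
    reply = query_llm(prompt)
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else float("nan")
```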
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total scores of LLMs, including GPT-4, ChatGPT, and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that comprehensively compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
- Human-in-the-loop Abstractive Dialogue Summarization [61.4108097664697]
We propose to incorporate different levels of human feedback into the training process.
This will enable us to guide the models to capture the behaviors humans care about for summaries.
arXiv Detail & Related papers (2022-12-19T19:11:27Z)
- Analyzing and Evaluating Faithfulness in Dialogue Summarization [67.07947198421421]
We first perform a fine-grained human analysis on the faithfulness of dialogue summaries and observe that over 35% of generated summaries are factually inconsistent with the source dialogues.
We present a new model-level faithfulness evaluation method. It examines generation models with multi-choice questions created by rule-based transformations.
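A toy illustration of rule-based transformations for building such multi-choice probes; the paper's actual rule set is richer, and the negation/entity-swap rules and example sentence below are invented:

```python
import random

def make_multiple_choice(faithful_sentence: str, entities: list[str]) -> dict:
    """Build one multi-choice faithfulness question: the faithful sentence is
    the key, and distractors come from simple negation and entity swapping."""
    distractors = [
        faithful_sentence.replace(" will ", " will not ")
        if " will " in faithful_sentence
        else "It is not the case that " + faithful_sentence[0].lower() + faithful_sentence[1:]
    ]
    for wrong in entities:
        if wrong not in faithful_sentence:
            # Swap the subject for an entity that did not say/do this
            distractors.append(f"{wrong} " + faithful_sentence.split(" ", 1)[1])
            break
    options = distractors + [faithful_sentence]
    random.shuffle(options)
    return {"question": "Which statement is consistent with the dialogue?",
            "options": options,
            "answer": options.index(faithful_sentence)}

# Example usage
print(make_multiple_choice("Alice will send the report on Friday.", ["Bob", "Carol"]))
```

A generation model is then judged by how often it assigns the highest likelihood (or picks) the faithful option.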
arXiv Detail & Related papers (2022-10-21T07:22:43Z)