Are Large Language Models Good Essay Graders?
- URL: http://arxiv.org/abs/2409.13120v1
- Date: Thu, 19 Sep 2024 23:20:49 GMT
- Title: Are Large Language Models Good Essay Graders?
- Authors: Anindita Kundu, Denilson Barbosa
- Abstract summary: We evaluate Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading.
We compare the numeric grade provided by the LLMs to human rater-provided scores utilizing the ASAP dataset.
ChatGPT tends to be harsher and more misaligned with human evaluations than Llama.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We evaluate the effectiveness of Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. More precisely, we evaluate ChatGPT and Llama in the Automated Essay Scoring (AES) task, a crucial natural language processing (NLP) application in Education. We consider both zero-shot and few-shot learning and different prompting approaches. We compare the numeric grades provided by the LLMs to human rater-provided scores using the ASAP dataset, a well-known benchmark for the AES task. Our research reveals that both LLMs generally assign lower scores compared to those provided by the human raters; moreover, those scores do not correlate well with those provided by the humans. In particular, ChatGPT tends to be harsher and more misaligned with human evaluations than Llama. We also experiment with a number of essay features commonly used by previous AES methods, related to length, usage of connectives and transition words, and readability metrics, including the number of spelling and grammar mistakes. We find that, generally, none of these features correlates strongly with human or LLM scores. Finally, we report results on Llama 3, which are generally better across the board, as expected. Overall, while LLMs do not seem an adequate replacement for human grading, our results are somewhat encouraging for their use as a tool to assist humans in the grading of written essays in the future.
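To make the evaluation setup concrete, the following minimal Python sketch (not the authors' code) mirrors the comparison described in the abstract: a zero-shot grading prompt, agreement between LLM and human scores measured with Pearson correlation and quadratic weighted kappa (the standard agreement metric reported on ASAP), and one simple essay feature (length) checked against human scores. The prompt wording and all score values are illustrative placeholders, not data from the paper.

```python
# Minimal sketch, not the authors' code: compare LLM-assigned essay scores with
# human ASAP scores. All prompts and numbers below are illustrative assumptions.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score


def build_zero_shot_prompt(essay: str, min_score: int, max_score: int) -> str:
    """Hypothetical zero-shot grading prompt; the paper's exact wording differs."""
    return (
        f"Grade the following essay on a scale from {min_score} to {max_score}. "
        f"Reply with a single integer.\n\nEssay:\n{essay}"
    )


# Toy stand-ins for one ASAP essay set; in practice llm_scores would be parsed
# from ChatGPT or Llama responses to build_zero_shot_prompt(...) for each essay.
human_scores = [8, 6, 9, 7, 5, 10, 6, 8]
llm_scores = [6, 5, 8, 6, 4, 8, 5, 7]                     # illustrative only
essay_lengths = [420, 310, 510, 390, 260, 560, 300, 450]  # word counts (toy)

r_llm, _ = pearsonr(human_scores, llm_scores)
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
r_len, _ = pearsonr(human_scores, essay_lengths)

print(f"human vs LLM:    Pearson r = {r_llm:.3f}, QWK = {qwk:.3f}")
print(f"human vs length: Pearson r = {r_len:.3f}")
```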
Related papers
- Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition
Large Language Models (LLMs) have shown promise in Automated Essay Scoring (AES), but their zero-shot and few-shot performance often falls short of state-of-the-art models and human raters.
This study proposes a novel approach combining LLMs and Comparative Judgment (CJ) for AES, using zero-shot prompting to choose between two essays.
arXiv Detail & Related papers (2024-07-08T08:37:00Z)
- Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring
Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments.
While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear.
This paper uncovers the grading rubrics that LLMs use to score students' written responses to science tasks and examines how well the resulting scores align with human scores.
arXiv Detail & Related papers (2024-07-04T22:26:20Z)
- Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition.
MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs.
The pairwise comparison results are then aggregated into a global ranking using the Elo rating system, as sketched below.
arXiv Detail & Related papers (2024-04-10T01:26:24Z)
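For context on the aggregation step mentioned above, here is a minimal sketch of folding pairwise comparison outcomes into Elo ratings; the K-factor, starting rating, model names, and outcomes are assumptions for illustration, not details from the paper.

```python
# Minimal sketch (assumptions only, not the paper's implementation): aggregate
# pairwise comparison outcomes into a global ranking via standard Elo updates.
from collections import defaultdict


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update; score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta


# Hypothetical pairwise results: (model_a, model_b, outcome for model_a).
comparisons = [("llm_1", "llm_2", 1.0), ("llm_2", "llm_3", 0.5), ("llm_1", "llm_3", 1.0)]

ratings = defaultdict(lambda: 1000.0)  # every model starts at an assumed 1000
for a, b, outcome in comparisons:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")
```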
- Can Large Language Models Automatically Score Proficiency of Written Essays?
Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks.
We test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays.
arXiv Detail & Related papers (2024-03-10T09:39:00Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study on different datasets shows CoAnnotating to be an effective means of allocating work, with up to a 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models
Large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks.
This paper explores two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment.
For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring.
arXiv Detail & Related papers (2023-07-15T22:02:12Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Can Large Language Models Be an Alternative to Human Evaluations?
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
- Benchmarking Large Language Models for News Summarization
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their success are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.