Related papers: Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

URL: http://arxiv.org/abs/2407.05733v1
Date: Mon, 8 Jul 2024 08:37:00 GMT
Title: Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition
Authors: Seungju Kim, Meounggun Jo,
Abstract summary: Large Language Models (LLMs) have shown promise in Automated Essay Scoring (AES) LLMs have shown promise in AES, but their zero-shot and few-shot performance often falls short compared to state-of-the-art models and human raters. This study proposes a novel approach combining LLMs and Comparative Judgment (CJ) for AES, using zero-shot prompting to choose between two essays.
Score: 0.09208007322096534
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large Language Models (LLMs) have shown promise in Automated Essay Scoring (AES), but their zero-shot and few-shot performance often falls short compared to state-of-the-art models and human raters. However, fine-tuning LLMs for each specific task is impractical due to the variety of essay prompts and rubrics used in real-world educational contexts. This study proposes a novel approach combining LLMs and Comparative Judgment (CJ) for AES, using zero-shot prompting to choose between two essays. We demonstrate that a CJ method surpasses traditional rubric-based scoring in essay scoring using LLMs.

Related papers

LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models [0.46040036610482665]
We propose Comparative Essay Scoring (LCES), a method that formulates AES as a pairwise comparison task.<n>Specifically, we instruct LLMs to judge which of two essays is better, collect many such comparisons, and convert them into continuous scores.<n>Experiments using AES benchmark datasets show that LCES outperforms conventional zero-shot methods in accuracy while maintaining computational efficiency.
arXiv Detail & Related papers (2025-05-13T12:26:16Z)
Improve LLM-based Automatic Essay Scoring with Linguistic Features [46.41475844992872]
This paper develops a scoring system capable of handling essays across diverse prompts. Existing methods typically fall into two categories: supervised feature-based approaches and large language model (LLM)-based methods.
arXiv Detail & Related papers (2025-02-13T17:09:52Z)
Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach. This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets. We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
Are Large Language Models Good Essay Graders? [4.134395287621344]
We evaluate Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. We compare the numeric grade provided by the LLMs to human rater-provided scores utilizing the ASAP dataset. ChatGPT tends to be harsher and further misaligned with human evaluations than Llama.
arXiv Detail & Related papers (2024-09-19T23:20:49Z)
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks. This study focuses on the topic of LLMs assist NLP Researchers. To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks [3.58262772907022]
We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other's responses to produce a ranking. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge.
arXiv Detail & Related papers (2024-06-12T19:05:43Z)
Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions [77.66677127535222]
Auto-Arena is an innovative framework that automates the entire evaluation process using LLM-powered agents. In our experiments, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
Unleashing Large Language Models' Proficiency in Zero-shot Essay Scoring [12.66710643199155]
Multi Traits' framework elicits ample potential for large language models. We derive the overall score via trait averaging and min-max scaling. With the help of MTS, the small-sized Llama2-13b-chat substantially outperforms ChatGPT.
arXiv Detail & Related papers (2024-04-07T12:25:35Z)
Can Large Language Models Automatically Score Proficiency of Written Essays? [3.993602109661159]
Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. We test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays.
arXiv Detail & Related papers (2024-03-10T09:39:00Z)
LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models [55.60306377044225]
Large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks. This paper explores two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment. For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring.
arXiv Detail & Related papers (2023-07-15T22:02:12Z)
Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization. We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.