Related papers: Prompting Large Language Models for Zero-shot Essay Scoring via Multi-trait Specialization

Prompting Large Language Models for Zero-shot Essay Scoring via Multi-trait Specialization

URL: http://arxiv.org/abs/2404.04941v1
Date: Sun, 7 Apr 2024 12:25:35 GMT
Title: Prompting Large Language Models for Zero-shot Essay Scoring via Multi-trait Specialization
Authors: Sanwoo Lee, Yida Cai, Desong Meng, Ziyang Wang, Yunfang Wu,
Abstract summary: Multi Trait (MTS) is a framework to elicit essay scoring capabilities in large language models (LLMs) With the help of MTS, the small-sized Llama2-13b-chat substantially outperforms ChatGPT, facilitating an effective deployment in real applications.
Score: 12.66710643199155
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Advances in automated essay scoring (AES) have traditionally relied on labeled essays, requiring tremendous cost and expertise for their acquisition. Recently, large language models (LLMs) have achieved great success in various tasks, but their potential is less explored in AES. In this paper, we propose Multi Trait Specialization (MTS), a zero-shot prompting framework to elicit essay scoring capabilities in LLMs. Specifically, we leverage ChatGPT to decompose writing proficiency into distinct traits and generate scoring criteria for each trait. Then, an LLM is prompted to extract trait scores from several conversational rounds, each round scoring one of the traits based on the scoring criteria. Finally, we derive the overall score via trait averaging and min-max scaling. Experimental results on two benchmark datasets demonstrate that MTS consistently outperforms straightforward prompting (Vanilla) in average QWK across all LLMs and datasets, with maximum gains of 0.437 on TOEFL11 and 0.355 on ASAP. Additionally, with the help of MTS, the small-sized Llama2-13b-chat substantially outperforms ChatGPT, facilitating an effective deployment in real applications.

Related papers

Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring [6.459215652021233]
We propose Rank-Then-Score (RTS), a fine-tuning framework based on large language models to enhance their essay scoring capabilities. Experimental results on two benchmark datasets, HSK and ASAP, demonstrate that RTS consistently outperforms the direct prompting (Vanilla) method in terms of average QWK.
arXiv Detail & Related papers (2025-04-08T07:10:51Z)
Teach-to-Reason with Scoring: Self-Explainable Rationale-Driven Multi-Trait Essay Scoring [5.632624116225276]
Multi-trait automated essay scoring (AES) systems provide a fine-grained evaluation of an essay's diverse aspects. Prior systems fail to explain why specific trait scores are assigned. We propose a self-explainable Rationale-Driven Multi-trait automated Essay scoring framework.
arXiv Detail & Related papers (2025-02-28T05:54:23Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and AND''/OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
Rationale Behind Essay Scores: Enhancing S-LLM's Multi-Trait Essay Scoring with Rationale Generated by LLMs [2.324913904215885]
This paper introduces Rationale-based Multiple Trait Scoring (RMTS), a novel approach for multi-trait essay scoring. RMTS integrates prompt-engineering-based large language models (LLMs) with a fine-tuning-based essay scoring model using a smaller large language model (S-LLM) Experiments on benchmark datasets, including ASAP, ASAP++, and Feedback Prize, show that RMTS significantly outperforms state-of-the-art models and vanilla S-LLMs in trait-specific scoring.
arXiv Detail & Related papers (2024-10-18T06:35:17Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition [0.09208007322096534]
Large Language Models (LLMs) have shown promise in Automated Essay Scoring (AES) LLMs have shown promise in AES, but their zero-shot and few-shot performance often falls short compared to state-of-the-art models and human raters. This study proposes a novel approach combining LLMs and Comparative Judgment (CJ) for AES, using zero-shot prompting to choose between two essays.
arXiv Detail & Related papers (2024-07-08T08:37:00Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
Can Large Language Models Automatically Score Proficiency of Written Essays? [3.993602109661159]
Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. We test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays.
arXiv Detail & Related papers (2024-03-10T09:39:00Z)
The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics [36.52897053496835]
generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples. We introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation. We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset.
arXiv Detail & Related papers (2023-10-30T17:55:08Z)
BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS) We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization. We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
Large Language Models are Zero-Shot Reasoners [28.6899375595088]
Chain of thought (CoT) prompting is a technique for eliciting complex multi-step reasoning through step-by-step answer examples. We show that LLMs are decent zero-shot reasoners by simply adding Let's think step by step'' before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances.
arXiv Detail & Related papers (2022-05-24T09:22:26Z)
Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics. We find that AES models are highly overstable. Even heavy modifications(as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.