Related papers: Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring

URL: http://arxiv.org/abs/2504.05736v1
Date: Tue, 08 Apr 2025 07:10:51 GMT
Title: Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring
Authors: Yida Cai, Kun Liang, Sanwoo Lee, Qinghan Wang, Yunfang Wu
Abstract summary: We propose Rank-Then-Score (RTS), a fine-tuning framework based on large language models to enhance their essay scoring capabilities. Experimental results on two benchmark datasets, HSK and ASAP, demonstrate that RTS consistently outperforms the direct prompting (Vanilla) method in terms of average QWK.
Score: 6.459215652021233
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, large language models (LLMs) have achieved remarkable success across a variety of tasks. However, their potential in the domain of Automated Essay Scoring (AES) remains largely underexplored. Moreover, compared to English data, methods for Chinese AES are not well developed. In this paper, we propose Rank-Then-Score (RTS), a fine-tuning framework based on large language models to enhance their essay scoring capabilities. Specifically, we fine-tune the ranking model (Ranker) with feature-enriched data, and then feed the output of the ranking model, in the form of a candidate score set, together with the essay content into the scoring model (Scorer) to produce the final score. Experimental results on two benchmark datasets, HSK and ASAP, demonstrate that RTS consistently outperforms the direct prompting (Vanilla) method in terms of average QWK across all LLMs and datasets, and achieves the best performance on Chinese essay scoring using the HSK dataset.
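
The abstract reports results as average QWK (quadratic weighted kappa), an agreement measure between predicted and reference scores that penalizes large disagreements more heavily than small ones. The following is a minimal sketch of the standard QWK definition, not the authors' evaluation code; the function name, its parameters, and the convention for the degenerate case are assumptions made for illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores.

    Hypothetical helper for illustration; scores must lie in
    [min_score, max_score] and both lists must have the same length.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("need at least two possible score values")

    # Observed confusion matrix: rows are reference scores, columns predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # Assumption: if both raters use one identical score throughout, the
    # expected disagreement is zero; this sketch treats that as agreement.
    if denominator == 0:
        return 1.0
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    reference = [1, 2, 3, 4]
    print(quadratic_weighted_kappa(reference, [1, 2, 3, 4], 1, 4))
    print(quadratic_weighted_kappa(reference, [4, 3, 2, 1], 1, 4))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. An average QWK, as reported in the abstract, would presumably be the mean of such values over essay prompts, though the exact averaging scheme is not stated here.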

Related papers

Integrated ensemble of BERT- and features-based models for authorship attribution in Japanese literary works [2.624902795082451]
Authorship attribution (AA) tasks rely on statistical data analysis and classification based on stylistic features extracted from texts. In this study, we aimed to significantly improve performance using an integrated ensemble of traditional feature-based and modern PLM-based methods on an AA task in a small sample.
arXiv Detail & Related papers (2025-04-11T13:40:50Z)
RDBE: Reasoning Distillation-Based Evaluation Enhances Automatic Essay Scoring [0.0]
Reasoning Distillation-Based Evaluation (RDBE) integrates interpretability to elucidate the rationale behind model scores. Our experimental results demonstrate the efficacy of RDBE across all scoring rubrics considered in the dataset.
arXiv Detail & Related papers (2024-07-03T05:49:01Z)
Unleashing Large Language Models' Proficiency in Zero-shot Essay Scoring [12.66710643199155]
The Multi Trait Specialization (MTS) framework elicits ample essay scoring potential from large language models. We derive the overall score via trait averaging and min-max scaling. With the help of MTS, the small-sized Llama2-13b-chat substantially outperforms ChatGPT.
arXiv Detail & Related papers (2024-04-07T12:25:35Z)
MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering [64.6741991162092]
We present MinPrompt, a minimal data augmentation framework for open-domain question answering. We transform the raw text into a graph structure to build connections between different factual sentences. We then apply graph algorithms to identify the minimal set of sentences needed to cover the most information in the raw text. We generate QA pairs based on the identified sentence subset and train the model on the selected sentences to obtain the final model.
arXiv Detail & Related papers (2023-10-08T04:44:36Z)
Zero-Shot Listwise Document Reranking with a Large Language Model [58.64141622176841]
We propose Listwise Reranker with a Large Language Model (LRL), which achieves strong reranking effectiveness without using any task-specific training data. Experiments on three TREC web search datasets demonstrate that LRL not only outperforms zero-shot pointwise methods when reranking first-stage retrieval results, but can also act as a final-stage reranker.
arXiv Detail & Related papers (2023-05-03T14:45:34Z)
Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data. We design a simple but effective ensemble-based framework that combines various transfer learning techniques. We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task [80.22825549235556]
We present our submission to the sentence-level MQM benchmark at Quality Estimation Shared Task, named UniTE. Specifically, our systems employ the framework of UniTE, which combined three types of input formats during training with a pre-trained language model. Results show that our models reach 1st overall ranking in the Multilingual and English-Russian settings, and 2nd overall ranking in English-German and Chinese-English settings.
arXiv Detail & Related papers (2022-10-18T08:55:27Z)
Improving Performance of Automated Essay Scoring by using back-translation essays and adjusted scores [0.0]
We propose a method to increase the number of essay-score pairs using back-translation and score adjustment. We evaluate the effectiveness of the augmented data using models from prior work. The performance of the models was improved by using augmented data to train the models.
arXiv Detail & Related papers (2022-03-01T11:05:43Z)
From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance. The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer. The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z)
Ranking Creative Language Characteristics in Small Data Scenarios [52.00161818003478]
We adapt the DirectRanker to provide a new deep model for ranking creative language with small data. Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small datasets, DirectRanker remains effective.
arXiv Detail & Related papers (2020-10-23T18:57:47Z)
Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics. We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.