Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression
- URL: http://arxiv.org/abs/2406.01198v1
- Date: Mon, 3 Jun 2024 10:59:50 GMT
- Title: Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression
- Authors: Kun Sun, Rong Wang
- Abstract summary: We develop two models that automatically score English essays across multiple dimensions.
Our systems achieve impressive performance in evaluation using three criteria: precision, F1 score, and Quadratic Weighted Kappa.
- Score: 27.152245569974678
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automated essay scoring (AES) involves predicting a score that reflects the writing quality of an essay. Most existing AES systems produce only a single overall score. However, users and L2 learners expect scores across different dimensions (e.g., vocabulary, grammar, coherence) for English essays in real-world applications. To address this need, we have developed two models that automatically score English essays across multiple dimensions by employing fine-tuning and other strategies on two large datasets. The results demonstrate that our systems achieve impressive performance in evaluation using three criteria: precision, F1 score, and Quadratic Weighted Kappa. Furthermore, our system outperforms existing methods in overall scoring.
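Of the three evaluation criteria named in the abstract, Quadratic Weighted Kappa (QWK) is the least standard; it measures agreement between predicted and human ratings while penalizing large disagreements quadratically. A minimal reference implementation (a generic sketch, not the authors' code) looks like this:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK between two lists of integer ratings in [0, n_classes)."""
    # Observed agreement matrix (confusion matrix of ratings).
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Quadratic penalty weights: 0 on the diagonal, growing with squared distance.
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected matrix under chance agreement (outer product of the marginals).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Perfect agreement gives 1.0, chance-level agreement gives values near 0, and systematic disagreement is negative.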
Related papers
- Unleashing Large Language Models' Proficiency in Zero-shot Essay Scoring [12.66710643199155]
The Multi Traits (MTS) framework elicits ample potential from large language models.
We derive the overall score via trait averaging and min-max scaling.
With the help of MTS, the small-sized Llama2-13b-chat substantially outperforms ChatGPT.
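The overall-score derivation described in this entry (trait averaging followed by min-max scaling) can be sketched as below; the trait range 1-5 and output range 0-100 are illustrative assumptions, not values from the paper:

```python
def overall_score(trait_scores, trait_min=1.0, trait_max=5.0,
                  out_min=0.0, out_max=100.0):
    # Average the per-trait scores, then min-max scale into the output range.
    avg = sum(trait_scores) / len(trait_scores)
    return out_min + (avg - trait_min) * (out_max - out_min) / (trait_max - trait_min)
```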
arXiv Detail & Related papers (2024-04-07T12:25:35Z) - Can Large Language Models Automatically Score Proficiency of Written Essays? [3.993602109661159]
Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks.
We test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays.
arXiv Detail & Related papers (2024-03-10T09:39:00Z) - Review of feedback in Automated Essay Scoring [6.445605125467574]
The first automated essay scoring system was developed 50 years ago.
This paper reviews research on feedback including different feedback types and essay traits on automated essay scoring.
arXiv Detail & Related papers (2023-07-09T11:04:13Z) - Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that provides comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose UniEval, a unified multi-dimensional evaluator for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
arXiv Detail & Related papers (2022-10-13T17:17:03Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z) - Many Hands Make Light Work: Using Essay Traits to Automatically Score Essays [41.851075178681015]
We describe a way to score essays holistically using a multi-task learning (MTL) approach.
We compare our results with a single-task learning (STL) approach, using both LSTMs and BiLSTMs.
We find that the MTL-based BiLSTM system gives the best results for scoring the essay holistically, while also performing well on scoring individual essay traits.
arXiv Detail & Related papers (2021-02-01T11:31:09Z) - My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone to even common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that, since the models are not semantically grounded with world knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z) - Get It Scored Using AutoSAS -- An Automated System for Scoring Short Answers [63.835172924290326]
We present a fast, scalable, and accurate approach to automated Short Answer Scoring (SAS).
We propose and explain the design and development of a system for SAS, namely AutoSAS.
AutoSAS shows state-of-the-art performance and achieves better results by over 8% in some of the question prompts.
arXiv Detail & Related papers (2020-12-21T10:47:30Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.