Many Hands Make Light Work: Using Essay Traits to Automatically Score
Essays
- URL: http://arxiv.org/abs/2102.00781v1
- Date: Mon, 1 Feb 2021 11:31:09 GMT
- Title: Many Hands Make Light Work: Using Essay Traits to Automatically Score
Essays
- Authors: Rahul Kumar, Sandeep Mathias, Sriparna Saha, Pushpak Bhattacharyya
- Abstract summary: We describe a way to score essays holistically using a multi-task learning (MTL) approach.
We compare our results with a single-task learning (STL) approach, using both LSTMs and BiLSTMs.
We find that the MTL-based BiLSTM system gives the best results for scoring the essay holistically, while also performing well on scoring the essay traits.
- Score: 41.851075178681015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most research in the area of automatic essay grading (AEG) is geared towards
scoring the essay holistically while there has also been some work done on
scoring individual essay traits. In this paper, we describe a way to score
essays holistically using a multi-task learning (MTL) approach, where scoring
the essay holistically is the primary task, and scoring the essay traits is the
auxiliary task. We compare our results with a single-task learning (STL)
approach, using both LSTMs and BiLSTMs. We also compare our results of the
auxiliary task with such tasks done in other AEG systems. To find out which
traits work best for different types of essays, we conduct ablation tests for
each of the essay traits. We also report the runtime and number of training
parameters for each system. We find that the MTL-based BiLSTM system gives the
best results for scoring the essay holistically, while also performing well on
scoring the essay traits.
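To make the setup concrete, below is a minimal sketch (not the authors' released code) of a multi-task BiLSTM essay scorer in PyTorch: the primary head regresses the holistic score while an auxiliary head regresses the trait scores, and both are trained with a jointly weighted MSE loss. The layer sizes, mean pooling, sigmoid output scaling, and the 0.5 auxiliary loss weight are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class MTLBiLSTMScorer(nn.Module):
    """Multi-task BiLSTM: holistic score (primary) + trait scores (auxiliary)."""
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=100, num_traits=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Primary task: holistic essay score, normalized to [0, 1].
        self.holistic_head = nn.Linear(2 * hidden_dim, 1)
        # Auxiliary task: one regression output per essay trait.
        self.trait_head = nn.Linear(2 * hidden_dim, num_traits)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)       # (batch, seq, embed)
        outputs, _ = self.bilstm(embedded)         # (batch, seq, 2 * hidden)
        pooled = outputs.mean(dim=1)               # mean-pool over timesteps
        holistic = torch.sigmoid(self.holistic_head(pooled)).squeeze(-1)
        traits = torch.sigmoid(self.trait_head(pooled))
        return holistic, traits

# One joint training step on a dummy batch; the 0.5 weight on the auxiliary
# trait loss is an illustrative assumption.
model = MTLBiLSTMScorer(vocab_size=10000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

tokens = torch.randint(1, 10000, (8, 300))   # 8 dummy essays, 300 tokens each
holistic_gold = torch.rand(8)                # normalized holistic scores
trait_gold = torch.rand(8, 4)                # normalized trait scores

optimizer.zero_grad()
holistic_pred, trait_pred = model(tokens)
loss = mse(holistic_pred, holistic_gold) + 0.5 * mse(trait_pred, trait_gold)
loss.backward()
optimizer.step()
```

A single-task (STL) baseline corresponds to dropping the trait head and the auxiliary loss term, leaving only the holistic regression.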
Related papers
- Hey AI Can You Grade My Essay?: Automatic Essay Grading [1.03590082373586]
We introduce a new model that outperforms the state-of-the-art models in the field of automatic essay grading (AEG).
We use the concept of collaborative and transfer learning, where one network checks the grammatical and structural features of an essay's sentences, while another scores the overall idea present in the essay.
Our proposed model has shown the highest accuracy of 85.50%.
arXiv Detail & Related papers (2024-10-12T01:17:55Z) - Are Large Language Models Good Essay Graders? [4.134395287621344]
We evaluate Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading.
We compare the numeric grade provided by the LLMs to human rater-provided scores utilizing the ASAP dataset.
ChatGPT tends to be harsher and more misaligned with human evaluations than Llama.
arXiv Detail & Related papers (2024-09-19T23:20:49Z) - Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression [27.152245569974678]
We develop two models that automatically score English essays across multiple dimensions.
Our systems achieve impressive performance in evaluation using three criteria: precision, F1 score, and Quadratic Weighted Kappa.
arXiv Detail & Related papers (2024-06-03T10:59:50Z) - The Power of Summary-Source Alignments [62.76959473193149]
Multi-document summarization (MDS) is a challenging task, often decomposed to subtasks of salience and redundancy detection.
alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data.
This paper proposes extending the summary-source alignment framework by applying it at the more fine-grained proposition span level.
arXiv Detail & Related papers (2024-06-02T19:35:19Z) - Unleashing Large Language Models' Proficiency in Zero-shot Essay Scoring [12.66710643199155]
The Multi Trait Specialization (MTS) framework elicits ample potential from large language models.
We derive the overall score via trait averaging and min-max scaling (a minimal sketch of this step follows the related-papers list below).
With the help of MTS, the small-sized Llama2-13b-chat substantially outperforms ChatGPT.
arXiv Detail & Related papers (2024-04-07T12:25:35Z) - Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring [3.6825890616838066]
Automated essay scoring (AES) aims to score essays written for a given prompt, which defines the writing topic.
Most existing AES systems assume that the essays to be graded come from the same prompt used in training, and assign only a holistic score.
We propose a robust model: prompt- and trait relation-aware cross-prompt essay trait scorer.
arXiv Detail & Related papers (2023-05-26T11:11:19Z) - Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A
Preliminary Study on Writing Assistance [60.40541387785977]
Small foundational models can display remarkable proficiency in tackling diverse tasks when fine-tuned using instruction-driven data.
In this work, we investigate a practical problem setting where the primary focus is on one or a few particular tasks rather than general-purpose instruction following.
Experimental results show that fine-tuning LLaMA on writing instruction data significantly improves its ability on writing tasks.
arXiv Detail & Related papers (2023-05-22T16:56:44Z) - AI, write an essay for me: A large-scale comparison of human-written
versus ChatGPT-generated essays [66.36541161082856]
ChatGPT and similar generative AI models have attracted hundreds of millions of users.
This study compares human-written versus ChatGPT-generated argumentative student essays.
arXiv Detail & Related papers (2023-04-24T12:58:28Z) - "It's a Match!" -- A Benchmark of Task Affinity Scores for Joint
Learning [74.14961250042629]
While the promise of Multi-Task Learning (MTL) is attractive, characterizing the conditions of its success is still an open problem in Deep Learning.
Estimating task affinity for joint learning is a key endeavor.
Recent work suggests that the training conditions themselves have a significant impact on the outcomes of MTL.
Yet, the literature lacks a benchmark to assess the effectiveness of task affinity estimation techniques.
arXiv Detail & Related papers (2023-01-07T15:16:35Z) - My Teacher Thinks The World Is Flat! Interpreting Automatic Essay
Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone to even common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)
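As referenced in the MTS entry above, the following is a minimal sketch of deriving an overall essay score via trait averaging and min-max scaling. The trait scores, rubric ranges, and holistic range are hypothetical examples, not values from that paper.

```python
def overall_score(trait_scores, trait_ranges, holistic_range):
    """Min-max scale each trait to [0, 1], average, then map to the holistic range."""
    scaled = [
        (score - lo) / (hi - lo)
        for score, (lo, hi) in zip(trait_scores, trait_ranges)
    ]
    avg = sum(scaled) / len(scaled)
    h_lo, h_hi = holistic_range
    return h_lo + avg * (h_hi - h_lo)

# Hypothetical example: four traits scored on a 0-5 rubric, holistic range 0-60.
print(overall_score(
    trait_scores=[3.0, 2.5, 4.0, 3.5],
    trait_ranges=[(0, 5)] * 4,
    holistic_range=(0, 60),
))  # -> 39.0
```

Scaling each trait to [0, 1] before averaging keeps traits with wider rubric ranges from dominating the overall score.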