Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems
- URL: http://arxiv.org/abs/2007.06796v5
- Date: Sun, 14 Nov 2021 15:11:00 GMT
- Title: Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems
- Authors: Anubha Kabra, Mehar Bhatia, Yaman Kumar, Junyi Jessy Li, Rajiv Ratn Shah
- Abstract summary: We evaluate the current state-of-the-art AES models using a model-agnostic adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
- Score: 64.4896118325552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic scoring engines have been used for scoring approximately fifteen
million test-takers in just the last three years. This number is increasing
further due to COVID-19 and the associated automation of education and testing.
Despite such wide usage, the AI-based testing literature of these "intelligent"
models is highly lacking. Most of the papers proposing new models rely only on
quadratic weighted kappa (QWK) based agreement with human raters for showing
model efficacy. However, this effectively ignores the highly multi-feature
nature of essay scoring. Essay scoring depends on features like coherence,
grammar, relevance, sufficiency, and vocabulary. To date, there has been no
study testing Automated Essay Scoring (AES) systems holistically on all these
features. With this motivation, we propose a model agnostic adversarial
evaluation scheme and associated metrics for AES systems to test their natural
language understanding capabilities and overall robustness. We evaluate the
current state-of-the-art AES models using the proposed scheme and report the
results on five recent models. These models range from
feature-engineering-based approaches to the latest deep learning algorithms. We
find that AES models are highly overstable. Even heavy modifications (as much
as 25%) with content unrelated to the topic of the questions do not decrease the
score produced by the models. On the other hand, irrelevant content, on
average, increases the scores, thus showing that the model evaluation strategy
and rubrics should be reconsidered. We also ask 200 human raters to score both
an original and an adversarial response, to see whether humans can detect the
differences between the two and whether they agree with the scores assigned by
the automatic scoring models.
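The two quantities at the center of the abstract, QWK agreement with human raters and the score shift under heavy off-topic modification, can be sketched in a few lines. This is a minimal sketch: the `score_essay` callable and the sentence-level padding strategy are illustrative assumptions, not the toolkit's actual implementation.

```python
# Sketch of (1) QWK agreement with human raters and (2) an overstability probe
# that appends topic-unrelated content and measures the score change.
# `score_essay` is a hypothetical stand-in for any AES model under test.
import random
from sklearn.metrics import cohen_kappa_score  # QWK via weights="quadratic"

def quadratic_weighted_kappa(human_scores, model_scores):
    """Agreement metric most AES papers report; integer score bins assumed."""
    return cohen_kappa_score(human_scores, model_scores, weights="quadratic")

def overstability_probe(essay, score_essay, irrelevant_pool, fraction=0.25, seed=0):
    """Append off-topic sentences amounting to roughly `fraction` of the essay
    and return (original_score, modified_score). A robust scorer should drop."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in essay.split(".") if s.strip()]
    n_added = max(1, int(fraction * len(sentences)))
    padding = rng.sample(irrelevant_pool, min(n_added, len(irrelevant_pool)))
    modified = essay + " " + ". ".join(padding) + "."
    return score_essay(essay), score_essay(modified)
```

Under such a probe, overstability shows up as a modified score that matches or exceeds the original even though roughly a quarter of the response is unrelated to the prompt.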
Related papers
- Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
- Auditing an Automatic Grading Model with deep Reinforcement Learning [0.0]
We explore the use of deep reinforcement learning to audit an automatic short answer grading (ASAG) model.
We show that a high level of agreement to human ratings does not give sufficient evidence that an ASAG model is infallible.
arXiv Detail & Related papers (2024-05-11T20:07:09Z)
- Transformer-based Joint Modelling for Automatic Essay Scoring and Off-Topic Detection [3.609048819576875]
We propose an unsupervised technique that jointly scores essays and detects off-topic essays.
Our proposed method outperforms the baseline we created and earlier conventional methods on two essay-scoring datasets.
arXiv Detail & Related papers (2024-03-24T21:44:14Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- A Study on the Evaluation of Generative Models [19.18642459565609]
Implicit generative models, which do not return likelihood values, have become prevalent in recent years.
In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset.
Our study shows that while FID and IS do correlate to several f-divergences, their ranking of close models can vary considerably.
arXiv Detail & Related papers (2022-06-22T09:27:31Z)
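For reference, the FID mentioned in that entry compares two sets of feature vectors through Gaussians fitted to each. A minimal sketch follows, assuming the features have already been extracted by whatever backbone the caller prefers; the function name is illustrative.

```python
# Frechet Inception Distance between two feature sets, assuming features were
# already extracted (e.g., from a pretrained image classifier).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```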
- Improving Performance of Automated Essay Scoring by using back-translation essays and adjusted scores [0.0]
We propose a method to increase the number of essay-score pairs using back-translation and score adjustment.
We evaluate the effectiveness of the augmented data using models from prior work.
Training the models on the augmented data improved their performance.
arXiv Detail & Related papers (2022-03-01T11:05:43Z)
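A minimal sketch of the back-translation augmentation that entry describes, using publicly available MarianMT checkpoints as an illustrative translation backend; the pivot language, model names, and the omitted score-adjustment step are assumptions rather than the paper's exact recipe.

```python
# Back-translate essays (English -> pivot -> English) to create paraphrased
# training examples; the cited paper additionally adjusts the scores attached
# to these paraphrases, which is not reproduced here.
from transformers import MarianMTModel, MarianTokenizer

def _translate(texts, model_name):
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**batch)
    return [tok.decode(t, skip_special_tokens=True) for t in outputs]

def back_translate(essays, pivot="de"):
    """Round-trip translation; the outputs are paraphrases of the inputs."""
    pivoted = _translate(essays, f"Helsinki-NLP/opus-mt-en-{pivot}")
    return _translate(pivoted, f"Helsinki-NLP/opus-mt-{pivot}-en")
```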
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
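One plausible shape for the detection-based protection that entry refers to is a lightweight classifier trained to separate clean responses from adversarially modified ones. The TF-IDF features and logistic regression below are illustrative choices, not necessarily what the paper uses.

```python
# Illustrative detection-based defense: a classifier that flags responses
# suspected of carrying oversensitivity/overstability-causing modifications.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_adversarial_detector(clean_responses, modified_responses):
    texts = clean_responses + modified_responses
    labels = [0] * len(clean_responses) + [1] * len(modified_responses)
    detector = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    detector.fit(texts, labels)
    return detector  # detector.predict([...]) flags suspected adversarial inputs
```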
- Get It Scored Using AutoSAS -- An Automated System for Scoring Short Answers [63.835172924290326]
We present a fast, scalable, and accurate approach to automated Short Answer Scoring (SAS).
We propose and explain the design and development of a system for SAS, namely AutoSAS.
AutoSAS shows state-of-the-art performance and improves results by over 8% on some of the question prompts.
arXiv Detail & Related papers (2020-12-21T10:47:30Z)
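In the same feature-engineering spirit, a hedged sketch of a short-answer scorer: a few hand-crafted features fed to a standard regressor. The specific features and the choice of a Random Forest are assumptions for illustration and do not reproduce AutoSAS itself.

```python
# Feature-based short answer scoring sketch: compare each answer to a model
# answer with simple lexical features, then fit a regressor to human scores.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def answer_features(answer, model_answer):
    ans, ref = answer.lower().split(), model_answer.lower().split()
    overlap = len(set(ans) & set(ref)) / max(len(set(ref)), 1)  # reference coverage
    type_token_ratio = len(set(ans)) / max(len(ans), 1)          # lexical diversity
    return [overlap, type_token_ratio, len(ans)]

def train_scorer(answers, model_answer, human_scores):
    X = np.array([answer_features(a, model_answer) for a in answers])
    scorer = RandomForestRegressor(n_estimators=200, random_state=0)
    scorer.fit(X, np.array(human_scores))
    return scorer
```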