Towards Trustworthy AutoGrading of Short, Multi-lingual, Multi-type
Answers
- URL: http://arxiv.org/abs/2201.03425v1
- Date: Sun, 2 Jan 2022 12:17:24 GMT
- Title: Towards Trustworthy AutoGrading of Short, Multi-lingual, Multi-type
Answers
- Authors: Johannes Schneider and Robin Richner and Micha Riser
- Abstract summary: This study uses a large dataset consisting of about 10 million question-answer pairs from multiple languages.
We show how to improve the accuracy of automatically graded answers, achieving accuracy equivalent to that of teaching assistants.
- Score: 2.2000998828262652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autograding short textual answers has become much more feasible due to the
rise of NLP and the increased availability of question-answer pairs brought
about by a shift to online education. Autograding performance is still inferior
to human grading. The statistical and black-box nature of state-of-the-art
machine learning models makes them untrustworthy, raising ethical concerns and
limiting their practical utility. Furthermore, the evaluation of autograding is
typically confined to small, monolingual datasets for a specific question type.
This study uses a large dataset of about 10 million question-answer pairs in
multiple languages, covering diverse fields such as math and language and
exhibiting strong variation in question and answer syntax. We demonstrate the
effectiveness of fine-tuning transformer models for autograding on such
complex datasets. Our best hyperparameter-tuned model yields an accuracy of
about 86.5%, comparable to state-of-the-art models that are less general and
more tuned to a specific type of question, subject, and language. More
importantly, we address trust and ethical concerns. By involving humans in the
autograding process, we show how to improve the accuracy of automatically
graded answers, achieving accuracy equivalent to that of teaching assistants.
We also show how teachers can effectively control the type of errors made by
the system and how they can validate efficiently that the autograder's
performance on individual exams is close to the expected performance.
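The abstract describes the approach only at a high level. The sketch below (not the authors' code) illustrates one plausible reading of it: fine-tuning a multilingual transformer to classify (question, answer) pairs as correct or incorrect, then routing low-confidence predictions to human graders so that teachers retain control over errors. The model name `bert-base-multilingual-cased`, the binary labels, the hyperparameters, and the 0.9 confidence threshold are illustrative assumptions, not details reported in the paper.
```python
# Minimal autograding sketch under assumed settings (binary correct/incorrect labels).
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # assumed multilingual base model

class QADataset(Dataset):
    """Encodes each question together with a student answer; label 1 = correct, 0 = incorrect."""
    def __init__(self, questions, answers, labels, tokenizer, max_len=256):
        self.enc = tokenizer(questions, answers, truncation=True,
                             padding="max_length", max_length=max_len)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

def fine_tune(train_ds, eval_ds):
    """Fine-tune the transformer on question-answer pairs (hyperparameters are illustrative)."""
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    args = TrainingArguments(output_dir="autograder", num_train_epochs=3,
                             per_device_train_batch_size=32, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds).train()
    return model

def grade_with_human_fallback(model, tokenizer, question, answer, threshold=0.9):
    """Accept the model's grade only when its confidence exceeds `threshold`;
    otherwise defer to a human grader (trading automation coverage for accuracy)."""
    model.eval()
    inputs = tokenizer(question, answer, truncation=True, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    confidence, label = probs.max(dim=-1)
    if confidence.item() >= threshold:
        return int(label.item()), "auto"
    return None, "needs human review"
```
Raising the threshold sends more answers to human graders and fewer errors through automatically; this is one simple way teachers could control the error types and overall accuracy mentioned in the abstract.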
Related papers
- Towards LLM-based Autograding for Short Textual Answers [4.853810201626855]
This manuscript evaluates a large language model for the purpose of autograding short textual answers.
Our findings suggest that while "out-of-the-box" LLMs provide a valuable tool, their readiness for independent automated grading remains a work in progress.
arXiv Detail & Related papers (2023-09-09T22:25:56Z) - The Devil is in the Errors: Leveraging Large Language Models for
Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - Automatic Short Math Answer Grading via In-context Meta-learning [2.0263791972068628]
We study the problem of automatic short answer grading for students' responses to math questions.
First, we use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model.
Second, we use an in-context learning approach that provides scoring examples as input to the language model.
arXiv Detail & Related papers (2022-05-30T16:26:02Z) - Improving Performance of Automated Essay Scoring by using
back-translation essays and adjusted scores [0.0]
We propose a method to increase the number of essay-score pairs using back-translation and score adjustment.
We evaluate the effectiveness of the augmented data using models from prior work.
Training the models on the augmented data improved their performance.
arXiv Detail & Related papers (2022-03-01T11:05:43Z) - Cheating Automatic Short Answer Grading: On the Adversarial Usage of
Adjectives and Adverbs [0.0]
We devise a black-box adversarial attack tailored to the educational short answer grading scenario to investigate the grading models' robustness.
We observed a loss of prediction accuracy between 10 and 22 percentage points using the state-of-the-art models BERT and T5.
Based on our experiments, we provide recommendations for utilizing automatic grading systems more safely in practice.
arXiv Detail & Related papers (2022-01-20T17:34:33Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And
Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z) - Diverse Complexity Measures for Dataset Curation in Self-driving [80.55417232642124]
We propose a new data selection method that exploits a diverse set of criteria that quantify the interestingness of traffic scenes.
Our experiments show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
arXiv Detail & Related papers (2021-01-16T23:45:02Z) - My Teacher Thinks The World Is Flat! Interpreting Automatic Essay
Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone to even common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that since the models are not semantically grounded with world knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring
Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z) - The World is Not Binary: Learning to Rank with Grayscale Data for
Dialogue Response Selection [55.390442067381755]
We show that grayscale data can be automatically constructed without human effort.
Our method employs off-the-shelf response retrieval models and response generation models as automatic grayscale data generators.
Experiments on three benchmark datasets and four state-of-the-art matching models show that the proposed approach brings significant and consistent performance improvements.
arXiv Detail & Related papers (2020-04-06T06:34:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.