Beyond human subjectivity and error: a novel AI grading system
- URL: http://arxiv.org/abs/2405.04323v1
- Date: Tue, 7 May 2024 13:49:59 GMT
- Title: Beyond human subjectivity and error: a novel AI grading system
- Authors: Alexandra Gobrecht, Felix Tuma, Moritz Möller, Thomas Zöller, Mark Zakhvatkin, Alexandra Wuttig, Holger Sommerfeldt, Sven Schütt
- Abstract summary: The grading of open-ended questions is a high-effort, high-impact task in education.
Recent breakthroughs in AI technology might facilitate such automation, but this has not been demonstrated at scale.
We introduce a novel automatic short answer grading (ASAG) system.
- Score: 67.410870290301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The grading of open-ended questions is a high-effort, high-impact task in education. Automating this task promises a significant reduction in workload for education professionals, as well as more consistent grading outcomes for students, by circumventing human subjectivity and error. While recent breakthroughs in AI technology might facilitate such automation, this has not been demonstrated at scale. In this paper, we introduce a novel automatic short answer grading (ASAG) system. The system is based on a fine-tuned open-source transformer model which we trained on a large set of exam data from university courses across a wide range of disciplines. In a first experiment, we evaluated the trained model's performance against held-out test data and found high accuracy levels across a broad spectrum of unseen questions, even in unseen courses. In a second experiment, we compared the performance of our model with that of certified human domain experts: we first assembled another test dataset from real historical exams; the historic grades contained in that data were awarded to students in a regulated, legally binding examination process, so we treated them as ground truth for our experiment. We then asked certified human domain experts and our model to grade the historic student answers again without disclosing the historic grades. Finally, we compared the grades thus obtained with the historic grades (our ground truth). We found that, for the courses examined, the model deviated less from the official historic grades than the human re-graders did: the model's median absolute error was 44% smaller than the human re-graders', implying that the model grades more consistently than humans. These results suggest that leveraging AI-enhanced grading can reduce human subjectivity, improve consistency, and thus ultimately increase fairness.
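As a rough, non-authoritative sketch of the setup the abstract describes: a sequence-classification transformer with a single regression output grades (question, answer) pairs, and both model grades and human re-grades are scored against the historic ground truth by median absolute error. The checkpoint below is a generic stand-in (the authors fine-tune their own open-source transformer on exam data), and the input format and grade scale are assumptions.

```python
# Illustrative sketch only: "distilbert-base-uncased" is a stand-in for the
# paper's fine-tuned open-source transformer; its regression head here is
# untrained, and the input format is an assumption.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "distilbert-base-uncased"  # stand-in checkpoint, not the authors' model
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=1)

def grade(question: str, answer: str) -> float:
    """Predict a grade for a student answer via the single regression logit."""
    enc = tokenizer(question, answer, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).logits.item()

def median_abs_error(predicted, ground_truth) -> float:
    """Median absolute deviation from the historic (legally binding) grades."""
    return float(np.median(np.abs(np.asarray(predicted) - np.asarray(ground_truth))))

# Evaluation as the abstract describes it: the model and blinded human
# re-graders each grade historic answers, and each is scored against the
# historic grades. The paper reports the model's median absolute error to be
# 44% smaller than the human re-graders'.
# mae_model = median_abs_error(model_grades, historic_grades)
# mae_human = median_abs_error(human_regrades, historic_grades)
```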
Related papers
- Auditing an Automatic Grading Model with deep Reinforcement Learning [0.0]
We explore the use of deep reinforcement learning to audit an automatic short answer grading (ASAG) model.
We show that a high level of agreement with human ratings does not give sufficient evidence that an ASAG model is infallible.
arXiv Detail & Related papers (2024-05-11T20:07:09Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores.
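A minimal sketch of the rule-penalized reward described above, assuming illustrative detection rules and penalty weight (the paper's actual rules may differ):

```python
# Hedged sketch: subtract a fixed penalty from the QE-based reward whenever
# simple rules flag the translation as incorrect. The rules and penalty
# weight below are assumptions, not the paper's exact choices.
def qe_reward(source: str, translation: str, qe_score: float,
              penalty: float = 1.0) -> float:
    """QE score minus a penalty when rule checks flag a bad translation."""
    looks_wrong = (
        not translation.strip()                         # empty output
        or translation.strip() == source.strip()        # source copied verbatim
        or len(translation.split()) > 4 * max(1, len(source.split()))  # runaway length
    )
    return qe_score - penalty if looks_wrong else qe_score

# e.g. qe_reward("Guten Morgen", "", qe_score=0.8) -> -0.2
```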
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
- Dialogue-Contextualized Re-ranking for Medical History-Taking [5.039849340960835]
We present a two-stage re-ranking approach that helps close the training-inference gap by re-ranking the first-stage question candidates.
We find that relative to the expert system, the best performance is achieved by our proposed global re-ranker with a transformer backbone.
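A rough sketch of the second-stage idea under stated assumptions: a cross-encoder scores each candidate question against the dialogue context, and candidates are re-ordered by score. The checkpoint is a generic stand-in with an untrained head, not the paper's model:

```python
# Illustrative re-ranking sketch; "distilbert-base-uncased" stands in for a
# fine-tuned transformer re-ranker, and its scoring head is untrained here.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "distilbert-base-uncased"
tok = AutoTokenizer.from_pretrained(CKPT)
ranker = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=1)

def rerank(dialogue: str, candidates: list[str]) -> list[str]:
    """Order first-stage question candidates by cross-encoder score."""
    enc = tok([dialogue] * len(candidates), candidates,
              truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        scores = ranker(**enc).logits.squeeze(-1)
    return [candidates[i] for i in scores.argsort(descending=True)]
```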
arXiv Detail & Related papers (2023-04-04T17:31:32Z)
- Cheating Automatic Short Answer Grading: On the Adversarial Usage of Adjectives and Adverbs [0.0]
We devise a black-box adversarial attack tailored to the educational short answer grading scenario to investigate the grading models' robustness.
We observed a loss of prediction accuracy between 10 and 22 percentage points using the state-of-the-art models BERT and T5.
Based on our experiments, we provide recommendations for utilizing automatic grading systems more safely in practice.
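A small sketch of the attack idea under assumptions (the word lists, insertion strategy, and `grade` function are placeholders; the paper's black-box attack is more elaborate):

```python
# Hedged illustration: insert innocuous adjectives/adverbs into an answer and
# keep perturbations that move the grader's prediction. Word lists and the
# grading function are placeholders, not the paper's attack.
import random

ADJECTIVES = ["basic", "general", "overall"]
ADVERBS = ["basically", "generally", "clearly"]

def perturb(answer: str, n_insertions: int = 2, seed: int = 0) -> str:
    """Insert random adjectives/adverbs at random positions in the answer."""
    rng = random.Random(seed)
    words = answer.split()
    for _ in range(n_insertions):
        words.insert(rng.randrange(len(words) + 1),
                     rng.choice(ADJECTIVES + ADVERBS))
    return " ".join(words)

# Black-box probing loop: compare grade(question, answer) against
# grade(question, perturb(answer, seed=s)) across seeds s and keep the
# perturbation with the largest grade change.
```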
arXiv Detail & Related papers (2022-01-20T17:34:33Z)
- Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction [61.48964753725744]
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that the manual evaluation can indicate very different conclusions from automatic ones.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
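One plausible shape for such an approach, sketched here under assumptions (mean-pooled wav2vec 2.0 features feeding an untrained linear MOS head; the paper's actual architecture and training may differ):

```python
# Illustrative only: self-supervised speech features pooled over time and
# regressed to a single MOS value. The linear head is untrained here and
# would normally be fit on human MOS labels.
import torch
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
mos_head = torch.nn.Linear(encoder.config.hidden_size, 1)

def predict_mos(waveform: torch.Tensor) -> float:
    """waveform: (1, num_samples) mono audio at 16 kHz."""
    with torch.no_grad():
        hidden = encoder(waveform).last_hidden_state  # (1, frames, hidden)
        return mos_head(hidden.mean(dim=1)).item()    # pool over time, regress

# e.g. predict_mos(torch.randn(1, 16000))  # one second of (random) audio
```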
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
- Predicting student performance using data from an auto-grading system [0.0]
We build decision-tree and linear-regression models with various features extracted from the Marmoset auto-grading system.
We show that the linear-regression model using submission time intervals performs the best among all models in terms of Precision and F-Measure.
We also show that, among all models, the students the linear-regression model misclassifies as poor performers have the lowest actual grades.
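A toy version of the submission-interval idea, with made-up data and a hypothetical pass threshold (the Marmoset features and the paper's exact setup are not reproduced here):

```python
# Hedged sketch: summarize gaps between a student's submissions, fit a linear
# regression to grades, then score "poor performance" flags with precision
# and F-measure. Data and threshold below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import precision_score, f1_score

def interval_features(timestamps) -> list[float]:
    """Mean gap, max gap, and submission count from submission timestamps (s)."""
    ts = np.sort(np.asarray(timestamps, dtype=float))
    gaps = np.diff(ts) if ts.size > 1 else np.array([0.0])
    return [float(gaps.mean()), float(gaps.max()), float(ts.size)]

# Synthetic stand-in for auto-grader submission logs and final grades.
submissions = [[0, 600, 1200, 1800], [0, 86400], [0, 300, 900], [0, 172800]]
grades = np.array([85.0, 40.0, 78.0, 35.0])

X = np.array([interval_features(ts) for ts in submissions])
reg = LinearRegression().fit(X, grades)

POOR = 50.0  # hypothetical pass threshold
pred_poor = reg.predict(X) < POOR
true_poor = grades < POOR
print(precision_score(true_poor, pred_poor), f1_score(true_poor, pred_poor))
```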
arXiv Detail & Related papers (2021-02-02T03:02:39Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
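A minimal probe in the spirit of that finding, assuming a sentence-level replacement strategy and a generic `score_essay` function (both placeholders for the toolkit's actual scheme):

```python
# Hedged sketch of an overstability check: overwrite a fraction of an essay's
# sentences with off-topic text and compare AES scores before and after.
# `score_essay` stands in for any AES model's scoring function.
import random

def inject_offtopic(essay: str, fraction: float = 0.25, seed: int = 0) -> str:
    """Replace ~`fraction` of the essay's sentences with unrelated content."""
    rng = random.Random(seed)
    sentences = [s for s in essay.split(". ") if s]
    k = max(1, int(len(sentences) * fraction))
    for i in rng.sample(range(len(sentences)), k):
        sentences[i] = "Penguins huddle together to survive Antarctic winters"
    return ". ".join(sentences)

# An overstable model shows score_essay(essay) ~= score_essay(inject_offtopic(essay))
# even though a quarter of the content is now off-topic.
```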
arXiv Detail & Related papers (2020-07-14T03:49:43Z)