Modeling and Analyzing Scorer Preferences in Short-Answer Math Questions
- URL: http://arxiv.org/abs/2306.00791v1
- Date: Thu, 1 Jun 2023 15:22:05 GMT
- Title: Modeling and Analyzing Scorer Preferences in Short-Answer Math Questions
- Authors: Mengxue Zhang and Neil Heffernan and Andrew Lan
- Abstract summary: We investigate a collection of models that account for the individual preferences and tendencies of each human scorer in the automated scoring task.
We conduct quantitative experiments and case studies to analyze the individual preferences and tendencies of scorers.
- Score: 2.277447144331876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated scoring of student responses to open-ended questions, including
short-answer questions, has great potential to scale to a large number of
responses. Recent approaches for automated scoring rely on supervised learning,
i.e., training classifiers or fine-tuning language models on a small number of
responses with human-provided score labels. However, since scoring is a
subjective process, these human scores are noisy and can be highly variable,
depending on the scorer. In this paper, we investigate a collection of models
that account for the individual preferences and tendencies of each human scorer
in the automated scoring task. We apply these models to a short-answer math
response dataset where each response is scored (often differently) by multiple
different human scorers. We conduct quantitative experiments to show that our
scorer models lead to improved automated scoring accuracy. We also conduct
quantitative experiments and case studies to analyze the individual preferences
and tendencies of scorers. We find that scorers can be grouped into several
clear clusters, each with distinct features, which we analyze in detail.
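The abstract describes the scorer models only at a high level. As a loose illustration of the general idea, and not the authors' actual architecture, one simple way to account for individual scorer tendencies is to condition a grading head on a learned per-scorer embedding; all names below are hypothetical:

```python
# Minimal sketch (not the paper's code): a grader that conditions on a
# learned per-scorer embedding so it can absorb each human scorer's
# individual grading tendencies.
import torch
import torch.nn as nn

class ScorerAwareGrader(nn.Module):
    def __init__(self, text_dim: int, num_scorers: int, num_grades: int,
                 scorer_dim: int = 32):
        super().__init__()
        # One learned vector per human scorer captures that scorer's biases.
        self.scorer_emb = nn.Embedding(num_scorers, scorer_dim)
        self.head = nn.Linear(text_dim + scorer_dim, num_grades)

    def forward(self, response_vec, scorer_id):
        # response_vec: (batch, text_dim) encoding of the student response,
        # e.g. from a fine-tuned language model; scorer_id: (batch,) ints.
        z = torch.cat([response_vec, self.scorer_emb(scorer_id)], dim=-1)
        return self.head(z)  # per-grade logits, conditioned on the scorer

# Usage with dummy inputs:
model = ScorerAwareGrader(text_dim=768, num_scorers=10, num_grades=5)
logits = model(torch.randn(4, 768), torch.tensor([0, 3, 3, 7]))  # (4, 5)
```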
Related papers
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses drawn from a wide range of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Short Answer Grading Using One-shot Prompting and Text Similarity Scoring Model [2.14986347364539]
We developed an automated short answer grading model that provided both analytic scores and holistic scores.
The accuracy and quadratic weighted kappa of our model were 0.67 and 0.71 on a subset of the publicly available ASAG dataset.
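Quadratic weighted kappa, the agreement metric quoted above, penalizes disagreements by the square of their distance on the grade scale; it can be computed with scikit-learn (the grade lists below are placeholders, not the paper's data):

```python
from sklearn.metrics import cohen_kappa_score

human_grades = [0, 1, 2, 2, 1, 0, 2]  # human-assigned grades (placeholder)
model_grades = [0, 1, 2, 1, 1, 0, 2]  # model-predicted grades (placeholder)
qwk = cohen_kappa_score(human_grades, model_grades, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```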
arXiv Detail & Related papers (2023-05-29T22:05:29Z) - SeedBERT: Recovering Annotator Rating Distributions from an Aggregated
Label [43.23903984174963]
We propose SeedBERT, a method for recovering annotator rating distributions from a single label.
Our human evaluations indicate that SeedBERT's attention mechanism is consistent with human sources of annotator disagreement.
arXiv Detail & Related papers (2022-11-23T18:35:15Z) - Multi-Scored Sleep Databases: How to Exploit the Multiple-Labels in
Automated Sleep Scoring [19.24428734909019]
We exploit the label smoothing technique together with a soft-consensus distribution to inject the knowledge of multiple scorers into the model's training procedure.
We introduce an averaged cosine similarity metric to quantify the similarity between the hypnodensity graph generated by the models trained with LSSC and the hypnodensity graph generated by the scorer consensus.
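A hypnodensity graph is conventionally stored as a (num_epochs x num_stages) matrix of per-epoch sleep-stage probabilities; under that assumption, the averaged cosine similarity reduces to a mean of row-wise cosine similarities, roughly as sketched here:

```python
import numpy as np

def averaged_cosine_similarity(h1: np.ndarray, h2: np.ndarray) -> float:
    """Mean cosine similarity between matching rows (sleep epochs)."""
    num = (h1 * h2).sum(axis=1)
    den = np.linalg.norm(h1, axis=1) * np.linalg.norm(h2, axis=1)
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
model_hyp = rng.dirichlet(np.ones(5), size=100)      # placeholder model output
consensus_hyp = rng.dirichlet(np.ones(5), size=100)  # placeholder consensus
print(averaged_cosine_similarity(model_hyp, consensus_hyp))
```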
arXiv Detail & Related papers (2022-07-05T09:41:21Z)
- Automated Scoring for Reading Comprehension via In-context BERT Tuning [9.135673900486827]
In this paper, we report our (grand prize-winning) solution to the National Assessment of Educational Progress (NAEP) automated scoring challenge for reading comprehension.
Our approach, in-context BERT fine-tuning, produces a single shared scoring model for all items with a carefully-designed input structure.
We demonstrate the effectiveness of our approach via local evaluations using the training dataset provided by the challenge.
arXiv Detail & Related papers (2022-05-19T21:16:15Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Learning with Instance Bundles for Reading Comprehension [61.823444215188296]
We introduce new supervision techniques that compare question-answer scores across multiple related instances.
Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers.
We empirically demonstrate the effectiveness of training with instance bundles on two datasets.
arXiv Detail & Related papers (2021-04-18T06:17:54Z)
- My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone to even common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)
- Get It Scored Using AutoSAS -- An Automated System for Scoring Short Answers [63.835172924290326]
We present a fast, scalable, and accurate approach towards automated Short Answer Scoring (SAS).
We propose and explain the design and development of a system for SAS, namely AutoSAS.
AutoSAS shows state-of-the-art performance, improving results by over 8% on some of the question prompts.
arXiv Detail & Related papers (2020-12-21T10:47:30Z)
- Stacking Neural Network Models for Automatic Short Answer Scoring [0.0]
We propose a stacking model that combines a neural network and XGBoost, with sentence embeddings as features, for the classification process.
The best model obtained an F1-score of 0.821, exceeding previous work on the same dataset.
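The summary names the components but not the wiring; a rough sketch of such a stack, using scikit-learn's StackingClassifier, the xgboost package, and random placeholder features in place of real sentence embeddings (the paper's actual configuration may differ):

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

X = np.random.randn(200, 384)          # placeholder sentence embeddings
y = np.random.randint(0, 2, size=200)  # placeholder correct/incorrect labels

stack = StackingClassifier(
    estimators=[("mlp", MLPClassifier(max_iter=300)),
                ("xgb", XGBClassifier(n_estimators=100))],
    final_estimator=LogisticRegression(),  # meta-learner combining both
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```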
arXiv Detail & Related papers (2020-10-21T16:00:09Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.