Using Sampling to Estimate and Improve Performance of Automated Scoring
Systems with Guarantees
- URL: http://arxiv.org/abs/2111.08906v1
- Date: Wed, 17 Nov 2021 05:00:51 GMT
- Title: Using Sampling to Estimate and Improve Performance of Automated Scoring
Systems with Guarantees
- Authors: Yaman Kumar Singla, Sriram Krishna, Rajiv Ratn Shah, Changyou Chen
- Abstract summary: We propose a combination of the existing paradigms, intelligently sampling the responses to be scored by humans.
We observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget.
- Score: 63.62448343531963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated Scoring (AS), the natural language processing task of scoring
essays and speeches in an educational testing setting, is growing in popularity
and being deployed across contexts from government examinations to companies
providing language proficiency services. However, existing systems either forgo
human raters entirely, thus harming the reliability of the test, or score every
response by both human and machine, thereby increasing costs. We target the
spectrum of possible solutions in between, making use of both humans and
machines to provide a higher quality test while keeping costs reasonable to
democratize access to AS. In this work, we propose a combination of the
existing paradigms, intelligently sampling the responses to be scored by
humans. Using our proposed reward sampling, we observe significant gains in
accuracy (19.80% increase on average) and quadratic weighted kappa (QWK)
(25.60% on average) with a relatively small human budget (30% of samples).
The accuracy increases observed with the standard random and importance
sampling baselines are 8.6% and 12.2%, respectively. Furthermore, we
demonstrate the system's model-agnostic nature by measuring its performance
on a variety of
models currently deployed in an AS setting as well as pseudo models. Finally,
we propose an algorithm to estimate the accuracy/QWK with statistical
guarantees (our code is available at https://git.io/J1IOy).
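For reference, quadratic weighted kappa (QWK), the agreement metric reported above, has the standard definition (not specific to this paper):

```latex
\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}},
\qquad
w_{ij} = \frac{(i - j)^2}{(N - 1)^2}
```

where O_ij counts responses scored i by one rater and j by the other, E_ij is the expected count under independent rater marginals, and N is the number of score categories.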
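The abstract does not spell out the reward-sampling rule itself, so the following is only a minimal sketch of the hybrid scoring idea: route a fixed budget of responses to human raters according to a per-response priority signal, keep machine scores for the rest, and evaluate accuracy and QWK on the combined result. The `priority` used here is an oracle stand-in for illustration, not the paper's learned reward.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

def hybrid_scores(machine_scores, human_scores, priority, budget=0.3):
    """Replace machine scores with human scores for the top `budget`
    fraction of responses, ranked by a per-response priority signal."""
    n = len(machine_scores)
    k = int(budget * n)
    human_idx = np.argsort(priority)[-k:]  # k highest-priority responses
    final = np.asarray(machine_scores).copy()
    final[human_idx] = np.asarray(human_scores)[human_idx]
    return final

# Toy data: integer scores on a 0-5 scale.
rng = np.random.default_rng(0)
gold = rng.integers(0, 6, size=1000)                        # human scores
machine = np.clip(gold + rng.integers(-2, 3, 1000), 0, 5)   # noisy machine
# Oracle priority for illustration only; a real system would use a
# model-derived signal (e.g., uncertainty or a learned reward), never
# the gold labels themselves.
priority = np.abs(machine - gold)

final = hybrid_scores(machine, gold, priority, budget=0.3)
print("accuracy:", accuracy_score(gold, final))
print("QWK     :", cohen_kappa_score(gold, final, weights="quadratic"))
```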
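The abstract likewise leaves the guarantee algorithm to the paper itself; as a rough illustration of how accuracy can be estimated with statistical guarantees from a sample, a Hoeffding bound on a uniformly sampled, human-scored subset gives a distribution-free confidence interval. This is a generic construction, not necessarily the one the authors propose.

```python
import math
import numpy as np

def accuracy_interval(agreements, delta=0.05):
    """Two-sided Hoeffding interval for the true machine/human agreement
    rate, given 0/1 agreement indicators on a uniform random sample.
    With probability >= 1 - delta the true rate lies in the interval."""
    n = len(agreements)
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    mean = float(np.mean(agreements))
    return max(0.0, mean - eps), min(1.0, mean + eps)

# e.g. 300 sampled responses, machine matched the human score on 240:
lo, hi = accuracy_interval(np.array([1] * 240 + [0] * 60))
print(f"accuracy in [{lo:.3f}, {hi:.3f}] with 95% confidence")
```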
Related papers
- It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives [40.197673152937256]
Training statistical performance models often requires vast amounts of data, leading to a significant time investment, and can be difficult when hardware availability is limited.
We propose a novel performance modeling methodology that significantly reduces the number of training samples while maintaining good accuracy.
We achieve a Mean Absolute Percentage Error (MAPE) as low as 0.02% for single-layer estimations and 0.68% for whole-model estimations with fewer than 10000 training samples.
arXiv Detail & Related papers (2024-06-12T15:34:28Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show that GREAT Score correlates highly with, and costs significantly less than, attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Balancing Cost and Quality: An Exploration of Human-in-the-loop Frameworks for Automated Short Answer Scoring [36.58449231222223]
Short answer scoring (SAS) is the task of grading short text written by a learner.
We present the first study exploring the use of a human-in-the-loop framework for minimizing grading cost.
We find that our human-in-the-loop framework allows automatic scoring models and human graders to achieve the target scoring quality.
arXiv Detail & Related papers (2022-06-16T16:43:18Z)
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z)
- IQDet: Instance-wise Quality Distribution Sampling for Object Detection [25.31113751275204]
We propose a dense object detector with an instance-wise sampling strategy, named IQDet.
Our best model achieves 51.6 AP, outperforming all existing state-of-the-art one-stage detectors while adding no cost at inference time.
arXiv Detail & Related papers (2021-04-14T15:57:22Z)
- Get It Scored Using AutoSAS -- An Automated System for Scoring Short Answers [63.835172924290326]
We present a fast, scalable, and accurate approach to automated Short Answer Scoring (SAS).
We propose and explain the design and development of a system for SAS, namely AutoSAS.
AutoSAS shows state-of-the-art performance, improving results by over 8% on some question prompts.
arXiv Detail & Related papers (2020-12-21T10:47:30Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25% of the content) unrelated to the topic of the questions do not decrease the scores the models produce.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)