Balancing Cost and Quality: An Exploration of Human-in-the-loop
Frameworks for Automated Short Answer Scoring
- URL: http://arxiv.org/abs/2206.08288v1
- Date: Thu, 16 Jun 2022 16:43:18 GMT
- Title: Balancing Cost and Quality: An Exploration of Human-in-the-loop
Frameworks for Automated Short Answer Scoring
- Authors: Hiroaki Funayama, Tasuku Sato, Yuichiroh Matsubayashi, Tomoya
Mizumoto, Jun Suzuki and Kentaro Inui
- Abstract summary: Short answer scoring (SAS) is the task of grading short text written by a learner.
We present the first study exploring the use of a human-in-the-loop framework to minimize grading cost.
We find that our human-in-the-loop framework allows automatic scoring models and human graders to achieve the target scoring quality.
- Score: 36.58449231222223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Short answer scoring (SAS) is the task of grading short text written by a
learner. In recent years, deep-learning-based approaches have substantially
improved the performance of SAS models, but how to guarantee high-quality
predictions remains a critical issue when applying such models in the
education field. Toward guaranteeing high-quality predictions, we present the
first study exploring the use of a human-in-the-loop framework that minimizes
grading cost while guaranteeing grading quality by allowing a SAS model
to share the grading task with a human grader. Specifically, by introducing a
confidence estimation method that indicates the reliability of the model's
predictions, one can guarantee scoring quality by using only high-reliability
predictions in the scoring results and deferring low-reliability predictions
to human graders. In our experiments, we investigate the feasibility of the
proposed framework using multiple confidence estimation methods and multiple
SAS datasets. We find that our human-in-the-loop framework allows automatic
scoring models and human graders to achieve the target scoring quality.
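Below is a minimal sketch of the confidence-based deferral loop the abstract describes. Everything here (the function name, the 0.8 threshold, the toy data) is illustrative and not from the paper's code; in the paper's setting, the threshold would be calibrated on held-out data so that the auto-scored subset meets the target quality.

```python
import numpy as np

def route_predictions(scores, confidences, threshold):
    """Split model predictions into auto-accepted and human-deferred sets.

    scores:       model-predicted scores for each answer
    confidences:  estimated reliability of each prediction (e.g., max
                  softmax probability or a learned confidence)
    threshold:    minimum confidence required to accept a prediction
    """
    auto_idx = np.where(confidences >= threshold)[0]   # trusted: keep model score
    human_idx = np.where(confidences < threshold)[0]   # uncertain: send to grader
    return auto_idx, human_idx

# Toy example: 5 answers, the model is unsure about two of them.
scores = np.array([3, 1, 2, 0, 3])
confidences = np.array([0.97, 0.62, 0.91, 0.55, 0.88])
auto_idx, human_idx = route_predictions(scores, confidences, threshold=0.8)
print("auto-scored:", auto_idx, "deferred to human:", human_idx)
```

The grading cost is then the size of the deferred set, which is what the paper trades off against the target scoring quality.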
Related papers
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show that GREAT Score correlates highly with attack-based model rankings on RobustBench while incurring significantly lower cost.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- On Uncertainty Calibration and Selective Generation in Probabilistic Neural Summarization: A Benchmark Study [14.041071717005362]
Modern deep models for summarization attain impressive benchmark performance, but they are prone to generating miscalibrated predictive uncertainty.
This means that they assign high confidence to low-quality predictions, leading to compromised reliability and trustworthiness in real-world applications.
Probabilistic deep learning methods are common solutions to the miscalibration problem, but their relative effectiveness in complex autoregressive summarization tasks is not well understood.
arXiv Detail & Related papers (2023-04-17T23:06:28Z)
- Robust Deep Learning for Autonomous Driving [0.0]
We introduce a new criterion to reliably estimate model confidence: the true class probability (TCP).
Since the true class is by essence unknown at test time, we propose to learn the TCP criterion from data with an auxiliary model, introducing a specific learning scheme adapted to this context (a sketch follows this entry).
We tackle the challenge of jointly detecting misclassification and out-of-distribution samples by introducing a new uncertainty measure based on evidential models and defined on the simplex.
arXiv Detail & Related papers (2022-11-14T22:07:11Z)
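A minimal PyTorch sketch of the TCP idea in the entry above, in the spirit of ConfidNet-style auxiliary confidence learning. The interfaces (a classifier that exposes features, the small MLP head) are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Auxiliary model that regresses TCP from the classifier's features,
    since the true class (and hence TCP itself) is unknown at test time."""
    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, features):
        return self.mlp(features).squeeze(-1)

def tcp_targets(logits, labels):
    # TCP target: softmax probability assigned to the *true* class.
    probs = torch.softmax(logits, dim=-1)
    return probs.gather(1, labels.unsqueeze(1)).squeeze(1)

def confidence_loss(conf_head, logits, features, labels):
    # Train the auxiliary head on labeled data to match TCP (MSE regression).
    # Shapes assumed: logits (N, C), features (N, D), labels (N,) long.
    target = tcp_targets(logits, labels).detach()
    return nn.functional.mse_loss(conf_head(features), target)
```

At test time, the head's output serves as the confidence score in place of the unobservable TCP.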
- Reliability-Aware Prediction via Uncertainty Learning for Person Image Retrieval [51.83967175585896]
UAL aims at providing reliability-aware predictions by considering data uncertainty and model uncertainty simultaneously.
Data uncertainty captures the "noise" inherent in the sample, while model uncertainty depicts the model's confidence in the sample's prediction (a generic sketch of this decomposition follows this entry).
arXiv Detail & Related papers (2022-10-24T17:53:20Z)
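The data/model uncertainty split in the entry above is a common decomposition. The sketch below illustrates one generic way to obtain both, using a predicted-variance head for data uncertainty and Monte Carlo dropout for model uncertainty; it is not UAL's actual formulation.

```python
import torch

def mc_dropout_uncertainty(model, x, n_samples=20):
    """Assumed interface: model(x) -> (mean, log_var). Data uncertainty comes
    from the predicted variance; model uncertainty from the spread of
    stochastic forward passes with dropout left active."""
    model.train()  # keep dropout layers active at inference time
    means, data_vars = [], []
    with torch.no_grad():
        for _ in range(n_samples):
            mean, log_var = model(x)
            means.append(mean)
            data_vars.append(log_var.exp())
    means = torch.stack(means)
    data_uncertainty = torch.stack(data_vars).mean(0)   # noise in the sample
    model_uncertainty = means.var(0)                    # disagreement across passes
    return means.mean(0), data_uncertainty, model_uncertainty
```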
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold (sketched after this entry).
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
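A minimal sketch of the ATC recipe summarized above, with invented toy data: pick the threshold so that the share of source-validation examples above it matches validation accuracy, then read off the same share on unlabeled target data.

```python
import numpy as np

def learn_atc_threshold(val_confidences, val_correct):
    """Pick a threshold so that the fraction of source-validation examples
    above it equals the validation accuracy (the core ATC idea)."""
    acc = val_correct.mean()
    # mean(conf > t) == acc  <=>  t is the (1 - acc) quantile of confidences.
    return np.quantile(val_confidences, 1.0 - acc)

def predict_target_accuracy(target_confidences, threshold):
    # Predicted accuracy = fraction of unlabeled target examples whose
    # confidence exceeds the learned threshold.
    return (target_confidences > threshold).mean()

# Toy example with synthetic confidences.
rng = np.random.default_rng(0)
val_conf = rng.beta(5, 2, size=1000)          # source validation confidences
val_correct = rng.random(1000) < val_conf     # correctness tracks confidence
t = learn_atc_threshold(val_conf, val_correct)
target_conf = rng.beta(4, 3, size=1000)       # shifted target distribution
print("estimated target accuracy:", predict_target_accuracy(target_conf, t))
```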
- Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees [63.62448343531963]
We propose a combination of the existing paradigms, intelligently sampling responses to be scored by humans.
We observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget (a standard QWK implementation is sketched after this entry).
arXiv Detail & Related papers (2021-11-17T05:00:51Z)
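QWK, the agreement metric the entry above reports gains on, penalizes disagreements by the squared distance between ratings. The sketch below is a standard implementation for reference, not the paper's code.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes):
    """Quadratic weighted kappa between two integer ratings in [0, n_classes)."""
    O = np.zeros((n_classes, n_classes))          # observed rating matrix
    for i, j in zip(a, b):
        O[i, j] += 1
    hist_a, hist_b = O.sum(axis=1), O.sum(axis=0)
    E = np.outer(hist_a, hist_b) / len(a)         # expected agreement by chance
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2  # quadratic weights
    return 1.0 - (W * O).sum() / (W * E).sum()

human = np.array([0, 1, 2, 3, 2, 1])
model = np.array([0, 1, 2, 2, 2, 0])
print(quadratic_weighted_kappa(human, model, n_classes=4))
```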
- Task-Specific Normalization for Continual Learning of Blind Image Quality Models [105.03239956378465]
We present a simple yet effective continual learning method for blind image quality assessment (BIQA).
The key step in our approach is to freeze all convolution filters of a pre-trained deep neural network (DNN) for an explicit promise of stability.
We assign each new IQA dataset (i.e., task) a prediction head, and load the corresponding normalization parameters to produce a quality score.
The final quality estimate is computed by a weighted summation of predictions from all heads with a lightweight $K$-means gating mechanism (one possible reading is sketched after this entry).
arXiv Detail & Related papers (2021-07-28T15:21:01Z)
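One plausible reading of the $K$-means gating described above, sketched under our own assumptions (per-task centroids of training features, distance-based softmax weights); the paper's exact mechanism may differ.

```python
import numpy as np

def gated_quality_score(features, head_scores, task_centroids, temperature=1.0):
    """Hypothetical K-means gating: weight each task head by how close the
    input features are to that task's nearest K-means centroid.

    features:       (d,) feature vector of the test image
    head_scores:    (T,) quality predictions, one per task head
    task_centroids: list of (K_t, d) centroid arrays, one per task
    """
    # Distance from the input to each task's closest centroid.
    dists = np.array([np.linalg.norm(c - features, axis=1).min()
                      for c in task_centroids])
    weights = np.exp(-dists / temperature)
    weights /= weights.sum()                 # softmax-style normalization
    return float(weights @ head_scores)      # weighted summation of heads
```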
- Confidence Estimation via Auxiliary Models [47.08749569008467]
We introduce a novel target criterion for model confidence, namely the true class probability (TCP).
We show that TCP offers better properties for confidence estimation than the standard maximum class probability (MCP); a toy comparison follows below.
arXiv Detail & Related papers (2020-12-11T17:21:12Z)
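To make the MCP/TCP contrast concrete, a toy example with invented probabilities:

```python
import numpy as np

# A misclassified example where MCP is misleadingly high but TCP is low.
probs = np.array([0.55, 0.30, 0.15])  # softmax output over 3 classes
true_class = 1                        # model predicts class 0: an error

mcp = probs.max()          # maximum class probability: 0.55 (looks confident)
tcp = probs[true_class]    # true class probability:    0.30 (reveals the error)
print(f"MCP={mcp:.2f}, TCP={tcp:.2f}")
# Whenever the prediction is wrong, the true class cannot hold the largest
# probability mass, so TCP never exceeds 0.5 on errors; this makes it a
# cleaner failure signal than MCP, which must merely be the largest entry.
```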
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences arising from its use.