Leveraging Human Feedback to Scale Educational Datasets: Combining
Crowdworkers and Comparative Judgement
- URL: http://arxiv.org/abs/2305.12894v2
- Date: Thu, 9 Nov 2023 18:02:58 GMT
- Title: Leveraging Human Feedback to Scale Educational Datasets: Combining
Crowdworkers and Comparative Judgement
- Authors: Owen Henkel and Libby Hills
- Abstract summary: This paper reports on two experiments investigating the use of non-expert crowdworkers and comparative judgement to evaluate student data.
We found that using comparative judgement substantially improved inter-rater reliability on both tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine Learning models have many potentially beneficial applications in
education settings, but a key barrier to their development is securing enough
data to train these models. Labelling educational data has traditionally relied
on highly skilled raters using complex, multi-class rubrics, making the process
expensive and difficult to scale. An alternative, more scalable approach could
be to use non-expert crowdworkers to evaluate student work; however,
maintaining sufficiently high levels of accuracy and inter-rater reliability
when using non-expert workers is challenging. This paper reports on two
experiments investigating the use of non-expert crowdworkers and comparative
judgement to evaluate complex student data. Crowdworkers were hired to evaluate
student responses to open-ended reading comprehension questions. Crowdworkers
were randomly assigned to one of two conditions: the control, where they were
asked to decide whether answers were correct or incorrect (i.e., a categorical
judgement), or the treatment, where they were shown the same question and
answers, but were instead asked to decide which of two candidate answers was
more correct (i.e., a comparative/preference-based judgement). We found that
using comparative judgement substantially improved inter-rater reliability on
both tasks. These results are in line with well-established literature on the
benefits of comparative judgement in the field of educational assessment, as
well as with recent trends in artificial intelligence research, where
comparative judgement is becoming the preferred method for providing human
feedback on model outputs when working with non-expert crowdworkers. However,
to our knowledge, these results are novel and important in demonstrating the
benefits of combining comparative judgement and crowdworkers to evaluate
educational data.
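To make the comparative-judgement condition concrete, the sketch below shows one common way pairwise "which answer is more correct" judgements can be aggregated into per-answer scores, using a Bradley-Terry model fitted with standard minorization-maximization updates. The abstract does not say how the authors scored or aggregated judgements, so this is an illustrative sketch only; the function name and the example judgements are hypothetical.

```python
# Illustrative sketch only (not the paper's implementation): aggregate
# pairwise "which answer is more correct" judgements into per-answer
# scores with a Bradley-Terry model, fitted by standard MM updates.
# The answer IDs and judgements below are hypothetical.
from collections import defaultdict


def bradley_terry(comparisons, n_iters=200, tol=1e-8):
    """comparisons: list of (winner_id, loser_id) pairs from crowdworkers."""
    items = {i for pair in comparisons for i in pair}
    wins = defaultdict(int)    # w_i: comparisons won by item i
    counts = defaultdict(int)  # n_ij: comparisons between items i and j
    for winner, loser in comparisons:
        wins[winner] += 1
        counts[frozenset((winner, loser))] += 1

    scores = {i: 1.0 for i in items}  # start from uniform strengths
    for _ in range(n_iters):
        new = {}
        for i in items:
            # denom > 0 because every item appears in at least one comparison
            denom = sum(
                counts[frozenset((i, j))] / (scores[i] + scores[j])
                for j in items
                if j != i and counts[frozenset((i, j))] > 0
            )
            new[i] = wins[i] / denom  # items that never win go to 0
        mean = sum(new.values()) / len(new)
        new = {i: s / mean for i, s in new.items()}  # keep a stable scale
        if max(abs(new[i] - scores[i]) for i in items) < tol:
            scores = new
            break
        scores = new
    return scores  # higher score = judged "more correct" more often


# Hypothetical usage: three student answers, six pairwise judgements.
judgements = [("A", "B"), ("A", "C"), ("B", "C"),
              ("A", "B"), ("C", "B"), ("A", "C")]
print(bradley_terry(judgements))  # A scores highest; B and C are roughly tied
```

In the comparative-judgement literature, inter-rater reliability is often assessed on the resulting scale of scores (for example via split-half correlations) rather than on per-item labels.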
Related papers
- Mitigating Observation Biases in Crowdsourced Label Aggregation [19.460509608096217]
One of the technical challenges in obtaining high-quality results from crowdsourcing is dealing with the variability and bias that arise because the work is executed by humans.
In this study, we focus on observation bias in crowdsourcing.
Variations occur in the frequency of worker responses and in the complexity of tasks, which may affect the aggregation results.
arXiv Detail & Related papers (2023-02-25T15:19:13Z) - Assisting Human Decisions in Document Matching [52.79491990823573]
We devise a proxy matching task that allows us to evaluate which kinds of assistive information improve decision makers' performance.
We find that providing black-box model explanations reduces users' accuracy on the matching task.
On the other hand, custom methods that are designed to closely attend to some task-specific desiderata are found to be effective in improving user performance.
arXiv Detail & Related papers (2023-02-16T17:45:20Z) - In Search of Insights, Not Magic Bullets: Towards Demystification of the
Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria.
We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z) - A Comparative User Study of Human Predictions in Algorithm-Supported
Recidivism Risk Assessment [2.097880645003119]
We study the effects of using an algorithm-based risk assessment instrument to support the prediction of the risk of criminal recidivism.
The task is to predict whether a person who has been released from prison will commit a new crime, leading to re-incarceration.
arXiv Detail & Related papers (2022-01-26T17:40:35Z) - What Ingredients Make for an Effective Crowdsourcing Protocol for
Difficult NLU Data Collection Tasks? [31.39009622826369]
We compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality.
We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty.
We observe that the data from the iterative protocol with expert assessments is more challenging by several measures.
arXiv Detail & Related papers (2021-06-01T21:05:52Z) - Learning with Instance Bundles for Reading Comprehension [61.823444215188296]
We introduce new supervision techniques that compare question-answer scores across multiple related instances.
Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers.
We empirically demonstrate the effectiveness of training with instance bundles on two datasets.
arXiv Detail & Related papers (2021-04-18T06:17:54Z) - Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based active learning (AL) are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z) - Predicting respondent difficulty in web surveys: A machine-learning
approach based on mouse movement features [3.6944296923226316]
This paper explores the predictive value of mouse-tracking data with regard to respondents' difficulty.
We use data from a survey on respondents' employment history and demographic information.
We develop a personalization method that adjusts for respondents' baseline mouse behavior and evaluate its performance.
arXiv Detail & Related papers (2020-11-05T10:54:33Z) - Long-Tailed Recognition Using Class-Balanced Experts [128.73438243408393]
We propose an ensemble of class-balanced experts that combines the strength of diverse classifiers.
Our ensemble of class-balanced experts reaches results close to state-of-the-art and an extended ensemble establishes a new state-of-the-art on two benchmarks for long-tailed recognition.
arXiv Detail & Related papers (2020-04-07T20:57:44Z) - The World is Not Binary: Learning to Rank with Grayscale Data for
Dialogue Response Selection [55.390442067381755]
We show that grayscale data can be automatically constructed without human effort.
Our method employs off-the-shelf response retrieval models and response generation models as automatic grayscale data generators.
Experiments on three benchmark datasets and four state-of-the-art matching models show that the proposed approach brings significant and consistent performance improvements.
arXiv Detail & Related papers (2020-04-06T06:34:54Z) - Studying the Effects of Cognitive Biases in Evaluation of Conversational
Agents [10.248512149493443]
We conduct a study with 77 crowdsourced workers to understand the role of cognitive biases, specifically anchoring bias, when humans are asked to evaluate the output of conversational agents.
We find that increased consistency in ratings across the two experimental conditions may be a result of anchoring bias.
arXiv Detail & Related papers (2020-02-18T23:52:39Z)