Crowdsourcing subjective annotations using pairwise comparisons reduces
bias and error compared to the majority-vote method
- URL: http://arxiv.org/abs/2305.20042v2
- Date: Thu, 1 Jun 2023 13:18:27 GMT
- Title: Crowdsourcing subjective annotations using pairwise comparisons reduces
bias and error compared to the majority-vote method
- Authors: Hasti Narimanzadeh, Arash Badie-Modiri, Iuliia Smirnova, Ted Hsuan Yun
Chen
- Abstract summary: We introduce a theoretical framework for understanding how random error and measurement bias enter into crowdsourced annotations of subjective constructs.
We then propose a pipeline that combines pairwise comparison labelling with Elo scoring, and demonstrate that it outperforms the ubiquitous majority-voting method in reducing both types of measurement error.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How to better reduce measurement variability and bias introduced by
subjectivity in crowdsourced labelling remains an open question. We introduce a
theoretical framework for understanding how random error and measurement bias
enter into crowdsourced annotations of subjective constructs. We then propose a
pipeline that combines pairwise comparison labelling with Elo scoring, and
demonstrate that it outperforms the ubiquitous majority-voting method in
reducing both types of measurement error. To assess the performance of the
labelling approaches, we constructed an agent-based model of crowdsourced
labelling that lets us introduce different types of subjectivity into the
tasks. We find that under most conditions with task subjectivity, the
comparison approach produced higher $f_1$ scores. Further, the comparison
approach is less susceptible to inflating bias, which majority voting tends to
do. To facilitate applications, we show with simulated and real-world data that
the number of required random comparisons for the same classification accuracy
scales log-linearly $O(N \log N)$ with the number of labelled items. We also
implemented the Elo system as an open-source Python package.
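To make the proposed pipeline concrete, below is a minimal Python sketch of Elo scoring over randomly sampled pairwise comparisons, followed by thresholding the scores into binary labels. It is a sketch under stated assumptions, not the authors' open-source package: the function names (`expected_score`, `elo_rank`), the K-factor of 32, the initial rating of 1000, the comparison-budget constant, and the 20% positive-rate cutoff are all illustrative choices.

```python
import math
import random
from collections import defaultdict

def expected_score(r_a, r_b):
    """Elo-model probability that item A is judged higher than item B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_rank(items, compare, n_comparisons, k=32, init=1000.0, seed=0):
    """Score items from randomly sampled pairwise comparisons.

    `compare(a, b)` returns 1 if an annotator judges `a` higher on the
    construct than `b`, else 0. Sketch only; not the authors' package.
    """
    rng = random.Random(seed)
    rating = defaultdict(lambda: init)
    for _ in range(n_comparisons):
        a, b = rng.sample(items, 2)
        outcome = compare(a, b)                       # crowdsourced judgement
        exp_a = expected_score(rating[a], rating[b])
        rating[a] += k * (outcome - exp_a)            # standard Elo update
        rating[b] += k * ((1 - outcome) - (1 - exp_a))
    return {i: rating[i] for i in items}

# Toy usage: 100 items with a latent subjective score and a noisy annotator.
items = list(range(100))
latent = {i: random.random() for i in items}

def noisy_compare(a, b):
    """Simulated annotator: compares latent scores with Gaussian noise."""
    return int(latent[a] + random.gauss(0, 0.1) > latent[b])

# Comparison budget ~ N log N, following the abstract's scaling result.
n_comparisons = int(len(items) * math.log2(len(items)))
scores = elo_rank(items, noisy_compare, n_comparisons)

# Binary labels by thresholding the Elo scores (here, top ~20% positive).
cutoff = sorted(scores.values())[int(0.8 * len(items))]
labels = {i: int(scores[i] >= cutoff) for i in items}
```

The comparison budget in the toy usage follows the abstract's $O(N \log N)$ scaling; the multiplicative constant and the thresholding rule for converting scores into class labels are assumptions that would need tuning against a target accuracy.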
Related papers
- Active Preference Learning for Ordering Items In- and Out-of-sample [7.0774164818430565]
Actively sampling item pairs can reduce the number of annotations necessary for learning an accurate ordering.
Many algorithms ignore shared structure between items.
It is also common to disregard how noise in comparisons varies between item pairs.
arXiv Detail & Related papers (2024-05-05T21:44:03Z) - $F_\beta$-plot -- a visual tool for evaluating imbalanced data classifiers [0.0]
The paper proposes a simple approach to analyzing the popular parametric metric $F_\beta$.
It is possible to indicate for a given pool of analyzed classifiers when a given model should be preferred depending on user requirements.
arXiv Detail & Related papers (2024-04-11T18:07:57Z) - Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks [9.110872603799839]
Supervised classification heavily depends on datasets annotated by humans.
In subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters.
In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks.
arXiv Detail & Related papers (2023-11-16T10:18:32Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect
Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs) to detect social biases.
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - Learning with Noisy Labels by Targeted Relabeling [52.0329205268734]
Crowdsourcing platforms are often used to collect datasets for training deep neural networks.
We propose an approach which reserves a fraction of annotations to explicitly relabel highly probable labeling errors.
arXiv Detail & Related papers (2021-10-15T20:37:29Z) - Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise
Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
arXiv Detail & Related papers (2021-10-08T13:51:55Z) - Disentangling Sampling and Labeling Bias for Learning in Large-Output
Spaces [64.23172847182109]
We show that different negative sampling schemes implicitly trade-off performance on dominant versus rare labels.
We provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance.
arXiv Detail & Related papers (2021-05-12T15:40:13Z) - Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z) - OpinionRank: Extracting Ground Truth Labels from Unreliable Expert
Opinions with Graph-Based Spectral Ranking [2.1930130356902207]
Crowdsourcing has emerged as a popular, inexpensive, and efficient data mining solution for performing distributed label collection.
We propose OpinionRank, a model-free, interpretable, graph-based spectral algorithm for integrating crowdsourced annotations into reliable labels.
Our experiments show that OpinionRank performs favorably when compared against more highly parameterized algorithms.
arXiv Detail & Related papers (2021-02-11T08:12:44Z) - Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking
Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.