Everyone's Voice Matters: Quantifying Annotation Disagreement Using
Demographic Information
- URL: http://arxiv.org/abs/2301.05036v1
- Date: Thu, 12 Jan 2023 14:04:53 GMT
- Title: Everyone's Voice Matters: Quantifying Annotation Disagreement Using
Demographic Information
- Authors: Ruyuan Wan, Jaehyung Kim, Dongyeop Kang
- Abstract summary: We study whether the text of a task and annotators' demographic background information can be used to estimate the level of disagreement among annotators.
Our results show that knowing annotators' demographic information, like gender, ethnicity, and education level, helps predict disagreements.
- Score: 11.227630261409706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In NLP annotation, it is common to have multiple annotators label the text
and then obtain the ground truth labels from the majority agreement among annotators. However, annotators are individuals with different backgrounds, and minority opinions should not simply be ignored. As annotation tasks become more subjective and topics more controversial in modern NLP, we need NLP systems that can represent people's diverse voices on subjective matters and
predict the level of diversity. This paper examines whether the text of the
task and annotators' demographic background information can be used to estimate
the level of disagreement among annotators. Particularly, we extract
disagreement labels from the annotators' voting histories in the five
subjective datasets, and then fine-tune language models to predict annotators'
disagreement. Our results show that knowing annotators' demographic
information, like gender, ethnicity, and education level, helps predict
disagreements. To distinguish disagreement that stems from the inherent controversy of the text content from disagreement that stems from annotators' differing perspectives, we simulate everyone's voices with different combinations of artificial annotator demographics and examine the variance of the fine-tuned disagreement predictor's outputs. Our paper aims to improve the annotation process for
more efficient and inclusive NLP systems through a novel disagreement
prediction mechanism. Our code and dataset are publicly available.
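The abstract describes two concrete steps: deriving a disagreement label from annotators' voting histories, and probing the fine-tuned predictor with simulated demographic profiles to see how much its output varies. Below is a minimal sketch of both steps; it is an illustration only, and the entropy-based label, the placeholder predictor, and the demographic attribute values are assumptions rather than the authors' released implementation.

```python
# Minimal sketch of the two steps described in the abstract (illustration only,
# not the authors' released implementation).
from collections import Counter
from itertools import product
from math import log
from statistics import pvariance

def disagreement_label(votes):
    """Normalized entropy of the annotators' vote distribution: 0 = full
    agreement, 1 = votes split evenly across the observed label set."""
    counts = Counter(votes)
    total = len(votes)
    if len(counts) < 2:
        return 0.0
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * log(p) for p in probs)
    return entropy / log(len(counts))

# Example: four annotators voted "toxic", one voted "not toxic".
print(disagreement_label(["toxic", "toxic", "toxic", "toxic", "not_toxic"]))

# Hypothetical stand-in for the fine-tuned disagreement predictor, which in the
# paper is a language model conditioned on the text plus a demographic profile.
def predict_disagreement(text, profile):
    # Placeholder scores standing in for a fine-tuned language model.
    return (0.3
            + 0.1 * (profile["gender"] == "female")
            + 0.2 * (profile["education"] == "college"))

# Probe the predictor with every combination of artificial demographics and
# measure how its predictions vary for one text; the paper uses this variance
# to separate text-driven controversy from perspective-driven disagreement.
demographics = {
    "gender": ["female", "male", "nonbinary"],
    "ethnicity": ["asian", "black", "white"],
    "education": ["high_school", "college"],
}
keys = list(demographics)
profiles = [dict(zip(keys, combo)) for combo in product(*demographics.values())]
scores = [predict_disagreement("some subjective post", p) for p in profiles]
print(f"variance across simulated voices: {pvariance(scores):.4f}")
```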
Related papers
- Reducing annotator bias by belief elicitation [3.0040661953201475]
We propose a simple method for handling bias in annotations without requirements on the number of annotators or instances.
We ask annotators about their beliefs regarding other annotators' judgements of an instance, under the hypothesis that these beliefs may provide more representative labels than the judgements themselves.
The results indicate that bias, defined as systematic differences between the two groups of annotators, is consistently reduced when asking for beliefs instead of judgements.
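The bias measure above (a systematic difference between two groups of annotators, compared between judgements and beliefs) can be made concrete with a toy sketch; all numbers below are placeholders, not data from the paper.

```python
# Toy illustration of the bias measure described above: bias is the systematic
# difference between two groups of annotators, computed once for their direct
# judgements and once for their beliefs about other annotators' judgements.
# All numbers are placeholders, not results from the paper.
from statistics import mean

def group_bias(group_a, group_b):
    """Absolute difference between the two groups' mean label rates."""
    return abs(mean(group_a) - mean(group_b))

# Binary labels (e.g., 1 = "offensive") from two annotator groups.
judgements_a = [1, 1, 1, 0, 1]   # group A, own judgements
judgements_b = [0, 1, 0, 0, 1]   # group B, own judgements
beliefs_a    = [1, 1, 0, 1, 1]   # group A, beliefs about others' judgements
beliefs_b    = [1, 1, 0, 0, 1]   # group B, beliefs about others' judgements

print("bias (judgements):", group_bias(judgements_a, judgements_b))
print("bias (beliefs):   ", group_bias(beliefs_a, beliefs_b))
```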
arXiv Detail & Related papers (2024-10-21T07:44:01Z)
- A Taxonomy of Ambiguity Types for NLP [53.10379645698917]
We propose a taxonomy of ambiguity types as seen in English to facilitate NLP analysis.
Our taxonomy can help make meaningful splits in language ambiguity data, allowing for more fine-grained assessments of both datasets and model performance.
arXiv Detail & Related papers (2024-03-21T01:47:22Z)
- Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks [9.110872603799839]
Supervised classification heavily depends on datasets annotated by humans.
In subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters.
In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks.
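A generic sketch of the annotator-aware idea is shown below: a learned annotator embedding is combined with a text representation so the model can predict each annotator's own label. The dimensions, encoder, and module structure are assumptions for illustration, not the exact AART architecture.

```python
# Generic annotator-aware classifier sketch: an annotator embedding is
# concatenated with a pooled text representation to predict that annotator's
# label. Illustration only; not the exact AART architecture.
import torch
import torch.nn as nn

class AnnotatorAwareClassifier(nn.Module):
    def __init__(self, num_annotators, text_dim=768, annot_dim=64, num_labels=2):
        super().__init__()
        self.annotator_emb = nn.Embedding(num_annotators, annot_dim)
        self.classifier = nn.Linear(text_dim + annot_dim, num_labels)

    def forward(self, text_repr, annotator_ids):
        # text_repr: (batch, text_dim) pooled encoder output for each instance
        # annotator_ids: (batch,) id of the annotator whose label we predict
        annot = self.annotator_emb(annotator_ids)
        return self.classifier(torch.cat([text_repr, annot], dim=-1))

model = AnnotatorAwareClassifier(num_annotators=50)
text_repr = torch.randn(4, 768)               # placeholder encoder outputs
annotator_ids = torch.tensor([3, 3, 17, 42])  # which annotator labeled each row
logits = model(text_repr, annotator_ids)
print(logits.shape)  # torch.Size([4, 2])
```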
arXiv Detail & Related papers (2023-11-16T10:18:32Z)
- Subjective Crowd Disagreements for Subjective Data: Uncovering Meaningful CrowdOpinion with Population-level Learning [8.530934084017966]
We introduce CrowdOpinion, an unsupervised learning approach that uses language features and label distributions to pool similar items into larger samples of label distributions.
We use five publicly available benchmark datasets (with varying levels of annotator disagreements) from social media.
We also experiment in the wild using a dataset from Facebook, where annotations come from the platform itself by users reacting to posts.
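A rough sketch of this pooling idea: items are clustered on a combination of language features and label distributions, and each cluster's label distributions are averaged into one larger sample. The clustering method, feature choice, and toy data below are assumptions, not the paper's exact procedure.

```python
# Sketch of the pooling idea: group items by a combination of language
# features and per-item label distributions, then pool each group's label
# distributions into one larger sample. Assumptions throughout, not the
# paper's exact CrowdOpinion procedure.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_items, n_labels = 12, 3
text_features = rng.normal(size=(n_items, 8))             # stand-in language features
label_dists = rng.dirichlet(np.ones(n_labels), n_items)   # per-item annotator label distributions

# Cluster on the concatenation of language features and label distributions.
features = np.hstack([text_features, label_dists])
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Pool each cluster's label distributions into one larger, smoother sample.
for c in range(3):
    pooled = label_dists[clusters == c].mean(axis=0)
    print(f"cluster {c}: pooled label distribution = {np.round(pooled, 3)}")
```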
arXiv Detail & Related papers (2023-07-07T22:09:46Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and their explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
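A sketch of the described prompting setup is given below: an annotator's Likert rating and explanation are embedded in a prompt that asks an LLM for a numeric score anchored in a rubric. The rubric text and wording are illustrative assumptions, not the paper's actual prompt.

```python
# Sketch of the prompting setup described above; the rubric and wording are
# illustrative assumptions, not the paper's actual prompt.
RUBRIC = """Score 0-100 for answer completeness:
 90-100: fully answers the question with correct supporting details
 50-89 : partially answers the question or omits key details
  0-49 : does not answer the question"""

def build_rescaling_prompt(likert_rating: int, explanation: str) -> str:
    return (
        f"An annotator rated an answer {likert_rating} on a 1-5 Likert scale "
        f"and explained: \"{explanation}\"\n\n"
        "Using the rubric below, convert this judgment into a single integer "
        f"score from 0 to 100. Reply with only the number.\n\n{RUBRIC}".format(RUBRIC=RUBRIC)
    )

prompt = build_rescaling_prompt(4, "Mostly correct but misses one key detail.")
print(prompt)  # this string would then be sent to whichever LLM is available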
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks [45.14664901245331]
A crucial problem in hate speech detection is determining whether a statement is offensive to a demographic group.
We construct a model that predicts individual annotator ratings on potentially offensive text.
We find that annotator ratings can be predicted using their demographic information and opinions on online content.
arXiv Detail & Related papers (2023-05-11T07:55:20Z)
- AnnoBERT: Effectively Representing Multiple Annotators' Label Choices to Improve Hate Speech Detection [18.823219608659986]
AnnoBERT is a first-of-its-kind architecture integrating annotator characteristics and label text to detect hate speech.
During training, the model associates annotators with their label choices given a piece of text.
During evaluation, when label information is not available, the model predicts the aggregated label given by the participating annotators.
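A small sketch of this evaluation-time behaviour: per-annotator predictions for a text are aggregated into a single label when individual labels are unavailable. The per-annotator prediction function below is a placeholder, not AnnoBERT itself.

```python
# Sketch of evaluation-time aggregation: predict a label per participating
# annotator, then take the majority vote. The per-annotator predictor is a
# placeholder, not AnnoBERT.
from collections import Counter

def predict_for_annotator(text: str, annotator_id: int) -> str:
    # Placeholder for a model conditioned on annotator characteristics.
    return "hate" if (annotator_id + len(text)) % 3 else "not_hate"

def aggregated_prediction(text: str, participating_annotators: list) -> str:
    votes = [predict_for_annotator(text, a) for a in participating_annotators]
    return Counter(votes).most_common(1)[0][0]

print(aggregated_prediction("example post", [0, 1, 2, 3, 4]))
```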
arXiv Detail & Related papers (2022-12-20T16:30:11Z)
- Distant finetuning with discourse relations for stance classification [55.131676584455306]
We propose a new method to extract data with silver labels from raw text to finetune a model for stance classification.
We also propose a 3-stage training framework in which the noise level of the finetuning data decreases over the stages.
Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater.
arXiv Detail & Related papers (2022-04-27T04:24:35Z)
- On Guiding Visual Attention with Language Specification [76.08326100891571]
We use high-level language specification as advice for constraining the classification evidence to task-relevant features, instead of distractors.
We show that supervising spatial attention in this way improves performance on classification tasks with biased and noisy data.
arXiv Detail & Related papers (2022-02-17T22:40:19Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that detect oversensitivity- and overstability-inducing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.