When Do Annotator Demographics Matter? Measuring the Influence of
Annotator Demographics with the POPQUORN Dataset
- URL: http://arxiv.org/abs/2306.06826v2
- Date: Mon, 28 Aug 2023 21:14:35 GMT
- Title: When Do Annotator Demographics Matter? Measuring the Influence of
Annotator Demographics with the POPQUORN Dataset
- Authors: Jiaxin Pei and David Jurgens
- Abstract summary: We show that annotators' background plays a significant role in their judgments.
Backgrounds not previously considered in NLP (e.g., education) are meaningful and should be considered.
Our study suggests that understanding the background of annotators and collecting labels from a demographically balanced pool of crowd workers is important to reduce the bias of datasets.
- Score: 19.591722115337564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Annotators are not fungible. Their demographics, life experiences, and
backgrounds all contribute to how they label data. However, NLP has only
recently considered how annotator identity might influence their decisions.
Here, we present POPQUORN (the POtato-Prolific dataset for QUestion-Answering,
Offensiveness, text Rewriting, and politeness rating with demographic Nuance).
POPQUORN contains 45,000 annotations from 1,484 annotators, drawn from a
sample that is representative of the US population with respect to sex, age,
and race.
Through a series of analyses, we show that annotators' background plays a
significant role in their judgments. Further, our work shows that backgrounds
not previously considered in NLP (e.g., education) are meaningful and should
be considered. Our study suggests that understanding the background of
annotators and collecting labels from a demographically balanced pool of crowd
workers is important to reduce the bias of datasets. The dataset, annotator
background, and annotation interface are available at
https://github.com/Jiaxin-Pei/potato-prolific-dataset .
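As a rough illustration of the kind of analysis the abstract describes (testing whether annotator background explains variance in subjective ratings), the sketch below fits a simple regression with categorical demographic predictors. It is not the authors' code: the file name and the column names (rating, gender, race, age_group, education) are assumptions about the released data, and the paper's own analyses may differ.

```python
# Minimal sketch, not the authors' code: probe whether annotator demographics
# explain variance in subjective ratings. File and column names are assumed.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical per-annotation file: one row per (item, annotator) pair,
# with the rating and the annotator's self-reported demographics.
df = pd.read_csv("politeness_annotations.csv")

# Ordinary least squares with categorical demographic predictors.
model = smf.ols(
    "rating ~ C(gender) + C(race) + C(age_group) + C(education)",
    data=df,
).fit()

# Type-II ANOVA: which demographic factors are associated with rating variance?
print(sm.stats.anova_lm(model, typ=2))
```

In practice one would also control for item-level effects (e.g., with mixed-effects models), since the same text is rated by multiple annotators.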
Related papers
- Which Demographics do LLMs Default to During Annotation? [9.190535758368567]
Two research directions have developed in the context of using large language models (LLMs) for data annotation.
We evaluate which attributes of human annotators LLMs inherently mimic.
We observe notable influences related to gender, race, and age in demographic prompting.
arXiv Detail & Related papers (2024-10-11T14:02:42Z)
- Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks [9.110872603799839]
Supervised classification heavily depends on datasets annotated by humans.
In subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters.
In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks.
arXiv Detail & Related papers (2023-11-16T10:18:32Z)
- Aligning with Whom? Large Language Models Have Gender and Racial Biases in Subjective NLP Tasks [15.015148115215315]
We conduct experiments on four popular large language models (LLMs) to investigate their capability to understand group differences and potential biases in their predictions for politeness and offensiveness.
We find that for both tasks, model predictions are closer to the labels from White and female participants.
More specifically, when prompted to respond from the perspective of "Black" and "Asian" individuals, models show lower performance in predicting both the overall scores and the scores from the corresponding groups (see the prompting sketch after this entry).
arXiv Detail & Related papers (2023-11-16T10:02:24Z)
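The two LLM-focused entries above both rely on demographic prompting, i.e., asking a model to answer as if it belonged to a particular group. The sketch below shows that setup in its simplest form; the prompt wording, rating scale, and group labels are illustrative assumptions, not the prompts used in either paper.

```python
# Minimal sketch of demographic (persona) prompting for offensiveness rating.
# Wording, scale, and groups are illustrative assumptions.

def demographic_prompt(text: str, group: str) -> str:
    """Build a rating prompt conditioned on a demographic persona."""
    return (
        f"Imagine you are a {group} person. On a scale from 1 (not offensive) "
        f"to 5 (very offensive), how offensive is the following text?\n\n"
        f"Text: {text}\nAnswer with a single number."
    )

example = "That's a really dumb take."
for group in ["White", "Black", "Asian", "Hispanic"]:
    prompt = demographic_prompt(example, group)
    # Send `prompt` to the LLM of your choice here, then compare the returned
    # scores with ratings from human annotators in the corresponding group.
    print(f"--- {group} ---\n{prompt}\n")
```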
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- Subjective Crowd Disagreements for Subjective Data: Uncovering Meaningful CrowdOpinion with Population-level Learning [8.530934084017966]
We introduce CrowdOpinion, an unsupervised learning approach that uses language features and label distributions to pool similar items into larger samples of label distributions.
We use five publicly available benchmark datasets (with varying levels of annotator disagreements) from social media.
We also experiment in the wild using a dataset from Facebook, where annotations come from the platform itself by users reacting to posts.
arXiv Detail & Related papers (2023-07-07T22:09:46Z)
- When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks [45.14664901245331]
A crucial problem in hate speech detection is determining whether a statement is offensive to a demographic group.
We construct a model that predicts individual annotator ratings on potentially offensive text.
We find that annotator ratings can be predicted using their demographic information and opinions on online content (a minimal modeling sketch follows this entry).
arXiv Detail & Related papers (2023-05-11T07:55:20Z)
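As a rough, self-contained illustration of the modeling idea in the entry above (predicting an individual annotator's rating from the text plus that annotator's demographics), the sketch below combines TF-IDF text features with one-hot demographic features in a single regressor. The toy data, feature choices, and model are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch: predict per-annotator ratings from text + demographics.
# Toy data and model choices are illustrative assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One row per (text, annotator) pair; 'rating' is offensiveness on a 1-5 scale.
df = pd.DataFrame({
    "text": ["you people are the worst", "have a nice day",
             "that was a dumb idea", "thanks for the help"],
    "gender": ["woman", "man", "woman", "man"],
    "education": ["college", "high_school", "graduate", "college"],
    "rating": [4.0, 1.0, 3.0, 1.0],
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "text"),  # text features
    ("demo", OneHotEncoder(handle_unknown="ignore"), ["gender", "education"]),
])

model = Pipeline([("features", features), ("regressor", Ridge())])
model.fit(df[["text", "gender", "education"]], df["rating"])

# Predict how a specific annotator profile would rate a new text.
new = pd.DataFrame({"text": ["what a stupid comment"],
                    "gender": ["woman"], "education": ["college"]})
print(model.predict(new))
```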
- Everyone's Voice Matters: Quantifying Annotation Disagreement Using Demographic Information [11.227630261409706]
We study whether the text of a task and annotators' demographic background information can be used to estimate the level of disagreement among annotators.
Our results show that knowing annotators' demographic information, like gender, ethnicity, and education level, helps predict disagreements.
arXiv Detail & Related papers (2023-01-12T14:04:53Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Balancing Biases and Preserving Privacy on Balanced Faces in the Wild [50.915684171879036]
There are demographic biases present in current facial recognition (FR) models.
We introduce our Balanced Faces in the Wild dataset to measure these biases across different ethnic and gender subgroups.
We find that relying on a single score threshold to differentiate between genuine and impostor sample pairs leads to suboptimal results.
We propose a novel domain adaptation learning scheme that uses facial features extracted from state-of-the-art neural networks.
arXiv Detail & Related papers (2021-03-16T15:05:49Z)
- Trawling for Trolling: A Dataset [56.1778095945542]
We present a dataset that models trolling as a subcategory of offensive content.
The dataset has 12,490 samples, split across five classes: Normal, Profanity, Trolling, Derogatory, and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z)
- Contrastive Examples for Addressing the Tyranny of the Majority [83.93825214500131]
We propose to create a balanced training dataset, consisting of the original dataset plus new data points in which the group memberships are intervened.
We show that current generative adversarial networks are a powerful tool for learning these data points, called contrastive examples.
arXiv Detail & Related papers (2020-04-14T14:06:44Z)