Agreeing to Disagree: Annotating Offensive Language Datasets with
Annotators' Disagreement
- URL: http://arxiv.org/abs/2109.13563v1
- Date: Tue, 28 Sep 2021 08:55:04 GMT
- Title: Agreeing to Disagree: Annotating Offensive Language Datasets with
Annotators' Disagreement
- Authors: Elisa Leonardelli, Stefano Menini, Alessio Palmero Aprosio, Marco
Guerini, Sara Tonelli
- Abstract summary: We focus on the level of agreement among annotators while selecting data to create offensive language datasets.
Our study comprises the creation of three novel datasets of English tweets covering different topics.
We show that such hard cases, where low agreement is present, are not necessarily due to poor-quality annotation.
- Score: 7.288480094345606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since state-of-the-art approaches to offensive language detection rely on
supervised learning, it is crucial to quickly adapt them to the continuously
evolving scenario of social media. While several approaches have been proposed
to tackle the problem from an algorithmic perspective, so as to reduce the need
for annotated data, less attention has been paid to the quality of these data.
Following a trend that has emerged recently, we focus on the level of agreement
among annotators while selecting data to create offensive language datasets, a
task involving a high level of subjectivity. Our study comprises the creation
of three novel datasets of English tweets covering different topics and having
five crowd-sourced judgments each. We also present an extensive set of
experiments showing that selecting training and test data according to
different levels of annotators' agreement has a strong effect on classifier
performance and robustness. Our findings are further validated in cross-domain
experiments and studied using a popular benchmark dataset. We show that such
hard cases, where low agreement is present, are not necessarily due to
poor-quality annotation and we advocate for a higher presence of ambiguous
cases in future datasets, particularly in test sets, to better account for the
different points of view expressed online.
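As an illustration of how such agreement-based selection could be implemented, below is a minimal sketch assuming binary offensive/not-offensive labels and five crowd-sourced judgments per tweet; the bucket names, thresholds, and field names are illustrative and not taken from the paper.

```python
import numpy as np

def agreement_level(judgments):
    """Fraction of annotators voting for the majority label.

    With five binary judgments this is 1.0 (5/5 agree), 0.8 (4/5) or 0.6 (3/5).
    """
    votes = np.bincount(np.asarray(judgments), minlength=2)
    return votes.max() / votes.sum()

def split_by_agreement(tweets):
    """Bucket examples by annotator agreement (field names are hypothetical)."""
    buckets = {"unanimous": [], "mild_disagreement": [], "hard": []}
    for t in tweets:
        a = agreement_level(t["judgments"])  # five 0/1 labels per tweet
        if a == 1.0:
            buckets["unanimous"].append(t)
        elif a >= 0.8:
            buckets["mild_disagreement"].append(t)
        else:  # 3-vs-2 split: the ambiguous, low-agreement cases
            buckets["hard"].append(t)
    return buckets

corpus = [
    {"text": "example tweet A", "judgments": [1, 1, 1, 1, 1]},
    {"text": "example tweet B", "judgments": [1, 0, 1, 1, 0]},
]
print({k: len(v) for k, v in split_by_agreement(corpus).items()})
```

Training data could then be drawn from the high-agreement buckets while the test set deliberately keeps a share of the hard, low-agreement cases, in line with the abstract's recommendation to include more ambiguous examples in test sets.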
Related papers
- When a Language Question Is at Stake. A Revisited Approach to Label
Sensitive Content [0.0]
This article revisits an approach to pseudo-labelling sensitive data, using the example of Ukrainian tweets covering the Russian-Ukrainian war.
We provide a fundamental statistical analysis of the obtained data, an evaluation of the models used for pseudo-labelling, and guidelines on how researchers can leverage the corpus.
arXiv Detail & Related papers (2023-11-17T13:35:10Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes (a rough sketch of this mixing idea is given after this list).
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Time Series Contrastive Learning with Information-Aware Augmentations [57.45139904366001]
A key component of contrastive learning is selecting appropriate augmentations that impose suitable priors to construct feasible positive samples.
How to find the desired augmentations of time series data that are meaningful for given contrastive learning tasks and datasets remains an open question.
We propose a new contrastive learning approach with information-aware augmentations, InfoTS, that adaptively selects optimal augmentations for time series representation learning.
arXiv Detail & Related papers (2023-03-21T15:02:50Z) - Is one annotation enough? A data-centric image classification benchmark
for noisy and ambiguous label estimation [2.2807344448218503]
We propose a data-centric image classification benchmark with nine real-world datasets and multiple annotations per image.
We show that multiple annotations allow a better approximation of the real underlying class distribution.
We identify that hard labels cannot capture the ambiguity of the data and that this might lead to the common issue of overconfident models (see the soft-label sketch after this list).
arXiv Detail & Related papers (2022-07-13T14:17:21Z) - Annotation Error Detection: Analyzing the Past and Present for a More
Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z) - Investigating User Radicalization: A Novel Dataset for Identifying
Fine-Grained Temporal Shifts in Opinion [7.028604573959653]
We introduce an innovative annotated dataset for modeling subtle opinion fluctuations and detecting fine-grained stances.
The dataset includes a sufficient amount of stance polarity and intensity labels per user over time and within entire conversational threads.
All posts are annotated by non-experts and a significant portion of the data is also annotated by experts.
arXiv Detail & Related papers (2022-04-16T09:31:25Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z) - Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z) - SCDE: Sentence Cloze Dataset with High Quality Distractors From
Examinations [30.86193649398141]
We introduce SCDE, a dataset to evaluate the performance of computational models through sentence prediction.
SCDE is a human-created sentence cloze dataset, collected from public school English examinations.
arXiv Detail & Related papers (2020-04-27T16:48:54Z)
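For the "Tackling Diverse Minorities in Imbalanced Classification" entry above, the idea of mixing minority and majority samples can be illustrated with a generic mixup-style interpolation. This is only a rough sketch of the general technique under assumed settings, not the paper's actual iterative algorithm; all names and parameters are hypothetical.

```python
import numpy as np

def mix_samples(x_minority, x_majority, alpha=0.5, n_new=100, seed=None):
    """Create synthetic feature vectors by convexly combining minority and
    majority examples; the synthetic points keep the minority label."""
    rng = np.random.default_rng(seed)
    idx_min = rng.integers(0, len(x_minority), size=n_new)
    idx_maj = rng.integers(0, len(x_majority), size=n_new)
    # Mixing coefficients clipped toward the minority side keep the new
    # points closer to the minority distribution.
    lam = rng.beta(alpha, alpha, size=(n_new, 1)).clip(0.5, 1.0)
    return lam * x_minority[idx_min] + (1.0 - lam) * x_majority[idx_maj]

# 20 minority points vs 500 majority points in an 8-dimensional feature space
x_min = np.random.randn(20, 8) + 3.0
x_maj = np.random.randn(500, 8)
synthetic = mix_samples(x_min, x_maj, n_new=200, seed=0)
print(synthetic.shape)  # (200, 8)
```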
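For the "Is one annotation enough?" entry, the contrast between hard and soft labels referenced there can be made concrete with a small numeric sketch; the data, helper functions, and class counts below are invented for illustration and are not taken from that benchmark.

```python
import numpy as np

def soft_labels(annotations, n_classes):
    """Turn several annotations per item into a per-item label distribution.

    `annotations` has shape (n_items, n_annotators) and holds class ids.
    """
    counts = np.stack([np.bincount(a, minlength=n_classes) for a in annotations])
    return counts / counts.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets, eps=1e-12):
    """Mean cross-entropy of predictions against (soft or one-hot) targets."""
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=1)))

# Three items, five annotators, two classes; the last item is genuinely ambiguous.
ann = np.array([[0, 0, 0, 0, 0],
                [1, 1, 1, 1, 0],
                [0, 1, 0, 1, 1]])
soft = soft_labels(ann, n_classes=2)       # last item -> [0.4, 0.6]
hard = np.eye(2)[soft.argmax(axis=1)]      # majority vote: all mass on one class

# Overconfident predictions on ambiguous items look fine under hard labels
# but are penalised under the soft label distribution.
pred = np.array([[0.99, 0.01], [0.01, 0.99], [0.05, 0.95]])
print(cross_entropy(pred, hard), cross_entropy(pred, soft))
```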