On Analyzing Annotation Consistency in Online Abusive Behavior Datasets
- URL: http://arxiv.org/abs/2006.13507v1
- Date: Wed, 24 Jun 2020 06:34:25 GMT
- Title: On Analyzing Annotation Consistency in Online Abusive Behavior Datasets
- Authors: Md Rabiul Awal, Rui Cao, Roy Ka-Wei Lee, Sandra Mitrović
- Abstract summary: Researchers have proposed, collected, and annotated online abusive content datasets.
These datasets play a critical role in facilitating the research on online hate speech and abusive behaviors.
It is often contentious what the true label of a given text should be, as the semantic differences between labels may be blurred.
- Score: 5.900114841365645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online abusive behavior is an important issue that breaks the cohesiveness of
online social communities and even raises public safety concerns in our
societies. Motivated by this rising issue, researchers have proposed,
collected, and annotated online abusive content datasets. These datasets play a
critical role in facilitating the research on online hate speech and abusive
behaviors. However, annotating such datasets is a difficult task; it is often
contentious what the true label of a given text should be, as the semantic
differences between labels (e.g., abusive vs. hateful) may be blurred and the
judgment is often subjective. In this study, we proposed an analytical framework to study
the annotation consistency in online hate and abusive content datasets. We
applied our proposed framework to evaluate the consistency of the annotation in
three popular datasets that are widely used in online hate speech and abusive
behavior studies. We found that there is still a substantial amount of
annotation inconsistency in the existing datasets, particularly when the labels
are semantically similar.
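As an illustration of the kind of analysis such a framework performs (a minimal sketch, not the authors' method), inter-annotator agreement can be quantified with Cohen's kappa, and disagreements can be tallied per label pair to check whether they concentrate on semantically similar labels. The label set and annotations below are hypothetical.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each annotator labelled independently
    # according to their own label marginals.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lbl] * cb[lbl] for lbl in ca.keys() | cb.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations for six posts; the label set is illustrative.
ann1 = ["hateful", "abusive", "normal", "hateful", "abusive", "normal"]
ann2 = ["abusive", "abusive", "normal", "hateful", "hateful", "normal"]

print(f"kappa = {cohen_kappa(ann1, ann2):.2f}")  # kappa = 0.50

# Count which label pairs the annotators confuse; disagreement that
# concentrates on close labels (hateful vs. abusive) is exactly the
# inconsistency the abstract describes.
confusions = Counter(tuple(sorted(p)) for p in zip(ann1, ann2) if p[0] != p[1])
print(confusions.most_common())  # [(('abusive', 'hateful'), 2)]
```

For more than two annotators or items with missing labels, Krippendorff's alpha is the usual generalization of this measure.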
Related papers
- UNIT: Unsupervised Online Instance Segmentation through Time [69.2787246878521]
We tackle the problem of class-agnostic unsupervised online instance segmentation and tracking.
We propose a new training recipe that enables the online tracking of objects.
Our network is trained on pseudo-labels, eliminating the need for manual annotations.
arXiv Detail & Related papers (2024-09-12T09:47:45Z)
- HarmPot: An Annotation Framework for Evaluating Offline Harm Potential of Social Media Text [1.304892050913381]
We define "harm potential" as the potential for an online public post to cause real-world physical harm (i.e., violence).
In this paper, we discuss the development of a framework/annotation schema that allows annotating the data with different aspects of the text.
arXiv Detail & Related papers (2024-03-17T06:23:25Z)
- From Categories to Classifiers: Name-Only Continual Learning by Exploring the Web [118.67589717634281]
Continual learning often relies on the availability of extensive annotated datasets, but manual annotation is unrealistically time-consuming and costly in practice.
We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation.
Our proposed solution leverages the expansive and ever-evolving internet to query and download uncurated webly-supervised data for image classification.
arXiv Detail & Related papers (2023-11-19T10:43:43Z)
- A Taxonomy of Rater Disagreements: Surveying Challenges & Opportunities from the Perspective of Annotating Online Toxicity [15.23055494327071]
Toxicity is an increasingly common and severe issue in online spaces.
A rich line of machine learning research has focused on computationally detecting and mitigating online toxicity.
Recent research has pointed out the importance of accounting for the subjective nature of this task.
arXiv Detail & Related papers (2023-11-07T21:00:51Z)
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents on Not Safe For Work (NSFW) scores computed from images alone does not exclude all the harmful content in the accompanying alt-text (see the sketch after this entry).
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
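Since the audit above reports that image-only NSFW filtering misses harmful alt-text, a filter would need to screen both modalities. The sketch below is hypothetical and not the audit's methodology; the threshold, the sample fields, and the keyword lexicon (a stand-in for a real text-toxicity classifier) are all placeholder assumptions.

```python
# Hypothetical two-modality filter; a real pipeline would replace the keyword
# lexicon with a trained text-toxicity classifier.
BLOCKLIST = {"<slur>", "<epithet>"}  # placeholder terms, not a real lexicon

def keep_sample(nsfw_score: float, alt_text: str, threshold: float = 0.5) -> bool:
    """Keep an (image, alt-text) pair only if both modalities pass screening."""
    if nsfw_score >= threshold:          # image-based screen
        return False
    tokens = set(alt_text.lower().split())
    return not (tokens & BLOCKLIST)      # text-based screen on the alt-text

# An image-only filter would keep the second sample; screening alt-text drops it.
samples = [(0.10, "a dog in a park"), (0.10, "<slur> playing outside")]
print([keep_sample(score, text) for score, text in samples])  # [True, False]
```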
- Adapting to Online Label Shift with Provable Guarantees [137.89382409682233]
We formulate and investigate the problem of online label shift.
The non-stationarity and lack of supervision make the problem challenging to tackle.
Our algorithms enjoy optimal dynamic regret, indicating that their performance is competitive with that of a clairvoyant that knows the label shifts in advance (see the sketch after this entry).
arXiv Detail & Related papers (2022-07-05T15:43:14Z)
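As context for the entry above (a generic prior-reweighting sketch, not the paper's regret-optimal algorithm): under label shift, p(x|y) stays fixed while p(y) drifts, so a fixed classifier's posteriors can be corrected by the ratio of an online estimate of the current label marginal to the training prior. The two-class data and the exponential-moving-average update below are illustrative assumptions.

```python
import numpy as np

def reweight(probs: np.ndarray, train_prior: np.ndarray,
             test_prior: np.ndarray) -> np.ndarray:
    """Correct posteriors for a shifted label prior: scale by test/train ratio."""
    w = probs * (test_prior / train_prior)
    return w / w.sum()

train_prior = np.array([0.5, 0.5])   # class balance the model was trained on
test_prior = train_prior.copy()      # running estimate of the current prior
DECAY = 0.99                         # EMA weight on the old estimate

def observe_label(y: int) -> None:
    """Update the running label-marginal estimate when a true label arrives."""
    global test_prior
    test_prior = DECAY * test_prior + (1 - DECAY) * np.eye(len(test_prior))[y]

# Hypothetical drift: a burst of class-1 labels shifts the estimated prior,
# and the corrected posterior for a 60/40 prediction follows the shift.
for _ in range(200):
    observe_label(1)
print(reweight(np.array([0.6, 0.4]), train_prior, test_prior))  # ~[0.10, 0.90]
```

When true labels arrive only with delay, or not at all, the marginal must instead be estimated from the classifier's own outputs (e.g., confusion-matrix-based estimators), which is where the lack-of-supervision difficulty the entry mentions arises.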
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Towards Ethics by Design in Online Abusive Content Detection [7.163723138100273]
The research effort has spread out across several closely related sub-areas, such as detection of hate speech, toxicity, cyberbullying, etc.
We bring ethical issues to the forefront and propose a unified framework as a two-step process.
The novel framework is guided by the Ethics by Design principle and is a step towards building more accurate and trusted models.
arXiv Detail & Related papers (2020-10-28T13:10:24Z)
- ETHOS: an Online Hate Speech Detection Dataset [6.59720246184989]
We present 'ETHOS', a textual dataset with two variants: binary and multi-label, based on YouTube and Reddit comments validated using the Figure-Eight crowdsourcing platform.
Our key assumption is that even the small amount of labelled data obtained through such a time-consuming process can guarantee hate speech occurrences in the examined material.
arXiv Detail & Related papers (2020-06-11T08:59:57Z)
- WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection [0.0]
We propose an original framework, based on the Wikipedia Comment corpus, with comment-level annotations of different types.
This large corpus of more than 380k annotated messages opens new perspectives for online abuse detection, especially for context-based approaches.
Alongside this corpus, we also propose a complete benchmarking platform to stimulate and fairly compare scientific work on abusive content detection.
arXiv Detail & Related papers (2020-03-13T10:26:45Z)
- Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias [113.44471186752018]
Existing models often leverage co-occurrences between objects and their context to improve recognition accuracy.
This work focuses on addressing such contextual biases to improve the robustness of the learnt feature representations.
arXiv Detail & Related papers (2020-01-09T18:31:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.