Related papers: HateModerate: Testing Hate Speech Detectors against Content Moderation Policies

HateModerate: Testing Hate Speech Detectors against Content Moderation Policies

URL: http://arxiv.org/abs/2307.12418v2
Date: Tue, 19 Mar 2024 02:17:22 GMT
Title: HateModerate: Testing Hate Speech Detectors against Content Moderation Policies
Authors: Jiangrui Zheng, Xueqing Liu, Guanqun Yang, Mirazul Haque, Xing Qian, Ravishka Rathnasuriya, Wei Yang, Girish Budhrani,
Abstract summary: We create HateModerate, a dataset for testing the behaviors of automated content moderators against content policies. We test the performance of state-of-the-art hate speech detectors against HateModerate. We observe significant improvement in the models' conformity to content policies while having comparable scores on the original test data.
Score: 6.893854392439938
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To protect users from massive hateful content, existing works studied automated hate speech detection. Despite the existing efforts, one question remains: do automated hate speech detectors conform to social media content policies? A platform's content policies are a checklist of content moderated by the social media platform. Because content moderation rules are often uniquely defined, existing hate speech datasets cannot directly answer this question. This work seeks to answer this question by creating HateModerate, a dataset for testing the behaviors of automated content moderators against content policies. First, we engage 28 annotators and GPT in a six-step annotation process, resulting in a list of hateful and non-hateful test suites matching each of Facebook's 41 hate speech policies. Second, we test the performance of state-of-the-art hate speech detectors against HateModerate, revealing substantial failures these models have in their conformity to the policies. Third, using HateModerate, we augment the training data of a top-downloaded hate detector on HuggingFace. We observe significant improvement in the models' conformity to content policies while having comparable scores on the original test data. Our dataset and code can be found in the attachment.

Related papers

HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation [67.69631485036665]
We conduct a comprehensive examination of hate speech regulations and strategies from three perspectives.<n>Our findings reveal significant inconsistencies in hate speech definitions and moderation practices across jurisdictions.<n>We suggest ideas and research direction for further exploration of a unified framework for automated hate speech moderation.
arXiv Detail & Related papers (2025-07-06T11:25:23Z)
A Hate Speech Moderated Chat Application: Use Case for GDPR and DSA Compliance [0.0]
This research presents a novel application capable of implementing legal and ethical reasoning into the content moderation process. Two use cases fundamental to online communication are presented and implemented using technologies such as GPT-3.5, Solid Pods, and the rule language Prova. The work proposes a novel approach to reason within different legal and ethical definitions of hate speech and plan the fitting counter hate speech.
arXiv Detail & Related papers (2024-10-10T08:28:38Z)
Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts [0.0]
We first introduce the ViTHSD - a targeted hate speech detection dataset for Vietnamese Social Media Texts. The dataset contains 10K comments, each comment is labeled to specific targets with three levels: clean, offensive, and hate. The inter-annotator agreement obtained from the dataset is 0.45 by Cohen's Kappa index, which is indicated as a moderate level.
arXiv Detail & Related papers (2024-04-30T04:16:55Z)
Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B. We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively. We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment [26.504056750529124]
We present GOTHate, a large-scale code-mixed crowdsourced dataset of around 51k posts for hate speech detection from Twitter. We benchmark it with 10 recent baselines and investigate how adding endogenous signals enhances the hate speech detection task. Our solution HEN-mBERT is a modular, multilingual, mixture-of-experts model that enriches the linguistic subspace with latent endogenous signals.
arXiv Detail & Related papers (2023-06-01T19:36:52Z)
Analyzing Norm Violations in Live-Stream Chat [49.120561596550395]
We study the first NLP study dedicated to detecting norm violations in conversations on live-streaming platforms. We define norm violation categories in live-stream chats and annotate 4,583 moderated comments from Twitch. Our results show that appropriate contextual information can boost moderation performance by 35%.
arXiv Detail & Related papers (2023-05-18T05:58:27Z)
A User-Driven Framework for Regulating and Auditing Social Media [94.70018274127231]
We propose that algorithmic filtering should be regulated with respect to a flexible, user-driven baseline. We require that the feeds a platform filters contain "similar" informational content as their respective baseline feeds. We present an auditing procedure that checks whether a platform honors this requirement.
arXiv Detail & Related papers (2023-04-20T17:53:34Z)
CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations. We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z)
Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods. Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art. In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages. We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language. We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
Unsupervised Domain Adaptation for Hate Speech Detection Using a Data Augmentation Approach [6.497816402045099]
We propose an unsupervised domain adaptation approach to augment labeled data for hate speech detection. We show our approach improves Area under the Precision/Recall curve by as much as 42% and recall by as much as 278%.
arXiv Detail & Related papers (2021-07-27T15:01:22Z)
An Information Retrieval Approach to Building Datasets for Hate Speech Detection [3.587367153279349]
A common practice is to only annotate tweets containing known hate words'' A second challenge is that definitions of hate speech tend to be highly variable and subjective. Our key insight is that the rarity and subjectivity of hate speech are akin to that of relevance in information retrieval (IR)
arXiv Detail & Related papers (2021-06-17T19:25:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.