Related papers: Demoting Racial Bias in Hate Speech Detection

Demoting Racial Bias in Hate Speech Detection

URL: http://arxiv.org/abs/2005.12246v1
Date: Mon, 25 May 2020 17:43:22 GMT
Title: Demoting Racial Bias in Hate Speech Detection
Authors: Mengzhou Xia, Anjalie Field, Yulia Tsvetkov
Abstract summary: In current hate speech datasets, there exists a correlation between annotators' perceptions of toxicity and signals of African American English (AAE) In this paper, we use adversarial training to mitigate this bias, introducing a hate speech classifier that learns to detect toxic sentences while demoting confounds corresponding to AAE texts. Experimental results on a hate speech dataset and an AAE dataset suggest that our method is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.
Score: 39.376886409461775
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In current hate speech datasets, there exists a high correlation between annotators' perceptions of toxicity and signals of African American English (AAE). This bias in annotated training data and the tendency of machine learning models to amplify it cause AAE text to often be mislabeled as abusive/offensive/hate speech with a high false positive rate by current hate speech classifiers. In this paper, we use adversarial training to mitigate this bias, introducing a hate speech classifier that learns to detect toxic sentences while demoting confounds corresponding to AAE texts. Experimental results on a hate speech dataset and an AAE dataset suggest that our method is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.

Related papers

Towards Generalizable Generic Harmful Speech Datasets for Implicit Hate Speech Detection [7.762212551172391]
Implicit hate speech has emerged as a critical challenge for social media platforms.<n>We propose an approach to address the detection of implicit hate speech and enhance generalizability across diverse datasets.
arXiv Detail & Related papers (2025-06-19T17:23:08Z)
Compositional Generalisation for Explainable Hate Speech Detection [52.41588643566991]
Hate speech detection is key to online content moderation, but current models struggle to generalise beyond their training data.<n>We show that even when models are trained with more fine-grained, span-level annotations, they struggle to disentangle the meaning of these labels from the surrounding context.<n>We investigate whether training on a dataset where expressions occur with equal frequency across all contexts can improve generalisation.
arXiv Detail & Related papers (2025-06-04T13:07:36Z)
Evaluating Simple Debiasing Techniques in RoBERTa-based Hate Speech Detection Models [0.0]
AAE text is more likely to be misclassified as abusive/hateful compared to non-AAE text. Simple debiasing techniques have been developed in the past to counter this sort of disparity. We apply and evaluate these techniques in the scope of RoBERTa-based encoders.
arXiv Detail & Related papers (2025-01-26T07:18:51Z)
HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning [29.519687405350304]
We introduce a hate speech detection framework, HARE, which harnesses the reasoning capabilities of large language models (LLMs) to fill gaps in explanations of hate speech. Experiments on SBIC and Implicit Hate benchmarks show that our method, using model-generated data, consistently outperforms baselines. Our method enhances the explanation quality of trained models and improves generalization to unseen datasets.
arXiv Detail & Related papers (2023-11-01T06:09:54Z)
Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC) We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages. Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi. Our proposed system removes disfluencies from input speech and returns fluent speech as output.
arXiv Detail & Related papers (2023-05-26T14:13:38Z)
APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation Datasets [4.034948808542701]
APEACH is a method that allows the collection of hate speech generated by unspecified users. By controlling the crowd-generation of hate speech and adding only a minimum post-labeling, we create a corpus that enables the generalizable and fair evaluation of hate speech detection.
arXiv Detail & Related papers (2022-02-25T02:04:38Z)
Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods. Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art. In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages. We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language. We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
Statistical Analysis of Perspective Scores on Hate Speech Detection [7.447951461558536]
State-of-the-art hate speech classifiers are efficient only when tested on the data with the same feature distribution as training data. In such a diverse data distribution relying on low level features is the main cause of deficiency due to natural bias in data. We show that, different hate speech datasets are very similar when it comes to extract their Perspective Scores.
arXiv Detail & Related papers (2021-06-22T17:17:35Z)
An Information Retrieval Approach to Building Datasets for Hate Speech Detection [3.587367153279349]
A common practice is to only annotate tweets containing known hate words'' A second challenge is that definitions of hate speech tend to be highly variable and subjective. Our key insight is that the rarity and subjectivity of hate speech are akin to that of relevance in information retrieval (IR)
arXiv Detail & Related papers (2021-06-17T19:25:39Z)
Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language. We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English)
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
Hate Speech Detection and Racial Bias Mitigation in Social Media based on BERT model [1.9336815376402716]
We introduce a transfer learning approach for hate speech detection based on an existing pre-trained language model called BERT. We evaluate the proposed model on two publicly available datasets annotated for racism, sexism, hate or offensive content on Twitter.
arXiv Detail & Related papers (2020-08-14T16:47:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.