On the Challenges of Building Datasets for Hate Speech Detection
- URL: http://arxiv.org/abs/2309.02912v1
- Date: Wed, 6 Sep 2023 11:15:47 GMT
- Title: On the Challenges of Building Datasets for Hate Speech Detection
- Authors: Vitthal Bhandari
- Abstract summary: We first analyze the issues surrounding hate speech detection through a data-centric lens.
We then outline a holistic framework to encapsulate the data creation pipeline across seven broad dimensions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detection of hate speech has been formulated as a standalone application of
NLP and different approaches have been adopted for identifying the target
groups, obtaining raw data, defining the labeling process, choosing the
detection algorithm, and evaluating the performance in the desired setting.
However, unlike other downstream tasks, hate speech suffers from the lack of
large-sized, carefully curated, generalizable datasets owing to the highly
subjective nature of the task. In this paper, we first analyze the issues
surrounding hate speech detection through a data-centric lens. We then outline
a holistic framework to encapsulate the data creation pipeline across seven
broad dimensions by taking the specific example of hate speech towards sexual
minorities. We posit that practitioners would benefit from following this
framework as a form of best practice when creating hate speech datasets in the
future.
Related papers
- Hate Speech Detection in Limited Data Contexts using Synthetic Data
Generation [1.9506923346234724]
We propose a data augmentation approach that addresses the problem of lack of data for online hate speech detection in limited data contexts.
We present three methods to synthesize new examples of hate speech data in a target language that retains the hate sentiment in the original examples but transfers the hate targets.
Our findings show that a model trained on synthetic data performs comparably to, and in some cases outperforms, a model trained only on the samples available in the target domain.
arXiv Detail & Related papers (2023-10-04T15:10:06Z) - SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding
Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmark for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z) - Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Reducing Target Group Bias in Hate Speech Detectors [56.94616390740415]
We show that text classification models trained on large publicly available datasets, may significantly under-perform on several protected groups.
We propose to perform token-level hate sense disambiguation, and utilize tokens' hate sense representations for detection.
arXiv Detail & Related papers (2021-12-07T17:49:34Z) - Speaker Embedding-aware Neural Diarization for Flexible Number of
Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z) - Latent Hatred: A Benchmark for Understanding Implicit Hate Speech [22.420275418616242]
This work introduces a theoretically-justified taxonomy of implicit hate speech and a benchmark corpus with fine-grained labels for each message.
We present systematic analyses of our dataset using contemporary baselines to detect and explain implicit hate speech.
arXiv Detail & Related papers (2021-09-11T16:52:56Z) - An Information Retrieval Approach to Building Datasets for Hate Speech
Detection [3.587367153279349]
A common practice is to only annotate tweets containing known hate words''
A second challenge is that definitions of hate speech tend to be highly variable and subjective.
Our key insight is that the rarity and subjectivity of hate speech are akin to that of relevance in information retrieval (IR)
arXiv Detail & Related papers (2021-06-17T19:25:39Z) - Exploiting Unsupervised Data for Emotion Recognition in Conversations [76.01690906995286]
Emotion Recognition in Conversations (ERC) aims to predict the emotional state of speakers in conversations.
The available supervised data for the ERC task is limited.
We propose a novel approach to leverage unsupervised conversation data.
arXiv Detail & Related papers (2020-10-02T13:28:47Z) - Towards Hate Speech Detection at Large via Deep Generative Modeling [4.080068044420974]
Hate speech detection is a critical problem in social media platforms.
We present a dataset of 1 million realistic hate and non-hate sequences, produced by a deep generative language model.
We demonstrate consistent and significant performance improvements across five public hate speech datasets.
arXiv Detail & Related papers (2020-05-13T15:25:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.