Empirical Evaluation of Public HateSpeech Datasets
- URL: http://arxiv.org/abs/2407.12018v1
- Date: Thu, 27 Jun 2024 11:20:52 GMT
- Title: Empirical Evaluation of Public HateSpeech Datasets
- Authors: Sadar Jaf, Basel Barakat,
- Abstract summary: Social media platforms are widely utilised for generating datasets employed in training and evaluating machine learning algorithms for hate speech detection.
Existing public datasets exhibit numerous limitations, hindering the effective training of these algorithms and leading to inaccurate hate speech classification.
This work aims to advance the development of more accurate and reliable machine learning models for hate speech detection.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the extensive communication benefits offered by social media platforms, numerous challenges must be addressed to ensure user safety. One of the most significant risks faced by users on these platforms is targeted hate speech. Social media platforms are widely utilised for generating datasets employed in training and evaluating machine learning algorithms for hate speech detection. However, existing public datasets exhibit numerous limitations, hindering the effective training of these algorithms and leading to inaccurate hate speech classification. This study provides a comprehensive empirical evaluation of several public datasets commonly used in automated hate speech classification. Through rigorous analysis, we present compelling evidence highlighting the limitations of current hate speech datasets. Additionally, we conduct a range of statistical analyses to elucidate the strengths and weaknesses inherent in these datasets. This work aims to advance the development of more accurate and reliable machine learning models for hate speech detection by addressing the dataset limitations identified.
Related papers
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z) - On the Challenges of Building Datasets for Hate Speech Detection [0.0]
We first analyze the issues surrounding hate speech detection through a data-centric lens.
We then outline a holistic framework to encapsulate the data creation pipeline across seven broad dimensions.
arXiv Detail & Related papers (2023-09-06T11:15:47Z) - Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical
Evaluation [5.16706940452805]
We perform a large-scale cross-dataset comparison where we fine-tune language models on different hate speech detection datasets.
This analysis shows how some datasets are more generalisable than others when used as training data.
Experiments show how combining hate speech detection datasets can contribute to the development of robust hate speech detection models.
arXiv Detail & Related papers (2023-07-04T12:22:40Z) - CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a
Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z) - Panning for gold: Lessons learned from the platform-agnostic automated
detection of political content in textual data [48.7576911714538]
We discuss how these techniques can be used to detect political content across different platforms.
We compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks.
Our results show the limited impact of preprocessing on model performance, with the best results for less noisy data being achieved by neural network- and machine-learning-based models.
arXiv Detail & Related papers (2022-07-01T15:23:23Z) - Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Attribute Inference Attack of Speech Emotion Recognition in Federated
Learning Settings [56.93025161787725]
Federated learning (FL) is a distributed machine learning paradigm that coordinates clients to train a model collaboratively without sharing local data.
We propose an attribute inference attack framework that infers sensitive attribute information of the clients from shared gradients or model parameters.
We show that the attribute inference attack is achievable for SER systems trained using FL.
arXiv Detail & Related papers (2021-12-26T16:50:42Z) - Leveraging cross-platform data to improve automated hate speech
detection [0.0]
Most existing approaches for hate speech detection focus on a single social media platform in isolation.
Here we propose a new cross-platform approach to detect hate speech which leverages multiple datasets and classification models from different platforms.
We demonstrate how this approach outperforms existing models, and achieves good performance when tested on messages from novel social media platforms.
arXiv Detail & Related papers (2021-02-09T15:52:34Z) - Towards Hate Speech Detection at Large via Deep Generative Modeling [4.080068044420974]
Hate speech detection is a critical problem in social media platforms.
We present a dataset of 1 million realistic hate and non-hate sequences, produced by a deep generative language model.
We demonstrate consistent and significant performance improvements across five public hate speech datasets.
arXiv Detail & Related papers (2020-05-13T15:25:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.