Hate Speech Detection in Limited Data Contexts using Synthetic Data
Generation
- URL: http://arxiv.org/abs/2310.02876v1
- Date: Wed, 4 Oct 2023 15:10:06 GMT
- Title: Hate Speech Detection in Limited Data Contexts using Synthetic Data
Generation
- Authors: Aman Khullar, Daniel Nkemelu, Cuong V. Nguyen, Michael L. Best
- Abstract summary: We propose a data augmentation approach that addresses the problem of lack of data for online hate speech detection in limited data contexts.
We present three methods to synthesize new examples of hate speech data in a target language that retain the hate sentiment in the original examples but transfer the hate targets.
Our findings show that a model trained on synthetic data performs comparably to, and in some cases outperforms, a model trained only on the samples available in the target domain.
- Score: 1.9506923346234724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A growing body of work has focused on text classification methods for
detecting the increasing amount of hate speech posted online. This progress has
been limited to a small number of highly resourced languages, causing
detection systems to either underperform or not exist in limited data
contexts. This is largely due to a lack of training data, which is expensive
to collect and curate in these settings. In this work, we propose a data
augmentation approach that addresses the problem of lack of data for online
hate speech detection in limited data contexts using synthetic data generation
techniques. Given a handful of hate speech examples in a high-resource language
such as English, we present three methods to synthesize new examples of hate
speech data in a target language that retain the hate sentiment in the
original examples but transfer the hate targets. We apply our approach to
generate training data for hate speech classification tasks in Hindi and
Vietnamese. Our findings show that a model trained on synthetic data performs
comparably to, and in some cases outperforms, a model trained only on the
samples available in the target domain. This method can be adopted to bootstrap
hate speech detection models from scratch in limited data contexts. As the
growth of social media within these contexts continues to outstrip response
efforts, this work furthers our capacities for detection, understanding, and
response to hate speech.
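
The target-transfer idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the three methods, so the `Example` type, the `transfer_targets` helper, and the placeholder strings below are all hypothetical. The sketch covers only the step of swapping the target while keeping the label, and leaves out translation into the target language.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Example:
    text: str    # sentence containing the source target term
    target: str  # the target term exactly as it appears in text
    label: int   # 1 = hateful, 0 = not hateful


def transfer_targets(examples, new_targets):
    """Build synthetic examples by swapping each source target for new ones.

    Hypothetical helper: the label is copied unchanged, on the assumption
    that only the target changes and the sentiment is preserved.
    """
    synthetic = []
    for ex, new_target in product(examples, new_targets):
        if ex.target not in ex.text:
            continue  # skip examples whose target span cannot be located
        synthetic.append(
            Example(ex.text.replace(ex.target, new_target), new_target, ex.label)
        )
    return synthetic


# Neutral placeholders stand in for real text and real target groups.
seed = [Example("<abusive phrase> GROUP_A", "GROUP_A", 1)]
augmented = transfer_targets(seed, ["GROUP_X", "GROUP_Y"])
```

Each seed example yields one synthetic example per new target, so a handful of seeds can be expanded into a larger training set; whether the swapped sentences remain plausible in the target context would still need to be checked.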
Related papers
- Hierarchical Sentiment Analysis Framework for Hate Speech Detection: Implementing Binary and Multiclass Classification Strategy [0.0]
We propose a new multitask model integrated with shared emotional representations to detect hate speech across the English language.
We conclude that utilizing sentiment analysis and a Transformer-based trained model considerably improves hate speech detection across multiple datasets.
arXiv Detail & Related papers (2024-11-03T04:11:33Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset produced by Indians (in English) was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation [5.16706940452805]
We perform a large-scale cross-dataset comparison where we fine-tune language models on different hate speech detection datasets.
This analysis shows how some datasets are more generalisable than others when used as training data.
Experiments show how combining hate speech detection datasets can contribute to the development of robust hate speech detection models.
arXiv Detail & Related papers (2023-07-04T12:22:40Z)
- CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z)
- APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation Datasets [4.034948808542701]
APEACH is a method that allows the collection of hate speech generated by unspecified users.
By controlling the crowd-generation of hate speech and adding only a minimum post-labeling, we create a corpus that enables the generalizable and fair evaluation of hate speech detection.
arXiv Detail & Related papers (2022-02-25T02:04:38Z)
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Towards Hate Speech Detection at Large via Deep Generative Modeling [4.080068044420974]
Hate speech detection is a critical problem in social media platforms.
We present a dataset of 1 million realistic hate and non-hate sequences, produced by a deep generative language model.
We demonstrate consistent and significant performance improvements across five public hate speech datasets.
arXiv Detail & Related papers (2020-05-13T15:25:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.