Uncovering Political Hate Speech During Indian Election Campaign: A New
Low-Resource Dataset and Baselines
- URL: http://arxiv.org/abs/2306.14764v2
- Date: Tue, 27 Jun 2023 16:55:14 GMT
- Title: Uncovering Political Hate Speech During Indian Election Campaign: A New
Low-Resource Dataset and Baselines
- Authors: Farhan Ahmad Jafri, Mohammad Aman Siddiqui, Surendrabikram Thapa,
Kritesh Rauniyar, Usman Naseem, Imran Razzak
- Abstract summary: IEHate dataset contains 11,457 manually annotated Hindi tweets related to the Indian Assembly Election Campaign from November 1, 2021, to March 9, 2022.
We benchmark the dataset using a range of machine learning, deep learning, and transformer-based algorithms.
In particular, the relatively higher score of human evaluation over algorithms emphasizes the importance of utilizing both human and automated approaches for effective hate speech moderation.
- Score: 3.3228144010758593
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The detection of hate speech in political discourse is a critical issue, and
this becomes even more challenging in low-resource languages. To address this
issue, we introduce a new dataset named IEHate, which contains 11,457 manually
annotated Hindi tweets related to the Indian Assembly Election Campaign from
November 1, 2021, to March 9, 2022. We performed a detailed analysis of the
dataset, focusing on the prevalence of hate speech in political communication
and the different forms of hateful language used. Additionally, we benchmark
the dataset using a range of machine learning, deep learning, and
transformer-based algorithms. Our experiments reveal that the performance of
these models can be further improved, highlighting the need for more advanced
techniques for hate speech detection in low-resource languages. In particular,
the relatively higher score of human evaluation over algorithms emphasizes the
importance of utilizing both human and automated approaches for effective hate
speech moderation. Our IEHate dataset can serve as a valuable resource for
researchers and practitioners working on developing and evaluating hate speech
detection techniques in low-resource languages. Overall, our work underscores
the importance of addressing the challenges of identifying and mitigating hate
speech in political discourse, particularly in the context of low-resource
languages. The dataset and resources for this work are made available at
https://github.com/Farhan-jafri/Indian-Election.
Related papers
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - Hate Speech Detection in Limited Data Contexts using Synthetic Data
Generation [1.9506923346234724]
We propose a data augmentation approach that addresses the problem of lack of data for online hate speech detection in limited data contexts.
We present three methods to synthesize new examples of hate speech data in a target language that retains the hate sentiment in the original examples but transfers the hate targets.
Our findings show that a model trained on synthetic data performs comparably to, and in some cases outperforms, a model trained only on the samples available in the target domain.
arXiv Detail & Related papers (2023-10-04T15:10:06Z) - LAHM : Large Annotated Dataset for Multi-Domain and Multilingual Hate
Speech Identification [2.048680519934008]
We present a new multilingual hate speech analysis dataset for English, Hindi, Arabic, French, German and Spanish languages.
This paper is the first to address the problem of identifying various types of hate speech in these five wide domains in these six languages.
arXiv Detail & Related papers (2023-04-03T12:03:45Z) - Data-Efficient Strategies for Expanding Hate Speech Detection into
Under-Resourced Languages [35.185808055004344]
Most hate speech datasets so far focus on English-language content.
More data is needed, but annotating hateful content is expensive, time-consuming and potentially harmful to annotators.
We explore data-efficient strategies for expanding hate speech detection into under-resourced languages.
arXiv Detail & Related papers (2022-10-20T15:49:00Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z) - COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose textscCOLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Cross-lingual hate speech detection based on multilingual
domain-specific word embeddings [4.769747792846004]
We propose to address the problem of multilingual hate speech detection from the perspective of transfer learning.
Our goal is to determine if knowledge from one particular language can be used to classify other language.
We show that the use of our simple yet specific multilingual hate representations improves classification results.
arXiv Detail & Related papers (2021-04-30T02:24:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.