Hate Speech Detection in Roman Urdu
- URL: http://arxiv.org/abs/2108.02830v1
- Date: Thu, 5 Aug 2021 19:49:46 GMT
- Title: Hate Speech Detection in Roman Urdu
- Authors: Moin Khan, Khurram Shahzad, Kamran Malik
- Abstract summary: This study is the first to address hate speech detection in Roman Urdu text.
We have scraped more than 90,000 tweets and manually parsed them to identify 5,000 Roman Urdu tweets.
We have employed an iterative approach to develop guidelines and used them for generating the Hate Speech Roman Urdu 2020 corpus.
- Score: 1.6436293069942314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hate speech is a specific type of controversial content that is widely
legislated as a crime that must be identified and blocked. However, due to the
sheer volume and velocity of the Twitter data stream, hate speech detection
cannot be performed manually. To address this issue, several studies have been
conducted for hate speech detection in European languages, whereas little
attention has been paid to low-resource South Asian languages, leaving
millions of social media users vulnerable. In particular, to the best of
our knowledge, no study has been conducted for hate speech detection in Roman
Urdu text, which is widely used in the sub-continent. In this study, we have
scraped more than 90,000 tweets and manually parsed them to identify 5,000
Roman Urdu tweets. Subsequently, we have employed an iterative approach to
develop guidelines and used them for generating the Hate Speech Roman Urdu 2020
corpus. The tweets in this corpus are classified at three levels:
Neutral-Hostile, Simple-Complex, and Offensive-Hate speech. As another
contribution, we have used five supervised learning techniques, including a
deep learning technique, to evaluate and compare their effectiveness for hate
speech detection. The results show that Logistic Regression outperformed all
other techniques, including deep learning techniques, for two of the three levels of
classification, achieving an F1 score of 0.906 for distinguishing between
Neutral-Hostile tweets, and 0.756 for distinguishing between Offensive-Hate
speech tweets.
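For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1 score for one class can be computed from gold and predicted labels. The toy labels and the helper name `f1_score` are illustrative assumptions, not data or code from the paper, and the abstract does not state which averaging scheme the authors used.

```python
def f1_score(y_true, y_pred, positive):
    """Return the F1 score, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels for a Neutral-Hostile split (not from the corpus).
gold = ["hostile", "neutral", "hostile", "neutral", "hostile"]
pred = ["hostile", "neutral", "neutral", "neutral", "hostile"]

score = f1_score(gold, pred, positive="hostile")
print(f"F1 (hostile): {score:.3f}")
```

In this toy case precision is 1.0 and recall is 2/3, so the harmonic mean is 0.8. A score such as 0.906 on the Neutral-Hostile level would be read the same way, subject to whichever averaging the authors applied.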
Related papers
- Hate Speech Detection and Classification in Amharic Text with Deep Learning [4.834669033093363]
We develop Amharic hate speech data and an SBi-LSTM deep learning model that can detect and classify text into four categories of hate speech.
We have annotated 5k Amharic social media posts and comments into four categories.
The model achieves a 94.8 F1-score performance.
arXiv Detail & Related papers (2024-08-07T15:46:45Z)
- Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts [0.0]
We first introduce ViTHSD, a targeted hate speech detection dataset for Vietnamese social media texts.
The dataset contains 10K comments, each labeled for specific targets at three levels: clean, offensive, and hate.
The inter-annotator agreement on the dataset is 0.45 by Cohen's Kappa, which indicates a moderate level of agreement.
arXiv Detail & Related papers (2024-04-30T04:16:55Z) - CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a
Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z) - Assessing the impact of contextual information in hate speech detection [0.48369513656026514]
We provide a novel corpus for contextualized hate speech detection based on user responses to news posts from media outlets on Twitter.
This corpus was collected in the Rioplatense dialectal variety of Spanish and focuses on hate speech associated with the COVID-19 pandemic.
arXiv Detail & Related papers (2022-10-02T09:04:47Z) - Overview of Abusive and Threatening Language Detection in Urdu at FIRE
2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the m-Bert based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z) - Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Ceasing hate withMoH: Hate Speech Detection in Hindi-English
Code-Switched Language [2.9926023796813728]
This work focuses on analyzing hate speech in Hindi-English code-switched language.
To contain the structure of data, we developed MoH or Map Only Hindi, which means "Love" in Hindi.
The MoH pipeline consists of language identification and Roman to Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
arXiv Detail & Related papers (2021-10-18T15:24:32Z) - Hate versus Politics: Detection of Hate against Policy makers in Italian
tweets [0.6289422225292998]
This paper addresses the issue of classification of hate speech against policy makers from Twitter in Italian.
We collected and annotated 1264 tweets, examined the cases of disagreements between annotators, and performed in-domain and cross-domain hate speech classifications.
We achieved a performance of ROC AUC 0.83 and analyzed the most predictive attributes, also finding the different language features in the anti-policymakers and anti-immigration domains.
arXiv Detail & Related papers (2021-07-12T12:24:45Z) - Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z) - Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media
during the COVID-19 Crisis [51.39895377836919]
COVID-19 has sparked racism and hate on social media targeted towards Asian communities.
We study the evolution and spread of anti-Asian hate speech through the lens of Twitter.
We create COVID-HATE, the largest dataset of anti-Asian hate and counterspeech spanning 14 months.
arXiv Detail & Related papers (2020-05-25T21:58:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.