Hate Speech and Offensive Content Detection in Indo-Aryan Languages: A
Battle of LSTM and Transformers
- URL: http://arxiv.org/abs/2312.05671v1
- Date: Sat, 9 Dec 2023 20:24:00 GMT
- Title: Hate Speech and Offensive Content Detection in Indo-Aryan Languages: A
Battle of LSTM and Transformers
- Authors: Nikhil Narayan, Mrutyunjay Biswal, Pramod Goyal, Abhranta Panigrahi
- Abstract summary: We conduct a comparative analysis of hate speech classification across five distinct languages: Bengali, Assamese, Bodo, Sinhala, and Gujarati.
BERT Base Multilingual Cased emerges as a strong performer across languages, achieving an F1 score of 0.67027 for Bengali and 0.70525 for Assamese.
In Sinhala, XLM-R stands out with an F1 score of 0.83493, whereas for Gujarati, a custom LSTM-based model performs best with an F1 score of 0.76601.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social media platforms serve as accessible outlets for individuals to express
their thoughts and experiences, resulting in an influx of user-generated data
spanning all age groups. While these platforms enable free expression, they
also present significant challenges, including the proliferation of hate speech
and offensive content. Such objectionable language disrupts objective discourse
and can lead to radicalization of debates, ultimately threatening democratic
values. Consequently, organizations have taken steps to monitor and curb
abusive behavior, necessitating automated methods for identifying suspicious
posts. This paper contributes to Hate Speech and Offensive Content
Identification in English and Indo-Aryan Languages (HASOC) 2023 shared tasks
track. We, team Z-AGI Labs, conduct a comprehensive comparative analysis of
hate speech classification across five distinct languages: Bengali, Assamese,
Bodo, Sinhala, and Gujarati. Our study encompasses a wide range of pre-trained
models, including BERT variants, XLM-R, and LSTM models, to assess their
performance in identifying hate speech across these languages. Results reveal
intriguing variations in model performance. Notably, BERT Base Multilingual
Cased emerges as a strong performer across languages, achieving an F1 score of
0.67027 for Bengali and 0.70525 for Assamese. It also significantly
outperforms the other models for Bodo, with an F1 score of 0.83009. In
Sinhala, XLM-R stands out with an F1 score of 0.83493, whereas for Gujarati, a
custom LSTM-based model performs best with an F1 score of 0.76601. This
study offers valuable insights into the suitability of various pre-trained
models for hate speech detection in multilingual settings. By considering the
nuances of each language, our research contributes to informed model selection
for building robust hate speech detection systems.
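As a concrete illustration of the comparison described above, the following minimal sketch (not the authors' released code) fine-tunes bert-base-multilingual-cased as a binary hate/offensive classifier with Hugging Face Transformers; the CSV file names, label scheme, and hyperparameters are illustrative assumptions.

    # Hedged sketch: per-language fine-tuning of multilingual BERT for
    # hate/offensive classification. Paths, labels, and hyperparameters are assumed.
    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    MODEL_NAME = "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

    # Hypothetical CSVs with "text" and "label" (0 = not hate/offensive, 1 = hate/offensive).
    data = load_dataset("csv", data_files={"train": "bengali_train.csv",
                                           "validation": "bengali_dev.csv"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=128)

    data = data.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="mbert-hasoc-bengali",
                               num_train_epochs=3,
                               per_device_train_batch_size=16,
                               learning_rate=2e-5),
        train_dataset=data["train"],
        eval_dataset=data["validation"],
        tokenizer=tokenizer,  # default collator then pads each batch dynamically
    )
    trainer.train()
    print(trainer.evaluate())  # add a macro-F1 metric to mirror the paper's reporting

Repeating this loop per language, or swapping MODEL_NAME for an XLM-R or LSTM baseline, reproduces the spirit of the per-language comparison reported above.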
Related papers
- HateTinyLLM : Hate Speech Detection Using Tiny Large Language Models [0.0]
Hate speech encompasses verbal, written, or behavioral communication that uses derogatory or discriminatory language against individuals or groups.
HateTinyLLM is a novel framework based on fine-tuned decoder-only tiny large language models (tinyLLMs) for efficient hate speech detection.
arXiv Detail & Related papers (2024-04-26T05:29:35Z) - Analysis and Detection of Multilingual Hate Speech Using Transformer
Based Deep Learning [7.332311991395427]
As the prevalence of hate speech online grows, so does the demand for automated detection as an NLP task.
In this work, the proposed method uses transformer-based models to detect hate speech on social media platforms such as Twitter, Facebook, WhatsApp, and Instagram.
The gold-standard datasets were collected from renowned researchers Zeerak Talat, Sara Tonelli, Melanie Siegel, and Rezaul Karim.
The proposed model outperforms the existing baseline and state-of-the-art models, with an accuracy of 89% on the Bengali dataset and 91% on English, and results also reported for German.
arXiv Detail & Related papers (2024-01-19T20:40:23Z) - Harnessing Pre-Trained Sentence Transformers for Offensive Language
Detection in Indian Languages [0.6526824510982802]
This work delves into the domain of hate speech detection, placing specific emphasis on three low-resource Indian languages: Bengali, Assamese, and Gujarati.
The challenge is framed as a text classification task aimed at discerning whether a tweet is offensive or non-offensive.
We fine-tuned pre-trained BERT and SBERT models to evaluate their effectiveness in identifying hate speech.
arXiv Detail & Related papers (2023-10-03T17:53:09Z) - Bag of Tricks for Effective Language Model Pretraining and Downstream
Adaptation: A Case Study on GLUE [93.98660272309974]
This report briefly describes our submission Vega v1 on the General Language Understanding Evaluation leaderboard.
GLUE is a collection of nine natural language understanding tasks, including question answering, linguistic acceptability, sentiment analysis, text similarity, paraphrase detection, and natural language inference.
With our optimized pretraining and fine-tuning strategies, our 1.3-billion-parameter model sets a new state of the art on 4 of the 9 tasks, achieving the best average score of 91.3.
arXiv Detail & Related papers (2023-02-18T09:26:35Z) - Overview of Abusive and Threatening Language Detection in Urdu at FIRE
2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, an m-BERT-based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z) - Exploring Transformer Based Models to Identify Hate Speech and Offensive
Content in English and Indo-Aryan Languages [0.0]
We explore several Transformer based machine learning models for the detection of hate speech and offensive content in English and Indo-Aryan languages.
Our models placed 2nd on the code-mixed dataset (Macro F1: 0.7107), 2nd in the Hindi two-class classification (Macro F1: 0.7797), 4th in the English four-class category (Macro F1: 0.8006), and 12th in the English two-class category (Macro F1: 0.6447).
arXiv Detail & Related papers (2021-11-27T19:26:14Z) - One to rule them all: Towards Joint Indic Language Hate Speech Detection [7.296361860015606]
We present a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection.
On the provided testing corpora, we achieve Macro F1 scores of 0.7996, 0.7748, and 0.8651 for sub-task 1A, and 0.6268 and 0.5603 for the fine-grained classification of sub-task 1B.
arXiv Detail & Related papers (2021-09-28T13:30:00Z) - Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z) - Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for
Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves a 91.51% F1 score on English Sub-task A, which is comparable to the first-place result.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.