Harnessing Pre-Trained Sentence Transformers for Offensive Language
Detection in Indian Languages
- URL: http://arxiv.org/abs/2310.02249v1
- Date: Tue, 3 Oct 2023 17:53:09 GMT
- Authors: Ananya Joshi, Raviraj Joshi
- Abstract summary: This work delves into the domain of hate speech detection, placing specific emphasis on three low-resource Indian languages: Bengali, Assamese, and Gujarati.
The challenge is framed as a text classification task, aimed at discerning whether a tweet contains offensive or non-offensive content.
We fine-tuned pre-trained BERT and SBERT models to evaluate their effectiveness in identifying hate speech.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In our increasingly interconnected digital world, social media platforms have
emerged as powerful channels for the dissemination of hate speech and offensive
content. This work delves into the domain of hate speech detection, placing
specific emphasis on three low-resource Indian languages: Bengali, Assamese,
and Gujarati. The challenge is framed as a text classification task, aimed at
discerning whether a tweet contains offensive or non-offensive content.
Leveraging the HASOC 2023 datasets, we fine-tuned pre-trained BERT and SBERT
models to evaluate their effectiveness in identifying hate speech. Our findings
underscore the superiority of monolingual sentence-BERT models, particularly in
the Bengali language, where we achieved the highest ranking. However, the
performance in Assamese and Gujarati languages signifies ongoing opportunities
for enhancement. Our goal is to foster inclusive online spaces by countering
hate speech proliferation.
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Cross-Linguistic Offensive Language Detection: BERT-Based Analysis of Bengali, Assamese, & Bodo Conversational Hateful Content from Social Media [0.8287206589886881]
This article delves into the comprehensive results and key revelations from the HASOC-2023 offensive language identification result.
The primary emphasis is placed on the meticulous detection of hate speech within the linguistic domains of Bengali, Assamese, and Bodo.
In this work, we used BERT models, including XLM-RoBERTa, L3Cube, IndicBERT, BanglaBERT, and BanglaHateBERT.
arXiv Detail & Related papers (2023-12-16T19:59:07Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- Spread Love Not Hate: Undermining the Importance of Hateful Pre-training for Hate Speech Detection [0.7874708385247353]
We study the effects of hateful pre-training on low resource hate speech classification tasks.
We evaluate different variations of tweet-based BERT models pre-trained on hateful, non-hateful, and mixed subsets of a 40M-tweet dataset.
We show that pre-training on non-hateful text from the target domain provides similar or better results.
arXiv Detail & Related papers (2022-10-09T13:53:06Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the m-BERT-based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Leveraging Transformers for Hate Speech Detection in Conversational Code-Mixed Tweets [36.29939722039909]
This paper describes the system proposed by team MIDAS-IIITD for HASOC 2021 subtask 2.
It is one of the first shared tasks focusing on detecting hate speech from Hindi-English code-mixed conversations on Twitter.
Our best performing system, a hard voting ensemble of Indic-BERT, XLM-RoBERTa, and Multilingual BERT, achieved a macro F1 score of 0.7253.
arXiv Detail & Related papers (2021-12-18T19:27:33Z)
- One to rule them all: Towards Joint Indic Language Hate Speech Detection [7.296361860015606]
We present a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection.
On the provided testing corpora, we achieve Macro F1 scores of 0.7996, 0.7748, and 0.8651 for sub-task 1A, and 0.6268 and 0.5603 for the fine-grained classification of sub-task 1B.
arXiv Detail & Related papers (2021-09-28T13:30:00Z)
- Evaluation of Deep Learning Models for Hostility Detection in Hindi Text [2.572404739180802]
We present approaches for hostile text detection in the Hindi language.
The proposed approaches are evaluated on the Constraint@AAAI 2021 Hindi hostility detection dataset.
We evaluate a host of deep learning approaches based on CNN, LSTM, and BERT for this multi-label classification problem.
arXiv Detail & Related papers (2021-01-11T19:10:57Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
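Several of the entries above report macro F1 scores, and one describes a hard voting ensemble of three models. As a rough illustration only, the sketch below shows how those two ideas can be computed in plain Python. The label names ("OFF", "NOT"), the toy predictions, and the function names `hard_vote` and `macro_f1` are all hypothetical and are not taken from any of the listed papers.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label per example across several models.

    `predictions_per_model` is a list of equal-length label lists,
    one list per model (hypothetical input format).
    """
    voted = []
    for labels in zip(*predictions_per_model):
        # most_common(1) yields the label with the highest vote count
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, invented for illustration: gold labels and three models' outputs.
gold = ["OFF", "NOT", "OFF", "NOT", "NOT", "OFF"]
model_a = ["OFF", "NOT", "NOT", "NOT", "OFF", "OFF"]
model_b = ["OFF", "OFF", "OFF", "NOT", "NOT", "OFF"]
model_c = ["NOT", "NOT", "OFF", "NOT", "NOT", "NOT"]

ensemble = hard_vote([model_a, model_b, model_c])
print("ensemble macro F1:", macro_f1(gold, ensemble))
print("model_c macro F1:", macro_f1(gold, model_c))
```

Macro averaging gives each class equal weight regardless of how many examples it has, which is one reason it is commonly reported when offensive and non-offensive examples are imbalanced.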
This list is automatically generated from the titles and abstracts of the papers on this site.