Code-Mixed Telugu-English Hate Speech Detection
- URL: http://arxiv.org/abs/2502.10632v1
- Date: Sat, 15 Feb 2025 02:03:13 GMT
- Title: Code-Mixed Telugu-English Hate Speech Detection
- Authors: Santhosh Kakarla, Gautama Shastry Bulusu Venkata
- Abstract summary: This study investigates transformer-based models, including TeluguHateBERT, HateBERT, DeBERTa, Muril, IndicBERT, Roberta, and Hindi-Abusive-MuRIL, for classifying hate speech in Telugu.
We fine-tune these models using Low-Rank Adaptation (LoRA) to optimize efficiency and performance.
We translate Telugu text into English using Google Translate to assess its impact on classification accuracy.
- Score: 0.0
- Abstract: Hate speech detection in low-resource languages like Telugu is a growing challenge in NLP. This study investigates transformer-based models, including TeluguHateBERT, HateBERT, DeBERTa, Muril, IndicBERT, Roberta, and Hindi-Abusive-MuRIL, for classifying hate speech in Telugu. We fine-tune these models using Low-Rank Adaptation (LoRA) to optimize efficiency and performance. Additionally, we explore a multilingual approach by translating Telugu text into English using Google Translate to assess its impact on classification accuracy. Our experiments reveal that most models show improved performance after translation, with DeBERTa and Hindi-Abusive-MuRIL achieving higher accuracy and F1 scores compared to training directly on Telugu text. Notably, Hindi-Abusive-MuRIL outperforms all other models in both the original Telugu dataset and the translated dataset, demonstrating its robustness across different linguistic settings. This suggests that translation enables models to leverage richer linguistic features available in English, leading to improved classification performance. The results indicate that multilingual processing can be an effective approach for hate speech detection in low-resource languages. These findings demonstrate that transformer models, when fine-tuned appropriately, can significantly improve hate speech detection in Telugu, paving the way for more robust multilingual NLP applications.
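To make the pipeline concrete, the following is a minimal sketch of the translate-then-classify setup the abstract describes: code-mixed text is machine-translated to English, and a transformer classifier is fine-tuned with LoRA adapters. The deep-translator package (standing in for the Google Translate service), the DeBERTa checkpoint, the binary label scheme, and all hyperparameters are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of the translate-then-classify pipeline described above.
# Assumptions (not from the paper): deep-translator stands in for the Google
# Translate service, DeBERTa stands in for the seven compared models, and the
# LoRA hyperparameters, label scheme, and training settings are illustrative.
from deep_translator import GoogleTranslator
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Stage 1: translate code-mixed Telugu-English posts into English.
posts = ["<code-mixed Telugu-English post>"]   # placeholder data
labels = [1]                                   # 1 = hate, 0 = non-hate (assumed)
translator = GoogleTranslator(source="auto", target="en")
english_posts = [translator.translate(p) for p in posts]

# Stage 2: LoRA fine-tuning. Small low-rank adapter matrices are trained
# while the base transformer weights stay frozen.
model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model = get_peft_model(model, LoraConfig(task_type=TaskType.SEQ_CLS,
                                         r=8, lora_alpha=16, lora_dropout=0.1))
model.print_trainable_parameters()             # prints the tiny trainable fraction

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_ds = Dataset.from_dict({"text": english_posts,
                              "label": labels}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="telugu-hate-lora",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()
```

Training once on the original Telugu text and once on the translations, and swapping in the other checkpoints the abstract lists, would reproduce the reported comparison.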
Related papers
- Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning [0.4194295877935868]
This study investigates the effects of Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning (PEFT) on multilingual Gemma models for Marathi.
Using a translated dataset with 52,000 instruction-response pairs, our findings reveal that while evaluation performance declines post-fine-tuning, manual assessments frequently suggest that the fine-tuned models outperform their original counterparts.
arXiv Detail & Related papers (2024-11-27T18:14:38Z)
- Impact of Tokenization on LLaMa Russian Adaptation [0.0]
We investigate the possibility of addressing this tokenization issue with vocabulary substitution in the context of LLaMa Russian-language adaptation.
The results of automatic evaluation show that vocabulary substitution improves the model's quality in Russian.
Additional human evaluation of the instruction-tuned models demonstrates that models with Russian-adapted vocabulary generate answers with higher user preference.
arXiv Detail & Related papers (2023-12-05T09:16:03Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Fine-tuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, fine-tuning self-supervised representations is the better-performing and more viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Evaluating the Effectiveness of Natural Language Inference for Hate Speech Detection in Languages with Limited Labeled Data [2.064612766965483]
Natural language inference (NLI) models that perform well in zero- and few-shot settings can benefit hate speech detection.
Our evaluation on five languages demonstrates large performance improvements of NLI fine-tuning over direct fine-tuning in the target language (a zero-shot NLI sketch follows this entry).
arXiv Detail & Related papers (2023-06-06T14:40:41Z)
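For context on the NLI approach in the entry above, here is a minimal zero-shot sketch: an NLI model scores whether the input entails a hypothesis built from each candidate label. The facebook/bart-large-mnli checkpoint and the label wording are assumptions; the paper itself fine-tunes NLI models rather than using one off the shelf.

```python
# Zero-shot hate speech detection via an NLI model: each candidate label is
# turned into a hypothesis ("This text is hate speech.") and scored for
# entailment against the input. Checkpoint and label wording are assumptions.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "Example post in the target language.",
    candidate_labels=["hate speech", "not hate speech"],
    hypothesis_template="This text is {}.",
)
print(result["labels"][0], result["scores"][0])   # top label and its score
```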
- Analyzing and Reducing the Performance Gap in Cross-Lingual Transfer with Fine-tuning Slow and Fast [50.19681990847589]
Existing research has shown that a multilingual pre-trained language model fine-tuned with one (source) language also performs well on downstream tasks for non-source languages.
This paper analyzes the fine-tuning process, discovers when the performance gap changes and identifies which network weights affect the overall performance most.
arXiv Detail & Related papers (2023-05-19T06:04:21Z)
- Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages [35.185808055004344]
Most hate speech datasets so far focus on English-language content.
More data is needed, but annotating hateful content is expensive, time-consuming and potentially harmful to annotators.
We explore data-efficient strategies for expanding hate speech detection into under-resourced languages.
arXiv Detail & Related papers (2022-10-20T15:49:00Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance in hate speech datasets, since the high ratio of non-hate to hate examples often leads to low model performance (a class-weighting sketch follows this entry).
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
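The label-imbalance issue raised in the entry above is commonly handled by weighting the loss inversely to class frequency; the sketch below shows that remedy in PyTorch. It is one standard fix, not necessarily the one that paper adopts, and the toy label counts are made up.

```python
# One common remedy for label imbalance (the entry above does not specify the
# paper's exact fix): weight the loss inversely to class frequency.
import torch
from collections import Counter

labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]          # toy 80/20 non-hate vs hate split
counts = Counter(labels)
n, k = len(labels), len(counts)
# weight_c = n / (k * count_c): rarer classes get proportionally larger weights
weights = torch.tensor([n / (k * counts[c]) for c in sorted(counts)],
                       dtype=torch.float)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, k)                        # stand-in model outputs
targets = torch.tensor([0, 1, 1, 0])
print(loss_fn(logits, targets))                   # hate-class errors now cost ~4x more
```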
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Deep Learning Models for Multilingual Hate Speech Detection [5.977278650516324]
In this paper, we conduct a large-scale analysis of multilingual hate speech in 9 languages from 16 different sources.
We observe that in the low-resource setting, simple models such as LASER embeddings with logistic regression perform the best (see the sketch after this entry).
In the case of zero-shot classification, languages such as Italian and Portuguese achieve good results.
arXiv Detail & Related papers (2020-04-14T13:14:27Z)
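As a concrete picture of the simple baseline highlighted in the entry above, here is a sketch pairing LASER sentence embeddings with scikit-learn logistic regression. The laserembeddings package, the language codes, and the toy data are illustrative assumptions.

```python
# LASER sentence embeddings + logistic regression: language-agnostic
# embeddings let one shallow classifier serve many languages.
# The laserembeddings package is an assumption; texts/labels are placeholders.
from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression

texts = ["esempio di testo", "another example text"]   # placeholder sentences
labels = [1, 0]                                        # 1 = hate, 0 = non-hate (assumed)

laser = Laser()                # requires `python -m laserembeddings download-models`
X = laser.embed_sentences(texts, lang=["it", "en"])    # 1024-dim embedding per sentence
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(laser.embed_sentences(["nuovo testo"], lang="it")))
```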
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language may fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)