Abusive and Threatening Language Detection in Urdu using Boosting based
and BERT based models: A Comparative Approach
- URL: http://arxiv.org/abs/2111.14830v1
- Date: Sat, 27 Nov 2021 20:03:19 GMT
- Title: Abusive and Threatening Language Detection in Urdu using Boosting based
and BERT based models: A Comparative Approach
- Authors: Mithun Das, Somnath Banerjee, Punyajoy Saha
- Abstract summary: In this paper, we explore several machine learning models for abusive and threatening content detection in Urdu based on the shared task.
Our model ranked first for both abusive and threatening content detection, with F1 scores of 0.88 and 0.54, respectively.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online hatred is a growing concern on many social media platforms. To address
this issue, different social media platforms have introduced moderation
policies for such content. They also employ moderators who can check the posts
violating moderation policies and take appropriate action. Academicians in the
abusive language research domain also perform various studies to detect such
content better. Although there is extensive research on abusive language
detection in English, there is a lacuna in abusive language detection in
low-resource languages such as Hindi and Urdu. In the FIRE 2021 shared task
"HASOC - Abusive and Threatening Language Detection in Urdu", the organizers
propose an Urdu dataset for abusive language detection along with one for
threatening language detection. In this paper, we explored several machine
learning models, such as XGBoost, LGBM, and m-BERT based models, for abusive
and threatening content detection in Urdu based on the shared task. We
observed that a Transformer model pre-trained on an abusive language dataset
in Arabic yields the best performance. Our model ranked first for both abusive
and threatening content detection, with F1 scores of 0.88 and 0.54,
respectively.
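For context, here is a minimal sketch of the fine-tuning setup described in the abstract, assuming the public bert-base-multilingual-cased checkpoint and CSV files of labelled Urdu tweets; the Arabic abusive-language checkpoint that gave the authors their best result, and their exact hyperparameters, are not specified here.

```python
# Hedged sketch: fine-tune a multilingual BERT for binary abusive-content
# detection in Urdu. Checkpoint, file names and hyperparameters are
# illustrative assumptions, not the authors' exact setup.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-multilingual-cased"  # m-BERT; an Arabic abusive-language
                                             # model could be swapped in to mirror
                                             # the paper's best run
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Expect CSV files with "text" and "label" columns (0 = not abusive, 1 = abusive).
ds = load_dataset("csv", data_files={"train": "urdu_train.csv", "test": "urdu_test.csv"})
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=128), batched=True)

def macro_f1(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(output_dir="urdu-abuse-mbert",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                  eval_dataset=ds["test"], tokenizer=tok,
                  compute_metrics=macro_f1)
trainer.train()
print(trainer.evaluate())
```

A TF-IDF feature pipeline feeding XGBoost or LGBM classifiers would serve as the boosting-based baselines the abstract mentions; only the transformer variant is sketched here.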
Related papers
- Fine-Tuning Llama 2 Large Language Models for Detecting Online Sexual
Predatory Chats and Abusive Texts [2.406214748890827]
This paper proposes an approach to detecting online sexual predatory chats and abusive language using the open-source pretrained Llama 2 7B-parameter model.
We fine-tune the LLM using datasets with different sizes, imbalance degrees, and languages (i.e., English, Roman Urdu, and Urdu).
Experimental results show a strong performance of the proposed approach, which performs proficiently and consistently across three distinct datasets.
arXiv Detail & Related papers (2023-08-28T16:18:50Z)
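A hedged sketch of one common way to fine-tune Llama 2 7B as a binary abuse classifier, using LoRA via the peft library; the authors' exact recipe, data format, and hyperparameters are assumptions here.

```python
# Illustrative LoRA fine-tuning of Llama 2 7B as a sequence classifier.
# Dataset paths and hyperparameters are placeholders; a GPU is assumed (fp16).
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"            # gated checkpoint; requires access
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token                # Llama has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)
model.config.pad_token_id = tok.pad_token_id

lora = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                  lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)          # train only low-rank adapters

ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=256), batched=True)

args = TrainingArguments(output_dir="llama2-abuse", per_device_train_batch_size=4,
                         gradient_accumulation_steps=8, num_train_epochs=1,
                         learning_rate=2e-4, fp16=True, logging_steps=50)
Trainer(model=model, args=args, train_dataset=ds["train"],
        eval_dataset=ds["test"], tokenizer=tok).train()
```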
- Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of content moderation evasion.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
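To illustrate the word camouflage that entry targets, a small self-contained toy example follows; the substitution table and normaliser are simplified assumptions, not the article's actual tooling.

```python
# Toy simulation and reversal of leetspeak-style word camouflage.
# Real evasion uses far richer tricks (homoglyphs, mixed scripts, etc.).
import re

SUBS = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "$"}
REVERSE = {v: k for k, v in SUBS.items()}

def camouflage(word: str) -> str:
    """Replace characters and inject separators to dodge keyword filters."""
    swapped = "".join(SUBS.get(c, c) for c in word.lower())
    return ".".join(swapped)                 # e.g. "abuse" -> "4.b.u.$.3"

def normalise(token: str) -> str:
    """Undo the toy camouflage so a keyword filter can match again."""
    stripped = re.sub(r"[.\-_*\s]+", "", token.lower())
    return "".join(REVERSE.get(c, c) for c in stripped)

print(camouflage("abuse"))                   # 4.b.u.$.3
print(normalise("4.b.u.$.3"))                # abuse
```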
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the m-BERT based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages [5.51252705016179]
We demonstrate a large-scale analysis of multilingual abusive speech in Indic languages.
We examine different interlingual transfer mechanisms and observe the performance of various multilingual models for abusive speech detection.
arXiv Detail & Related papers (2022-04-26T18:56:01Z)
- LaMDA: Language Models for Dialog Applications [75.75051929981933]
LaMDA is a family of Transformer-based neural language models specialized for dialog.
Fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements.
arXiv Detail & Related papers (2022-01-20T15:44:37Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study the output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
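One standard remedy for the label imbalance noted above is inverse-frequency class weighting; the sketch below is illustrative and does not reproduce that paper's own balancing strategy.

```python
# Hedged sketch: compute inverse-frequency class weights for an imbalanced
# hate-speech dataset and plug them into a weighted cross-entropy loss.
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# Toy label distribution: 90% non-hate (0), 10% hate (1).
labels = np.array([0] * 900 + [1] * 100)

weights = compute_class_weight("balanced", classes=np.unique(labels), y=labels)
print(weights)                        # roughly [0.56, 5.0]

# A PyTorch classifier can then penalise mistakes on the rare class more:
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))

logits = torch.randn(8, 2)            # fake batch of model outputs
targets = torch.randint(0, 2, (8,))   # fake gold labels
print(loss_fn(logits, targets))
```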
- Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media [1.8899300124593648]
The prevalence of Hindi-English code-mixed data (Hinglish) is on the rise among urban populations all over the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of offensive and abusive content in unstructured code-mixed Hinglish text.
arXiv Detail & Related papers (2021-05-11T10:02:28Z)
- Hostility Detection in Hindi leveraging Pre-Trained Language Models [1.6436293069942312]
This paper presents a transfer learning based approach to classify social media posts in Hindi Devanagari script as Hostile or Non-Hostile.
Hostile posts are further analyzed to determine whether they are Hateful, Fake, Defamation, or Offensive.
We establish a robust and consistent model without any ensembling or complex pre-processing.
arXiv Detail & Related papers (2021-01-14T08:04:32Z)
- Detecting Social Media Manipulation in Low-Resource Languages [29.086752995321724]
Malicious actors share content across countries and languages, including low-resource ones.
We investigate whether and to what extent malicious actors can be detected in low-resource language settings.
By combining text embedding and transfer learning, our framework can detect, with promising accuracy, malicious users posting in Tagalog.
arXiv Detail & Related papers (2020-11-10T19:38:03Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves an F1 score of 91.51% on English Sub-task A, which is comparable to the first-place system.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
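A hedged sketch of the multi-task idea in the last entry: a shared BERT encoder with one classification head per OffensEval sub-task. Layer sizes, the task set, and the equal loss weighting are assumptions, not the authors' exact architecture.

```python
# Illustrative multi-task model: a shared BERT encoder with one linear head
# per sub-task; the joint loss is a simple sum of the per-task losses.
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskOffense(nn.Module):
    """Shared BERT encoder with one classification head per sub-task."""

    def __init__(self, checkpoint="bert-base-uncased", labels_per_task=(2, 2, 3)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleList([nn.Linear(hidden, n) for n in labels_per_task])
        self.loss = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        # Use the [CLS] token representation as a sentence embedding.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]
        logits = [head(cls) for head in self.heads]
        if labels is None:                       # inference
            return logits
        # labels: (batch, num_tasks); sum per-task losses with equal weight.
        total = sum(self.loss(lg, labels[:, i]) for i, lg in enumerate(logits))
        return total, logits
```

In the actual OffensEval setup, sub-tasks B and C are only labelled for offensive examples, so a full implementation would mask the corresponding loss terms rather than summing them unconditionally.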