Cyberbullying Detection for Low-resource Languages and Dialects: Review
of the State of the Art
- URL: http://arxiv.org/abs/2308.15745v1
- Date: Wed, 30 Aug 2023 03:52:28 GMT
- Title: Cyberbullying Detection for Low-resource Languages and Dialects: Review
of the State of the Art
- Authors: Tanjim Mahmud, Michal Ptaszynski, Juuso Eronen and Fumito Masui
- Abstract summary: There are 23 low-resource languages and dialects covered by this paper, including Bangla, Hindi, Dravidian languages and others.
In the survey, we identify some of the research gaps of previous studies, which include the lack of reliable definitions of cyberbullying.
Based on those proposed suggestions, we collect and release a cyberbullying dataset in the Chittagonian dialect of Bangla.
- Score: 0.9831489366502298
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The struggle of social media platforms to moderate content in a timely
manner encourages users to abuse such platforms to spread vulgar or abusive
language, which, when performed repeatedly, becomes cyberbullying: a social
problem taking place in virtual environments, yet with real-world consequences,
such as depression, withdrawal, or even suicide attempts of its victims.
Systems for the automatic detection and mitigation of cyberbullying have been
developed but, unfortunately, the vast majority of them are for the English
language, with only a handful available for low-resource languages. To estimate
the present state of research and recognize the needs for further development,
in this paper we present a comprehensive systematic survey of studies done so
far for automatic cyberbullying detection in low-resource languages. We
analyzed all studies on this topic that were available. We investigated more
than seventy published studies on automatic detection of cyberbullying or
related language in low-resource languages and dialects that were published
between around 2017 and January 2023. There are 23 low-resource languages and
dialects covered by this paper, including Bangla, Hindi, Dravidian languages
and others. In the survey, we identify some of the research gaps of previous
studies, which include the lack of reliable definitions of cyberbullying and
its relevant subcategories, and biases in the acquisition and annotation of data.
Based on recognizing those research gaps, we provide some suggestions for
improving the general research conduct in cyberbullying detection, with a
primary focus on low-resource languages. Based on those proposed suggestions,
we collect and release a cyberbullying dataset in the Chittagonian dialect of
Bangla and propose a number of initial ML solutions trained on that dataset. In
addition, the pre-trained transformer-based BanglaBERT model was also
attempted.
Related papers
- The Use of a Large Language Model for Cyberbullying Detection [0.0]
Cyberbullying (CB) is the most prevalent phenomenon in today's cyber world.
It is a severe threat to the mental and physical health of citizens.
This opens the need to develop a robust system to prevent bullying content from online forums, blogs, and social media platforms.
arXiv Detail & Related papers (2024-02-06T15:46:31Z)
- Explain Thyself Bully: Sentiment Aided Cyberbullying Detection with
Explanation [52.3781496277104]
Cyberbullying has become a big issue with the popularity of different social media networks and online communication apps.
Recent laws, such as the "right to explanations" of the General Data Protection Regulation, have spurred research into developing interpretable models.
We develop the first interpretable multi-task model, called mExCB, for automatic cyberbullying detection from code-mixed languages.
arXiv Detail & Related papers (2024-01-17T07:36:22Z)
- Detection of Offensive and Threatening Online Content in a Low Resource
Language [0.0]
Hausa is a major Chadic language, spoken by over 100 million people in Africa.
Online platforms often facilitate social interactions that can lead to the use of offensive and threatening language.
arXiv Detail & Related papers (2023-11-17T14:08:44Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Fine-Tuning Llama 2 Large Language Models for Detecting Online Sexual
Predatory Chats and Abusive Texts [2.406214748890827]
This paper proposes an approach to detection of online sexual predatory chats and abusive language using the open-source pretrained Llama 2 7B-parameter model.
We fine-tune the LLM using datasets with different sizes, imbalance degrees, and languages (i.e., English, Roman Urdu and Urdu).
Experimental results show a strong performance of the proposed approach, which performs proficiently and consistently across three distinct datasets.
arXiv Detail & Related papers (2023-08-28T16:18:50Z)
- Countering Malicious Content Moderation Evasion in Online Social
Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Initial Study into Application of Feature Density and
Linguistically-backed Embedding to Improve Machine Learning-based
Cyberbullying Detection [54.83707803301847]
The research was conducted on a Formspring dataset provided in a Kaggle competition on automatic cyberbullying detection.
The study confirmed the effectiveness of Neural Networks in cyberbullying detection and the correlation between classifier performance and Feature Density.
arXiv Detail & Related papers (2022-06-04T03:17:15Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Abusive and Threatening Language Detection in Urdu using Boosting based
and BERT based models: A Comparative Approach [0.0]
In this paper, we explore several machine learning models for abusive and threatening content detection in Urdu based on the shared task.
Our model came first for both abusive and threatening content detection, with F1 scores of 0.88 and 0.54, respectively.
arXiv Detail & Related papers (2021-11-27T20:03:19Z)
- Cyberbullying Detection Using Deep Neural Network from Social Media
Comments in Bangla Language [0.0]
We propose binary and multiclass classification models using a hybrid neural network for bully expression detection in the Bengali language.
We use 44,001 user comments from popular public Facebook pages, which fall into five classes: Non-bully, Sexual, Threat, Troll and Religious.
Our binary classification model gives 87.91% accuracy, whereas introducing an ensemble technique after the neural network for multiclass classification yields 85% accuracy.
arXiv Detail & Related papers (2021-06-08T16:47:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.