Detection of Offensive and Threatening Online Content in a Low Resource
Language
- URL: http://arxiv.org/abs/2311.10541v1
- Date: Fri, 17 Nov 2023 14:08:44 GMT
- Title: Detection of Offensive and Threatening Online Content in a Low Resource
Language
- Authors: Fatima Muhammad Adam, Abubakar Yakubu Zandam, Isa Inuwa-Dutse
- Abstract summary: Hausa is a major Chadic language, spoken by over 100 million people in Africa.
Online platforms often facilitate social interactions that can lead to the use of offensive and threatening language.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hausa is a major Chadic language, spoken by over 100 million people in
Africa. However, from a computational linguistic perspective, it is considered
a low-resource language, with limited resources to support Natural Language
Processing (NLP) tasks. Online platforms often facilitate social interactions
that can lead to the use of offensive and threatening language, which can go
undetected due to the lack of detection systems designed for Hausa. This study
aimed to address this issue by (1) conducting two user studies (n=308) to
investigate cyberbullying-related issues, (2) collecting and annotating the
first set of offensive and threatening datasets to support relevant downstream
tasks in Hausa, (3) developing a detection system to flag offensive and
threatening content, and (4) evaluating the detection system and the efficacy
of the Google-based translation engine in detecting offensive and threatening
terms in Hausa. We found that offensive and threatening content is quite
common, particularly when discussing religion and politics. Our detection
system was able to detect more than 70% of offensive and threatening content,
although many of these were mistranslated by Google's translation engine. We
attribute this to the subtle relationship between offensive and threatening
content and idiomatic expressions in the Hausa language. We recommend that
diverse stakeholders participate in understanding local conventions and
demographics in order to develop a more effective detection system. These
insights are essential for implementing targeted moderation strategies to
create a safe and inclusive online environment.
Related papers
- Backdoor Attack on Multilingual Machine Translation [53.28390057407576]
Multilingual machine translation (MNMT) systems have security vulnerabilities.
An attacker injects poisoned data into a low-resource language pair to cause malicious translations in other languages.
This type of attack is of particular concern, given the larger attack surface of languages inherent to low-resource settings.
arXiv Detail & Related papers (2024-04-03T01:32:31Z)
- Cyberbullying Detection for Low-resource Languages and Dialects: Review of the State of the Art [0.9831489366502298]
There are 23 low-resource languages and dialects covered by this paper, including Bangla, Hindi, Dravidian languages and others.
In the survey, we identify some of the research gaps of previous studies, which include the lack of reliable definitions of cyberbullying.
Based on those proposed suggestions, we collect and release a cyberbullying dataset in the Chittagonian dialect of Bangla.
arXiv Detail & Related papers (2023-08-30T03:52:28Z)
- Fine-Tuning Llama 2 Large Language Models for Detecting Online Sexual Predatory Chats and Abusive Texts [2.406214748890827]
This paper proposes an approach to detecting online sexual predatory chats and abusive language using the open-source pretrained Llama 2 7B-parameter model.
We fine-tune the LLM using datasets with different sizes, imbalance degrees, and languages (i.e., English, Roman Urdu and Urdu).
Experimental results show a strong performance of the proposed approach, which performs proficiently and consistently across three distinct datasets.
arXiv Detail & Related papers (2023-08-28T16:18:50Z)
- Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the m-BERT-based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study the output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Abusive and Threatening Language Detection in Urdu using Boosting based and BERT based models: A Comparative Approach [0.0]
In this paper, we explore several machine learning models for abusive and threatening content detection in Urdu based on the shared task.
Our model came first for both abusive and threatening content detection, with F1 scores of 0.88 and 0.54, respectively.
arXiv Detail & Related papers (2021-11-27T20:03:19Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves a 91.51% F1 score in English Sub-task A, which is comparable to the first-place result.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
- Offensive Language Detection: A Comparative Analysis [2.5739449801033842]
We explore the effectiveness of Google sentence encoder, Fasttext, Dynamic mode decomposition (DMD) based features and Random kitchen sink (RKS) method for offensive language detection.
From the experiments and evaluation, we observed that RKS with Fasttext achieved competitive results.
arXiv Detail & Related papers (2020-01-09T17:48:44Z)