Related papers: Detection of Offensive and Threatening Online Content in a Low Resource Language

Detection of Offensive and Threatening Online Content in a Low Resource Language

URL: http://arxiv.org/abs/2311.10541v1
Date: Fri, 17 Nov 2023 14:08:44 GMT
Title: Detection of Offensive and Threatening Online Content in a Low Resource Language
Authors: Fatima Muhammad Adam, Abubakar Yakubu Zandam, Isa Inuwa-Dutse
Abstract summary: Hausa is a major Chadic language, spoken by over 100 million people in Africa. Online platforms often facilitate social interactions that can lead to the use of offensive and threatening language.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hausa is a major Chadic language, spoken by over 100 million people in Africa. However, from a computational linguistic perspective, it is considered a low-resource language, with limited resources to support Natural Language Processing (NLP) tasks. Online platforms often facilitate social interactions that can lead to the use of offensive and threatening language, which can go undetected due to the lack of detection systems designed for Hausa. This study aimed to address this issue by (1) conducting two user studies (n=308) to investigate cyberbullying-related issues, (2) collecting and annotating the first set of offensive and threatening datasets to support relevant downstream tasks in Hausa, (3) developing a detection system to flag offensive and threatening content, and (4) evaluating the detection system and the efficacy of the Google-based translation engine in detecting offensive and threatening terms in Hausa. We found that offensive and threatening content is quite common, particularly when discussing religion and politics. Our detection system was able to detect more than 70% of offensive and threatening content, although many of these were mistranslated by Google's translation engine. We attribute this to the subtle relationship between offensive and threatening content and idiomatic expressions in the Hausa language. We recommend that diverse stakeholders participate in understanding local conventions and demographics in order to develop a more effective detection system. These insights are essential for implementing targeted moderation strategies to create a safe and inclusive online environment.

Related papers

Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects [0.6554326244334868]
This review highlights the scarcity of annotated corpora, limited availability of pre-trained language models, and linguistic challenges like tonal complexity and diacritic dependency as significant obstacles. The findings reveal a growing body of multilingual and monolingual resources, even though the field is constrained by socio-cultural factors such as code-switching and desertion of language for digital usage.
arXiv Detail & Related papers (2025-02-24T17:41:48Z)
Exploring transfer learning for Deep NLP systems on rarely annotated languages [0.0]
This thesis investigates the application of transfer learning for Part-of-Speech (POS) tagging between Hindi and Nepali. We assess whether multitask learning in Hindi, with auxiliary tasks such as gender and singular/plural tagging, can contribute to improved POS tagging accuracy.
arXiv Detail & Related papers (2024-10-15T13:33:54Z)
A multilingual dataset for offensive language and hate speech detection for hausa, yoruba and igbo languages [0.0]
This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo. We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers. We used pre-trained language models to evaluate their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%.
arXiv Detail & Related papers (2024-06-04T09:58:29Z)
Backdoor Attack on Multilingual Machine Translation [53.28390057407576]
multilingual machine translation (MNMT) systems have security vulnerabilities. An attacker injects poisoned data into a low-resource language pair to cause malicious translations in other languages. This type of attack is of particular concern, given the larger attack surface of languages inherent to low-resource settings.
arXiv Detail & Related papers (2024-04-03T01:32:31Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Cyberbullying Detection for Low-resource Languages and Dialects: Review of the State of the Art [0.9831489366502298]
There are 23 low-resource languages and dialects covered by this paper, including Bangla, Hindi, Dravidian languages and others. In the survey, we identify some of the research gaps of previous studies, which include the lack of reliable definitions of cyberbullying. Based on those proposed suggestions, we collect and release a cyberbullying dataset in the Chittagonian dialect of Bangla.
arXiv Detail & Related papers (2023-08-30T03:52:28Z)
Fine-Tuning Llama 2 Large Language Models for Detecting Online Sexual Predatory Chats and Abusive Texts [2.406214748890827]
This paper proposes an approach to detection of online sexual predatory chats and abusive language using the open-source pretrained Llama 2 7B- parameter model. We fine-tune the LLM using datasets with different sizes, imbalance degrees, and languages (i.e., English, Roman Urdu and Urdu) Experimental results show a strong performance of the proposed approach, which performs proficiently and consistently across three distinct datasets.
arXiv Detail & Related papers (2023-08-28T16:18:50Z)
Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems. This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language. We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening. For both subtasks, m-Bert based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw. At the heart of the approach is a single multilingual token-free Charformer model. We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences. We also propose textscCOLDetector to study output offensiveness of popular Chinese language models. Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
Abusive and Threatening Language Detection in Urdu using Boosting based and BERT based models: A Comparative Approach [0.0]
In this paper, we explore several machine learning models for abusive and threatening content detection in Urdu based on the shared task. Our model came First for both abusive and threatening content detection with an F1scoreof 0.88 and 0.54, respectively.
arXiv Detail & Related papers (2021-11-27T20:03:19Z)
Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages. We infer this distribution from a sample of typologically diverse training languages. We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models. Our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
Offensive Language Detection: A Comparative Analysis [2.5739449801033842]
We explore the effectiveness of Google sentence encoder, Fasttext, Dynamic mode decomposition (DMD) based features and Random kitchen sink (RKS) method for offensive language detection. From the experiments and evaluation we observed that RKS with fastetxt achieved competing results.
arXiv Detail & Related papers (2020-01-09T17:48:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.