Fine-Tuning Approach for Arabic Offensive Language Detection System:
BERT-Based Model
- URL: http://arxiv.org/abs/2203.03542v1
- Date: Mon, 7 Feb 2022 17:26:35 GMT
- Title: Fine-Tuning Approach for Arabic Offensive Language Detection System:
BERT-Based Model
- Authors: Fatemah Husain and Ozlem Uzuner
- Abstract summary: This study investigates the effects of fine-tuning across several Arabic offensive language datasets.
We develop multiple classifiers that use four datasets individually and in combination to gain knowledge about online Arabic offensive content.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problem of online offensive language limits the health and security of
online users. It is essential to apply the latest state-of-the-art techniques
in developing a system to detect online offensive language and to ensure social
justice to the online communities. Our study investigates the effects of
fine-tuning across several Arabic offensive language datasets. We develop
multiple classifiers that use four datasets individually and in combination in
order to gain knowledge about online Arabic offensive content and to classify
users' comments accordingly. Our results demonstrate the limited effects of
transfer learning on the classifiers' performance, particularly for highly
dialectal comments.
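The dataset-combination strategy described in the abstract can be sketched in outline: heterogeneous per-dataset labels are mapped onto one binary scheme so a single classifier can be fine-tuned on the union. The corpus names, label vocabularies, and helper functions below are illustrative assumptions, not the paper's actual schema.

```python
def unify_label(raw_label: str) -> int:
    """Map heterogeneous per-dataset labels onto a binary scheme:
    1 = offensive, 0 = not offensive. (Hypothetical label vocabulary.)"""
    offensive = {"OFF", "offensive", "hate", "abusive"}
    return 1 if raw_label in offensive else 0

def combine_datasets(*datasets):
    """Concatenate (text, raw_label) pairs from several corpora,
    normalising labels so one classifier can train on the union."""
    combined = []
    for data in datasets:
        for text, raw_label in data:
            combined.append((text, unify_label(raw_label)))
    return combined

# Toy stand-ins for two corpora that use different label vocabularies.
corpus_a = [("comment 1", "OFF"), ("comment 2", "NOT_OFF")]
corpus_b = [("comment 3", "abusive"), ("comment 4", "normal")]

merged = combine_datasets(corpus_a, corpus_b)
```

The merged list can then be fed to any fine-tuning pipeline; training on each corpus individually versus on the union is what lets the authors compare transfer effects.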
Related papers
- QiBERT -- Classifying Online Conversations Messages with BERT as a Feature [0.0]
This paper aims to use data obtained from online social conversations in Portuguese schools to observe behavioural trends.
This project used state-of-the-art (SoA) Machine Learning (ML) algorithms and methods, through BERT-based models, to classify whether utterances are in or out of the debate subject.
arXiv Detail & Related papers (2024-09-09T11:38:06Z)
- Offensive Language Identification in Transliterated and Code-Mixed Bangla [29.30985521838655]
In this paper, we explore offensive language identification in texts with transliterations and code-mixing.
We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments.
We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset.
arXiv Detail & Related papers (2023-11-25T13:27:22Z)
- KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services [5.03606775899383]
"KoMultiText" is a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform.
Our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics.
Our work can provide solutions for real-world hate speech and bias mitigation, contributing directly to the improvement of online community health.
arXiv Detail & Related papers (2023-10-06T15:19:39Z)
- Fine-Tuning Llama 2 Large Language Models for Detecting Online Sexual Predatory Chats and Abusive Texts [2.406214748890827]
This paper proposes an approach to detecting online sexual predatory chats and abusive language using the open-source pretrained Llama 2 7B-parameter model.
We fine-tune the LLM using datasets with different sizes, imbalance degrees, and languages (i.e., English, Roman Urdu, and Urdu).
Experimental results show a strong performance of the proposed approach, which performs proficiently and consistently across three distinct datasets.
arXiv Detail & Related papers (2023-08-28T16:18:50Z)
- Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of content moderation evasion.
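The keyword-twisting behaviour this entry targets can be simulated with a simple substitution table. The "leetspeak" mapping below is a hypothetical illustration of word camouflage and its inversion, not the article's actual method.

```python
# Hypothetical substitution table: letters replaced by look-alike symbols
# to slip past keyword-based moderation filters.
CAMOUFLAGE = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "$"}

def camouflage(word: str) -> str:
    """Simulate evasion: replace common letters with look-alike symbols."""
    return "".join(CAMOUFLAGE.get(ch, ch) for ch in word.lower())

def decamouflage(word: str) -> str:
    """Invert the substitutions so a blocklist match becomes possible again."""
    inverse = {v: k for k, v in CAMOUFLAGE.items()}
    return "".join(inverse.get(ch, ch) for ch in word)

print(camouflage("offensive"))  # -> 0ff3n$1v3
```

Real evasion is far more varied (homoglyphs, spacing, diacritics), which is why the article trains detectors rather than relying on fixed inverse tables like this one.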
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
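Forgoing a static vocabulary, as the token-free Charformer approach does, amounts to operating directly on raw bytes: any language or script maps onto the same 256 byte values. A minimal encoding sketch under that assumption (the reserved PAD/BOS/EOS ids are illustrative, not the model's actual scheme):

```python
# Reserved ids, then raw UTF-8 bytes shifted past them.
PAD, BOS, EOS = 0, 1, 2
OFFSET = 3

def encode(text: str) -> list:
    """Token-free encoding: UTF-8 bytes shifted past the reserved ids.
    Works for any script without a learned vocabulary."""
    return [BOS] + [b + OFFSET for b in text.encode("utf-8")] + [EOS]

def decode(ids: list) -> str:
    """Drop reserved ids, unshift, and decode the bytes back to text."""
    body = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return body.decode("utf-8")

ids = encode("héllo")  # accented characters cost extra bytes, not OOV tokens
```

The flexibility claim in the abstract follows from this design: new languages, misspellings, and camouflaged words never fall out of vocabulary, because there is no vocabulary to fall out of.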
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
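One standard remedy for the label imbalance described here is inverse-frequency class weighting, so the rare hate class contributes more to the loss. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * count(class)); this is an assumption about a typical fix, not necessarily the paper's exact approach.

```python
from collections import Counter

def balanced_weights(labels):
    """Inverse-frequency class weights: rare classes get larger weights,
    following the n / (k * count) 'balanced' heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical 4:1 imbalance: 8 non-hate vs 2 hate examples.
labels = ["non-hate"] * 8 + ["hate"] * 2
weights = balanced_weights(labels)  # hate is upweighted 4x relative to non-hate
```

These weights would then scale the per-example loss during training, counteracting the tendency to predict the majority non-hate class.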
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
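The teacher-to-student transfer at the core of this entry can be illustrated with the generic Hinton-style distillation loss: the student is trained to match the teacher's temperature-softened output distribution. This is a hedged sketch of the general technique, not VidLanKD's exact multi-modal objective.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions: zero when the
    student already matches the teacher, positive otherwise."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

loss = distillation_loss([2.0, 0.5, -1.0], [1.8, 0.6, -0.9])
```

In practice this term is combined with the student's ordinary task loss; the temperature controls how much of the teacher's "dark knowledge" about non-target classes is transferred.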
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Transfer Learning Approach for Arabic Offensive Language Detection System -- BERT-Based Model [0.0]
Cyberhate, online harassment and other misuses of technology are on the rise.
Applying advanced techniques from the Natural Language Processing (NLP) field to support the development of an online hate-free community is a critical task for social justice.
This study aims at investigating the effects of fine-tuning and training the Bidirectional Encoder Representations from Transformers (BERT) model on multiple Arabic offensive language datasets individually.
arXiv Detail & Related papers (2021-02-09T04:58:18Z)
- Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize individual languages and transfers this knowledge to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place.
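Multi-task learning of this kind typically optimizes a weighted sum of per-task losses computed over a shared encoder. The task names and weights below are illustrative assumptions, not the system's actual configuration.

```python
def multitask_loss(task_losses: dict, task_weights: dict) -> float:
    """Joint objective: weighted sum of per-task losses over a shared encoder."""
    return sum(task_weights[t] * loss for t, loss in task_losses.items())

# Hypothetical per-subtask losses and weights for one training step.
losses = {"subtask_a": 0.40, "subtask_b": 0.70, "subtask_c": 0.90}
weights = {"subtask_a": 1.0, "subtask_b": 0.5, "subtask_c": 0.5}
total = multitask_loss(losses, weights)  # 0.40 + 0.35 + 0.45 = 1.20
```

Sharing the BERT encoder across sub-tasks lets gradients from the auxiliary tasks regularize the representation used for the main offensive-language task.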
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.