OSACT4 Shared Task on Offensive Language Detection: Intensive
Preprocessing-Based Approach
- URL: http://arxiv.org/abs/2005.07297v1
- Date: Thu, 14 May 2020 23:46:10 GMT
- Title: OSACT4 Shared Task on Offensive Language Detection: Intensive
Preprocessing-Based Approach
- Authors: Fatemah Husain
- Abstract summary: This study aims at investigating the impact of the preprocessing phase on text classification for Arabic text.
The Arabic language used in social media is informal and written using Arabic dialects, which makes the text classification task very complex.
An intensive preprocessing-based approach demonstrates its significant impact on offensive language detection and hate speech detection.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The preprocessing phase is one of the key phases within the text
classification pipeline. This study aims at investigating the impact of the
preprocessing phase on text classification, specifically on offensive language
and hate speech classification for Arabic text. The Arabic language used in
social media is informal and written using Arabic dialects, which makes the
text classification task very complex. Preprocessing helps in dimensionality
reduction and removing useless content. We apply intensive preprocessing
techniques to the dataset before processing it further and feeding it into the
classification model. An intensive preprocessing-based approach demonstrates
its significant impact on the offensive language detection and hate speech
detection shared tasks of the fourth workshop on Open-Source Arabic Corpora and
Corpora Processing Tools (OSACT). Our team won third place (3rd) in Sub-Task A
(Offensive Language Detection) and first place (1st) in Sub-Task B (Hate Speech
Detection), with F1 scores of 89% and 95%, respectively, achieving
state-of-the-art performance in terms of F1, accuracy, recall, and precision
for Arabic hate speech detection.
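The paper itself includes no code, but a minimal sketch of the kind of intensive preprocessing it describes (noise removal, normalization, dimensionality reduction) might look like the following; every rule below is an illustrative assumption, not the authors' actual pipeline.

```python
import re

# Illustrative rules only; the paper's exact cleanup steps may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670\u0640]")  # harakat, dagger alif, tatweel
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")           # keep Arabic letters and spaces

def normalize_arabic(text: str) -> str:
    """Collapse common orthographic variants to one canonical form."""
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)   # alif variants -> bare alif
    text = text.replace("\u0649", "\u064A")                 # alif maqsura -> ya
    text = text.replace("\u0629", "\u0647")                 # ta marbuta -> ha
    return text

def preprocess(text: str) -> str:
    """One 'intensive' pass: drop URLs, mentions/hashtags, diacritics,
    and non-Arabic symbols, then normalize and squeeze whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    text = normalize_arabic(text)
    text = NON_ARABIC_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Each removal step shrinks the vocabulary the classifier must model, which is the dimensionality-reduction effect the abstract refers to.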
Related papers
- BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla [0.0]
We introduce BanTH, the first multi-label transliterated Bangla hate speech dataset comprising 37.3k samples.
The samples are sourced from YouTube comments, where each instance is labeled with one or more target groups.
Experiments reveal that our further pre-trained encoders achieve state-of-the-art performance on the BanTH dataset.
arXiv Detail & Related papers (2024-10-17T07:15:15Z)
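As a hedged illustration of the multi-label setup the BanTH entry describes, the usual pattern is one sigmoid output per target group trained with a binary cross-entropy loss; the head below is a generic sketch (the hidden size and label count are made up), not the BanTH reference code.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """One independent logit per target group; a generic sketch."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.classifier(pooled)  # raw logits, one per label

# Multi-label training uses BCE-with-logits rather than softmax cross-entropy,
# since a comment can target several groups at once.
head = MultiLabelHead(hidden_size=768, num_labels=5)  # 5 is an assumed count
logits = head(torch.randn(2, 768))
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(2, 5))
```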
- Ensemble of pre-trained language models and data augmentation for hate speech detection from Arabic tweets [0.27309692684728604]
We propose a novel approach that leverages ensemble learning and semi-supervised learning based on previously manually labeled data.
We conducted experiments on a benchmark dataset by classifying Arabic tweets into 5 distinct classes: non-hate, general hate, racial, religious, or sexism.
arXiv Detail & Related papers (2024-07-02T17:26:26Z)
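The ensemble above is summarized only at a high level; a minimal sketch of hard majority voting plus ensemble-based pseudo-labeling, with assumed classifier interfaces rather than the paper's code, could look like this:

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[str], str]  # maps a tweet to one of the five class labels

def majority_vote(models: Sequence[Classifier], tweet: str) -> str:
    """Hard voting: each fine-tuned model casts one vote per tweet."""
    votes = [model(tweet) for model in models]
    return Counter(votes).most_common(1)[0][0]

def pseudo_label(models: Sequence[Classifier],
                 unlabeled: List[str]) -> List[Tuple[str, str]]:
    """Semi-supervised step: let the ensemble label new tweets, which can
    then be added back to the manually labeled training data."""
    return [(tweet, majority_vote(models, tweet)) for tweet in unlabeled]

# Hypothetical usage with three already fine-tuned Arabic language models:
# label = majority_vote([arabert_clf, marbert_clf, camelbert_clf], tweet)
```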
- Mavericks at BLP-2023 Task 1: Ensemble-based Approach Using Language Models for Violence Inciting Text Detection [0.0]
Social media has accelerated the propagation of hate and violence-inciting speech in society.
The problem of detecting violence-inciting texts is further exacerbated in low-resource settings due to sparse research and less data.
This paper presents our work for the Violence Inciting Text Detection shared task in the First Workshop on Bangla Language Processing.
arXiv Detail & Related papers (2023-11-30T18:23:38Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
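The "translate-and-test" pipeline the T3L entry revisits is easy to state: translate the input into a pivot language, then classify it with a monolingual model. A minimal sketch, with both stages assumed rather than taken from the paper:

```python
from typing import Callable

def translate_and_test(translate: Callable[[str], str],
                       classify: Callable[[str], str],
                       text: str) -> str:
    """Two-stage pipeline: machine-translate into the pivot language,
    then run a monolingual classifier. T3L goes further by making the
    stages learnable; this shows only the basic pipeline shape."""
    return classify(translate(text))

# Hypothetical usage:
# label = translate_and_test(ar_to_en_translator, english_hate_clf, tweet)
```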
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
- Supporting Undotted Arabic with Pre-trained Language Models [0.0]
We study the effect of applying pre-trained Arabic language models on "undotted" Arabic texts.
We suggest several ways of supporting undotted texts with pre-trained models, without additional training, and measure their performance on two Arabic natural-language-processing tasks.
arXiv Detail & Related papers (2021-11-18T16:47:56Z)
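To make "undotted" concrete: dotted Arabic letters collapse onto shared letter skeletons. The partial mapping below is an assumed illustration (the paper may define undotting differently), using the dotless beh (U+066E) and dotless qaf (U+066F) skeleton characters.

```python
# Partial dotted -> undotted mapping; illustrative only, not exhaustive.
UNDOT = str.maketrans({
    "\u0628": "\u066E", "\u062A": "\u066E", "\u062B": "\u066E",  # ب ت ث -> ٮ
    "\u0646": "\u066E", "\u064A": "\u0649",                      # ن -> ٮ, ي -> ى
    "\u062C": "\u062D", "\u062E": "\u062D",                      # ج خ -> ح
    "\u0630": "\u062F", "\u0632": "\u0631",                      # ذ -> د, ز -> ر
    "\u0634": "\u0633", "\u0636": "\u0635",                      # ش -> س, ض -> ص
    "\u0638": "\u0637", "\u063A": "\u0639",                      # ظ -> ط, غ -> ع
    "\u0641": "\u066F", "\u0642": "\u066F",                      # ف ق -> ٯ
    "\u0629": "\u0647",                                          # ة -> ه
})

def undot(text: str) -> str:
    """Strip the distinguishing dots, leaving only letter skeletons."""
    return text.translate(UNDOT)
```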
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil each with different data sizes to simulate high, medium, and low resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- Leveraging Multilingual Transformers for Hate Speech Detection [11.306581296760864]
We leverage state-of-the-art Transformer language models to identify hate speech in a multilingual setting.
With a pre-trained multilingual Transformer-based text encoder at the base, we are able to successfully identify and classify hate speech from multiple languages.
arXiv Detail & Related papers (2021-01-08T20:23:50Z)
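The recipe this entry describes is the standard one: a pre-trained multilingual encoder with a classification head, fine-tuned on labeled data. A minimal sketch assuming XLM-RoBERTa via the Hugging Face transformers library (the paper's exact encoder may differ):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# XLM-R is one plausible multilingual encoder; an assumed stand-in here.
name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

batch = tokenizer(["example tweet"], return_tensors="pt",
                  truncation=True, padding=True)
with torch.no_grad():
    # Two logits -> class probabilities; what each label means is
    # decided when the head is fine-tuned on hate-speech data.
    probs = model(**batch).logits.softmax(dim=-1)
```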
- Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves a 91.51% F1 score in English Sub-task A, which is comparable to the first-place system.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
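Multi-task learning with BERT-style models usually means a shared encoder feeding one small head per sub-task, with the per-task losses summed so the encoder learns from every task. A generic sketch under those assumptions (not the Kungfupanda implementation):

```python
import torch
import torch.nn as nn

class MultiTaskClassifier(nn.Module):
    """Shared encoder + per-task heads, the usual MTL pattern."""
    def __init__(self, encoder: nn.Module, hidden: int):
        super().__init__()
        self.encoder = encoder
        self.head_a = nn.Linear(hidden, 2)  # e.g. Sub-task A: offensive or not
        self.head_b = nn.Linear(hidden, 2)  # a second sub-task (assumed binary)

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)  # pooled sentence representation
        return self.head_a(h), self.head_b(h)

# Stand-in encoder so the sketch runs; a BERT encoder would replace it.
enc = nn.Sequential(nn.Linear(768, 768), nn.Tanh())
model = MultiTaskClassifier(enc, hidden=768)
logits_a, logits_b = model(torch.randn(4, 768))
# Training would sum the per-task losses:
# loss = ce(logits_a, y_a) + ce(logits_b, y_b)
```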
This list is automatically generated from the titles and abstracts of the papers on this site.