Arabic Offensive Language on Twitter: Analysis and Experiments
- URL: http://arxiv.org/abs/2004.02192v3
- Date: Tue, 9 Mar 2021 20:22:18 GMT
- Title: Arabic Offensive Language on Twitter: Analysis and Experiments
- Authors: Hamdy Mubarak, Ammar Rashed, Kareem Darwish, Younes Samih, Ahmed
Abdelali
- Abstract summary: We introduce a method for building a dataset that is not biased by topic, dialect, or target.
We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech.
- Score: 9.879488163141813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting offensive language on Twitter has many applications ranging from
detecting/predicting bullying to measuring polarization. In this paper, we
focus on building a large Arabic offensive tweet dataset. We introduce a method
for building a dataset that is not biased by topic, dialect, or target. We
produce the largest Arabic dataset to date with special tags for vulgarity and
hate speech. We thoroughly analyze the dataset to determine which topics,
dialects, and gender are most associated with offensive tweets and how Arabic
speakers use offensive language. Lastly, we conduct many experiments to produce
strong results (F1 = 83.2) on the dataset using SOTA techniques.
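For readers unfamiliar with the metric, the F1 score quoted above is the harmonic mean of precision and recall on the offensive class. The sketch below is a minimal reference implementation (the label names `"OFF"`/`"NOT"` are illustrative assumptions, not the paper's exact tag set):

```python
def f1_score(y_true, y_pred, positive="OFF"):
    """F1 for the positive (offensive) class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy gold labels and predictions (hypothetical, for illustration only)
gold = ["OFF", "OFF", "NOT", "NOT", "OFF"]
pred = ["OFF", "NOT", "NOT", "OFF", "OFF"]
score = f1_score(gold, pred)  # 2/3 here; the paper reports 0.832 on its dataset
```

A reported F1 of 83.2 thus means the model balances precision and recall well on the offensive class, which is the harder target given how rare offensive tweets are in unfiltered samples.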
Related papers
- A multilingual dataset for offensive language and hate speech detection for hausa, yoruba and igbo languages [0.0]
This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo.
We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers.
We evaluated the efficacy of pre-trained language models in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%.

arXiv Detail & Related papers (2024-06-04T09:58:29Z) - How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have [58.23138483086277]
In this work we leverage datasets we already have, covering a wide range of tasks related to abusive language detection.
Our goal is to build models cheaply for a new target label set and/or language, using only a few training examples of the target domain.
Our experiments show that, using existing datasets and only a few shots of the target task, model performance improves both monolingually and across languages.
arXiv Detail & Related papers (2023-05-23T14:04:12Z) - KOLD: Korean Offensive Language Dataset [11.699797031874233]
We present KOLD, a Korean offensive language dataset of 40k comments labeled with offensiveness, target, and targeted-group information.
We show that title information serves as context and helps discern the target of hatred, especially when the target is omitted from the comment itself.
arXiv Detail & Related papers (2022-05-23T13:58:45Z) - NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual
Sentiment Analysis [5.048355865260207]
We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria.
The dataset consists of around 30,000 annotated tweets per language.
We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
arXiv Detail & Related papers (2022-01-20T16:28:06Z) - Emojis as Anchors to Detect Arabic Offensive Language and Hate Speech [6.1875341699258595]
We introduce a generic, language-independent method to collect a large percentage of offensive and hate tweets.
We harness the extralinguistic information embedded in the emojis to collect a large number of offensive tweets.
arXiv Detail & Related papers (2022-01-18T03:56:57Z) - COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study the offensiveness of outputs from popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
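A standard remedy for the label imbalance noted above is inverse-frequency class weighting, which scales the loss so that the rare hate class counts more per example. The formula below is the common "balanced" heuristic (as used, for instance, by scikit-learn), offered as a general illustration rather than the paper's specific method:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: weight(c) = N / (K * count(c)),
    where N is the number of examples and K the number of classes.
    Rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# Hypothetical 90/10 split mirroring a typical hate speech dataset
labels = ["non-hate"] * 90 + ["hate"] * 10
weights = class_weights(labels)  # hate examples weighted 9x non-hate examples
```

With these weights multiplied into the training loss, misclassifying a hate example costs as much as misclassifying nine non-hate examples, counteracting the skew.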
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Automatic Expansion and Retargeting of Arabic Offensive Language
Training [12.111859709582617]
We employ two key insights, namely that replies on Twitter often imply opposition and some accounts are persistent in their offensiveness towards specific targets.
We show the efficacy of the approach on Arabic tweets, with 13% and 79% relative F1-measure improvements in entity-specific offensive language detection.
arXiv Detail & Related papers (2021-11-18T08:25:09Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models' ability to distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - Improving Sentiment Analysis over non-English Tweets using Multilingual
Transformers and Automatic Translation for Data-Augmentation [77.69102711230248]
We propose the use of a multilingual transformer model, that we pre-train over English tweets and apply data-augmentation using automatic translation to adapt the model to non-English languages.
Our experiments in French, Spanish, German and Italian suggest that the proposed technique is an efficient way to improve the results of the transformers over small corpora of tweets in a non-English language.
arXiv Detail & Related papers (2020-10-07T15:44:55Z) - Trawling for Trolling: A Dataset [56.1778095945542]
We present a dataset that models trolling as a subcategory of offensive content.
The dataset has 12,490 samples, split across 5 classes: Normal, Profanity, Trolling, Derogatory, and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.