Emojis as Anchors to Detect Arabic Offensive Language and Hate Speech
- URL: http://arxiv.org/abs/2201.06723v1
- Date: Tue, 18 Jan 2022 03:56:57 GMT
- Title: Emojis as Anchors to Detect Arabic Offensive Language and Hate Speech
- Authors: Hamdy Mubarak, Sabit Hassan, Shammur Absar Chowdhury
- Abstract summary: We introduce a generic, language-independent method to collect a large percentage of offensive and hate tweets.
We harness the extralinguistic information embedded in the emojis to collect a large number of offensive tweets.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a generic, language-independent method to collect a large
percentage of offensive and hate tweets regardless of their topics or genres.
We harness the extralinguistic information embedded in the emojis to collect a
large number of offensive tweets. We apply the proposed method on Arabic tweets
and compare it with English tweets -- analyzing some cultural differences. We
observed consistent usage of these emojis to represent offensiveness
throughout different timelines on Twitter. We manually annotate and publicly
release the largest Arabic dataset for offensive, fine-grained hate speech,
vulgar and violence content. Furthermore, we benchmark the dataset for
detecting offense and hate speech using different transformer architectures and
perform in-depth linguistic analysis. We evaluate our models on external
datasets -- a Twitter dataset collected using a completely different method,
and a multi-platform dataset containing comments from Twitter, YouTube and
Facebook, for assessing generalization capability. Competitive results on these
datasets suggest that the data collected using our method captures universal
characteristics of offensive language. Our findings also highlight the common
words used in offensive communications; common targets for hate speech; and
specific patterns in violent tweets. They also pinpoint common classification
errors caused by the need to understand context, culture, and background, and
by the presence of sarcasm, among others.
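The paper's collection method can be sketched as a simple filter over a stream of tweets, assuming a manually curated seed set of emojis that anchor offensive usage. A minimal illustration in Python (the emoji set, tweets, and function names below are hypothetical examples, not the paper's actual seed list or pipeline):

```python
# Hypothetical seed set of anchor emojis associated with offensive usage.
# The paper curates such a set empirically; these are placeholders.
ANCHOR_EMOJIS = {"\U0001F595", "\U0001F92C", "\U0001F4A2"}  # 🖕 🤬 💢

def contains_anchor_emoji(text: str, anchors=frozenset(ANCHOR_EMOJIS)) -> bool:
    """Return True if the text contains at least one anchor emoji."""
    return any(emoji in text for emoji in anchors)

def collect_candidates(tweets, anchors=frozenset(ANCHOR_EMOJIS)):
    """Keep only tweets containing an anchor emoji; these become
    candidates for manual offensiveness annotation."""
    return [t for t in tweets if contains_anchor_emoji(t, anchors)]

sample = ["you are awful \U0001F92C", "nice weather today", "go away \U0001F595"]
print(collect_candidates(sample))
# → ['you are awful 🤬', 'go away 🖕']
```

Because the filter keys only on emojis rather than lexical content, the same sketch applies regardless of language, topic, or genre, which is the language-independence the abstract claims; the collected candidates still require manual annotation, since anchor emojis also appear in non-offensive (e.g. sarcastic) tweets.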
Related papers
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents using Not Safe For Work (NSFW) scores computed from images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- KOLD: Korean Offensive Language Dataset [11.699797031874233]
We present the Korean Offensive Language Dataset (KOLD), comprising 40k comments labeled with offensiveness, target, and targeted-group information.
We show that title information serves as context and helps discern the target of hatred, especially when the target is omitted from the comment.
arXiv Detail & Related papers (2022-05-23T13:58:45Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Automatic Expansion and Retargeting of Arabic Offensive Language Training [12.111859709582617]
We employ two key insights, namely that replies on Twitter often imply opposition and some accounts are persistent in their offensiveness towards specific targets.
We show the efficacy of the approach on Arabic tweets, with 13% and 79% relative F1-measure improvements in entity-specific offensive language detection.
arXiv Detail & Related papers (2021-11-18T08:25:09Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Leveraging Multilingual Transformers for Hate Speech Detection [11.306581296760864]
We leverage state-of-the-art Transformer language models to identify hate speech in a multilingual setting.
With a pre-trained multilingual Transformer-based text encoder at the base, we are able to successfully identify and classify hate speech from multiple languages.
arXiv Detail & Related papers (2021-01-08T20:23:50Z)
- Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation [77.69102711230248]
We propose the use of a multilingual transformer model, that we pre-train over English tweets and apply data-augmentation using automatic translation to adapt the model to non-English languages.
Our experiments in French, Spanish, German and Italian suggest that the proposed technique is an efficient way to improve the results of the transformers over small corpora of tweets in a non-English language.
arXiv Detail & Related papers (2020-10-07T15:44:55Z)
- Trawling for Trolling: A Dataset [56.1778095945542]
We present a dataset that models trolling as a subcategory of offensive content.
The dataset has 12,490 samples, split across five classes: Normal, Profanity, Trolling, Derogatory, and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z)
- Intersectional Bias in Hate Speech and Abusive Language Datasets [0.3149883354098941]
African American tweets were up to 3.7 times more likely to be labeled as abusive.
African American male tweets were up to 77% more likely to be labeled as hateful.
This study provides the first systematic evidence on intersectional bias in datasets of hate speech and abusive language.
arXiv Detail & Related papers (2020-05-12T16:58:48Z)
- Arabic Offensive Language on Twitter: Analysis and Experiments [9.879488163141813]
We introduce a method for building a dataset that is not biased by topic, dialect, or target.
We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech.
arXiv Detail & Related papers (2020-04-05T13:05:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.