Offensive Language Identification in Greek
- URL: http://arxiv.org/abs/2003.07459v2
- Date: Wed, 18 Mar 2020 17:26:20 GMT
- Title: Offensive Language Identification in Greek
- Authors: Zeses Pitenis, Marcos Zampieri, Tharindu Ranasinghe
- Abstract summary: This paper presents the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet Dataset (OGTD).
OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive.
Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data.
- Score: 17.38318315623124
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As offensive language has become a rising issue for online communities and
social media platforms, researchers have been investigating ways of coping with
abusive content and developing systems to detect its different types:
cyberbullying, hate speech, aggression, etc. With a few notable exceptions,
most research on this topic so far has dealt with English. This is mostly due
to the availability of language resources for English. To address this
shortcoming, this paper presents the first Greek annotated dataset for
offensive language identification: the Offensive Greek Tweet Dataset (OGTD).
OGTD is a manually annotated dataset containing 4,779 posts from Twitter
annotated as offensive and not offensive. Along with a detailed description of
the dataset, we evaluate several computational models trained and tested on
this data.
Related papers
- OffensiveLang: A Community Based Implicit Offensive Language Dataset [5.813922783967869]
Hate speech or offensive languages exist in both explicit and implicit forms.
OffensiveLang is a community based implicit offensive language dataset.
We present a prompt-based approach that effectively generates implicit offensive languages.
arXiv Detail & Related papers (2024-03-04T20:34:58Z)
- Explain Thyself Bully: Sentiment Aided Cyberbullying Detection with Explanation [52.3781496277104]
Cyberbullying has become a big issue with the popularity of different social media networks and online communication apps.
Recent legal provisions such as the "right to explanation" under the General Data Protection Regulation have spurred research on interpretable models.
We develop the first interpretable multi-task model, called mExCB, for automatic cyberbullying detection from code-mixed languages.
arXiv Detail & Related papers (2024-01-17T07:36:22Z)
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- SOLD: Sinhala Offensive Language Dataset [11.63228876521012]
This paper tackles offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka.
SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level.
We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
arXiv Detail & Related papers (2022-12-01T20:18:21Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available, with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Trawling for Trolling: A Dataset [56.1778095945542]
We present a dataset that models trolling as a subcategory of offensive content.
The dataset has 12,490 samples, split across 5 classes: Normal, Profanity, Trolling, Derogatory and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z)
- SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification [34.57343857418401]
Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification.
In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner.
We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models.
arXiv Detail & Related papers (2020-04-29T20:02:58Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves a 91.51% F1 score in English Sub-task A, which is comparable to the first-place result.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
- Arabic Offensive Language on Twitter: Analysis and Experiments [9.879488163141813]
We introduce a method for building a dataset that is not biased by topic, dialect, or target.
We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech.
arXiv Detail & Related papers (2020-04-05T13:05:11Z)
- Offensive Language Detection: A Comparative Analysis [2.5739449801033842]
We explore the effectiveness of Google sentence encoder, fastText, Dynamic mode decomposition (DMD) based features and the Random kitchen sink (RKS) method for offensive language detection.
From the experiments and evaluation, we observed that RKS with fastText achieved competitive results.
arXiv Detail & Related papers (2020-01-09T17:48:44Z)