SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language
Identification
- URL: http://arxiv.org/abs/2004.14454v2
- Date: Fri, 24 Sep 2021 16:36:35 GMT
- Title: SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language
Identification
- Authors: Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Marcos Zampieri,
Preslav Nakov
- Abstract summary: Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification.
In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner.
We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread use of offensive content in social media has led to an
abundance of research in detecting language such as hate speech, cyberbullying,
and cyber-aggression. Recent work presented the OLID dataset, which follows a
taxonomy for offensive language identification that provides meaningful
information for understanding the type and the target of offensive messages.
However, it is limited in size and it might be biased towards offensive
language as it was collected using keywords. In this work, we present SOLID, an
expanded dataset, where the tweets were collected in a more principled manner.
SOLID contains over nine million English tweets labeled in a semi-supervised
fashion. We demonstrate that using SOLID along with OLID yields sizable
performance gains on the OLID test set for two different models, especially for
the lower levels of the taxonomy.
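The abstract's "labeled in a semi-supervised fashion" can be sketched roughly as follows: several supervised models score each unlabeled tweet, and the mean and standard deviation of their confidences serve as an aggregate soft label. This is a minimal illustration of that idea, not the authors' actual code; the function name, model scores, and aggregation rule here are assumptions for demonstration.

```python
# Minimal sketch of ensemble-based semi-supervised labeling (illustrative
# only; not the SOLID authors' implementation). Each unlabeled tweet is
# scored by several trained models, and the mean/std of their confidences
# form a soft label for it.
from statistics import mean, stdev

def aggregate_soft_label(confidences):
    """Combine per-model P(offensive) scores into a soft label dict."""
    return {"mean": mean(confidences), "std": stdev(confidences)}

# Example: four hypothetical models score one tweet for offensiveness.
scores = [0.91, 0.85, 0.88, 0.95]
label = aggregate_soft_label(scores)
assert 0.0 <= label["mean"] <= 1.0
```

A downstream pipeline would then keep instances whose mean confidence clears some threshold, or train directly on the soft labels; the standard deviation indicates how much the models disagree on a given tweet.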
Related papers
- Offensive Language Identification in Transliterated and Code-Mixed
Bangla
In this paper, we explore offensive language identification in texts with transliterations and code-mixing.
We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments.
We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset.
arXiv Detail & Related papers (2023-11-25T13:27:22Z)
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have
In this work we leverage datasets we already have, covering a wide range of tasks related to abusive language detection.
Our goal is to build models cheaply for a new target label set and/or language, using only a few training examples of the target domain.
Our experiments show that using existing datasets and only a few shots of the target task, model performance improves both monolingually and across languages.
arXiv Detail & Related papers (2023-05-23T14:04:12Z)
- SOLD: Sinhala Offensive Language Dataset
This paper tackles offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka.
SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level.
We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
arXiv Detail & Related papers (2022-12-01T20:18:21Z)
- COLD: A Benchmark for Chinese Offensive Language Detection
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study the output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- FBERT: A Neural Transformer for Identifying Offensive Content
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available, with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Trawling for Trolling: A Dataset
We present a dataset that models trolling as a subcategory of offensive content.
The dataset has 12,490 samples, split across five classes: Normal, Profanity, Trolling, Derogatory, and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z)
- Offensive Language Identification in Greek
This paper presents the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet dataset (OGTD).
OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive.
Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data.
arXiv Detail & Related papers (2020-03-16T22:47:27Z)
- Offensive Language Detection: A Comparative Analysis
We explore the effectiveness of the Google Universal Sentence Encoder, fastText, Dynamic Mode Decomposition (DMD)-based features, and the Random Kitchen Sink (RKS) method for offensive language detection.
From the experiments and evaluation, we observed that RKS with fastText achieved competitive results.
arXiv Detail & Related papers (2020-01-09T17:48:44Z)