SOLD: Sinhala Offensive Language Dataset
- URL: http://arxiv.org/abs/2212.00851v2
- Date: Thu, 28 Mar 2024 09:25:21 GMT
- Title: SOLD: Sinhala Offensive Language Dataset
- Authors: Tharindu Ranasinghe, Isuri Anuradha, Damith Premasiri, Kanishka Silva, Hansi Hettiarachchi, Lasitha Uyangodage, Marcos Zampieri
- Abstract summary: This paper tackles offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka.
SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level.
We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
- Score: 11.63228876521012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
Related papers
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Offensive Language Identification in Transliterated and Code-Mixed Bangla [29.30985521838655]
In this paper, we explore offensive language identification in texts with transliterations and code-mixing.
We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments.
We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset.
arXiv Detail & Related papers (2023-11-25T13:27:22Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, with a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions arising from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredients include a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- Predicting the Type and Target of Offensive Social Media Posts in Marathi [12.454406287184064]
We introduce the Marathi Offensive Language dataset v.2.0 or MOLD 2.0.
MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi.
We also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID.
arXiv Detail & Related papers (2022-11-22T20:36:44Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi [2.4737119633827174]
MOLD is the first dataset of its kind compiled for Marathi, opening a new domain for research in low-resource Indo-Aryan languages.
We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers.
arXiv Detail & Related papers (2021-09-08T11:29:44Z)
- SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification [34.57343857418401]
Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification.
In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner.
We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models.
arXiv Detail & Related papers (2020-04-29T20:02:58Z)
- Offensive Language Identification in Greek [17.38318315623124]
This paper presents the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet dataset (OGTD).
OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive.
Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data.
arXiv Detail & Related papers (2020-03-16T22:47:27Z)