Detecting Abusive Albanian
- URL: http://arxiv.org/abs/2107.13592v1
- Date: Wed, 28 Jul 2021 18:47:32 GMT
- Title: Detecting Abusive Albanian
- Authors: Erida Nurce, Jorgel Keci, Leon Derczynski
- Abstract summary: Shaj is an annotated Albanian dataset for hate speech and offensive speech, constructed from user-generated content on various social media platforms.
The dataset is tested using three different classification models, the best of which achieves an F1 score of 0.77 for the identification of offensive language.
- Score: 5.092028049119383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ever growing usage of social media in the recent years has had a direct
impact on the increased presence of hate speech and offensive speech in online
platforms. Research on effective detection of such content has mainly focused
on English and a few other widespread languages, while the leftover majority
fail to have the same work put into them and thus cannot benefit from the
steady advancements made in the field. In this paper we present Shaj,
an annotated Albanian dataset for hate speech and offensive speech that has
been constructed from user-generated content on various social media platforms.
Its annotation follows the hierarchical schema introduced in OffensEval. The
dataset is tested using three different classification models, the best of
which achieves an F1 score of 0.77 for the identification of offensive
language, 0.64 F1 score for the automatic categorization of offensive types and
lastly, 0.52 F1 score for the offensive language target identification.
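As a rough illustration of the hierarchical annotation and the F1 metric mentioned in the abstract, the sketch below models a three-level label (offensive or not, targeted or not, target type) and computes per-class and macro-averaged F1. The label strings follow the OLID/OffensEval convention; whether Shaj uses these exact strings, and whether the reported scores are macro-averaged, are assumptions, and all names here (`Annotation`, `f1_score`, `macro_f1`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Label strings follow the OLID/OffensEval convention (assumed, not confirmed for Shaj).
LEVEL_A = ("NOT", "OFF")         # level A: not offensive / offensive
LEVEL_B = ("TIN", "UNT")         # level B: targeted insult or threat / untargeted
LEVEL_C = ("IND", "GRP", "OTH")  # level C: individual / group / other target


@dataclass(frozen=True)
class Annotation:
    """One comment labelled under the three-level hierarchy."""
    a: str
    b: Optional[str] = None  # only set when a == "OFF"
    c: Optional[str] = None  # only set when b == "TIN"

    def is_valid(self) -> bool:
        """Check that lower levels are filled in only where the hierarchy allows."""
        if self.a not in LEVEL_A:
            return False
        if self.a == "NOT":
            return self.b is None and self.c is None
        if self.b not in LEVEL_B:
            return False
        if self.b == "UNT":
            return self.c is None
        return self.c in LEVEL_C


def f1_score(gold: Sequence[str], pred: Sequence[str], positive: str) -> float:
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores (macro averaging is an assumption)."""
    return sum(f1_score(gold, pred, label) for label in labels) / len(labels)


if __name__ == "__main__":
    gold = ["OFF", "OFF", "NOT", "NOT"]  # toy data, not taken from the dataset
    pred = ["OFF", "NOT", "NOT", "NOT"]
    print(f1_score(gold, pred, "OFF"))
    print(macro_f1(gold, pred, LEVEL_A))
```

Each level would be scored separately on the subset of comments that reach it, which is one reason scores tend to fall at deeper levels: fewer examples remain and there are more classes to separate.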
Related papers
- Unveiling Social Media Comments with a Novel Named Entity Recognition System for Identity Groups [2.5849042763002426]
We develop a Named Entity Recognition (NER) System for Identity Groups.
Our tool not only detects whether a sentence contains an attack but also tags the sentence tokens corresponding to the mentioned group.
We tested the utility of our tool in a case study on social media, annotating and comparing comments from Facebook related to news mentioning identity groups.
arXiv Detail & Related papers (2024-05-13T19:33:18Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Ceasing hate with MoH: Hate Speech Detection in Hindi-English Code-Switched Language [2.9926023796813728]
This work focuses on analyzing hate speech in Hindi-English code-switched language.
To contain the structure of data, we developed MoH or Map Only Hindi, which means "Love" in Hindi.
The MoH pipeline includes language identification and Roman-to-Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
arXiv Detail & Related papers (2021-10-18T15:24:32Z)
- One to rule them all: Towards Joint Indic Language Hate Speech Detection [7.296361860015606]
We present a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection.
On the provided testing corpora, we achieve Macro F1 scores of 0.7996, 0.7748, and 0.8651 for sub-task 1A, and 0.6268 and 0.5603 for the fine-grained classification of sub-task 1B.
arXiv Detail & Related papers (2021-09-28T13:30:00Z)
- Ruddit: Norms of Offensiveness for English Reddit Comments [35.83156813452207]
We create the first dataset of English language Reddit comments that has fine-grained, real-valued scores between -1 and 1.
We show that the method produces highly reliable offensiveness scores.
We evaluate the ability of widely-used neural models to predict offensiveness scores on this new dataset.
arXiv Detail & Related papers (2021-06-10T11:27:47Z)
- Leveraging Multilingual Transformers for Hate Speech Detection [11.306581296760864]
We leverage state of the art Transformer language models to identify hate speech in a multilingual setting.
With a pre-trained multilingual Transformer-based text encoder at the base, we are able to successfully identify and classify hate speech from multiple languages.
arXiv Detail & Related papers (2021-01-08T20:23:50Z)
- NLP-CIC at SemEval-2020 Task 9: Analysing sentiment in code-switching language using a simple deep-learning classifier [63.137661897716555]
Code-switching is a phenomenon in which two or more languages are used in the same message.
We use a standard convolutional neural network model to predict the sentiment of tweets in a blend of Spanish and English languages.
arXiv Detail & Related papers (2020-09-07T19:57:09Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.