Detecting Abusive Albanian
- URL: http://arxiv.org/abs/2107.13592v1
- Date: Wed, 28 Jul 2021 18:47:32 GMT
- Title: Detecting Abusive Albanian
- Authors: Erida Nurce, Jorgel Keci, Leon Derczynski
- Abstract summary: Shaj is an annotated dataset for hate speech and offensive speech constructed from user-generated content on various social media platforms.
The dataset is tested using three different classification models, the best of which achieves an F1 score of 0.77 for the identification of offensive language.
- Score: 5.092028049119383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ever-growing usage of social media in recent years has had a direct
impact on the increased presence of hate speech and offensive speech in online
platforms. Research on effective detection of such content has mainly focused
on English and a few other widespread languages, while the leftover majority
fail to have the same work put into them and thus cannot benefit from the
steady advancements made in the field. In this paper we present Shaj,
an annotated Albanian dataset for hate speech and offensive speech that has
been constructed from user-generated content on various social media platforms.
Its annotation follows the hierarchical schema introduced in OffensEval. The
dataset is tested using three different classification models, the best of
which achieves an F1 score of 0.77 for the identification of offensive
language, 0.64 F1 score for the automatic categorization of offensive types and
lastly, 0.52 F1 score for the offensive language target identification.
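As a rough illustration of the metric quoted above, the sketch below computes per-class and macro-averaged F1 from gold and predicted labels. The abstract does not say which F1 averaging is reported; macro averaging is assumed here because it is the convention in OffensEval. The `OFF`/`NOT` labels follow the OffensEval naming, the function names are hypothetical, and the toy data is invented purely for the example (it is not from the Shaj dataset).

```python
def f1_per_class(gold, pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen (an assumed averaging)."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_per_class(gold, pred, label) for label in labels) / len(labels)


# Invented toy labels for offensive language identification (not real data).
gold = ["OFF", "OFF", "NOT", "NOT", "NOT", "OFF"]
pred = ["OFF", "NOT", "NOT", "NOT", "OFF", "OFF"]

print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Macro averaging weights every class equally, which matters for offensive language data where the non-offensive class usually dominates.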
Related papers
- Hate Speech Detection Using Cross-Platform Social Media Data In English and German Language [6.200058263544999]
This study focuses on detecting bilingual hate speech in YouTube comments.
We include factors such as content similarity, definition similarity, and common hate words to measure the impact of datasets on performance.
The best performance was obtained by combining datasets from YouTube comments, Twitter, and Gab with an F1-score of 0.74 and 0.68 for English and German YouTube comments.
arXiv Detail & Related papers (2024-10-02T10:22:53Z)
- Unveiling Social Media Comments with a Novel Named Entity Recognition System for Identity Groups [2.5849042763002426]
We develop a Named Entity Recognition (NER) System for Identity Groups.
Our tool not only detects whether a sentence contains an attack but also tags the sentence tokens corresponding to the mentioned group.
We tested the utility of our tool in a case study on social media, annotating and comparing comments from Facebook related to news mentioning identity groups.
arXiv Detail & Related papers (2024-05-13T19:33:18Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Ceasing hate with MoH: Hate Speech Detection in Hindi-English Code-Switched Language [2.9926023796813728]
This work focuses on analyzing hate speech in Hindi-English code-switched language.
To contain the structure of data, we developed MoH or Map Only Hindi, which means "Love" in Hindi.
The MoH pipeline includes language identification and Roman-to-Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
arXiv Detail & Related papers (2021-10-18T15:24:32Z)
- Ruddit: Norms of Offensiveness for English Reddit Comments [35.83156813452207]
We create the first dataset of English language Reddit comments that has fine-grained, real-valued scores between -1 and 1.
We show that the method produces highly reliable offensiveness scores.
We evaluate the ability of widely-used neural models to predict offensiveness scores on this new dataset.
arXiv Detail & Related papers (2021-06-10T11:27:47Z)
- Leveraging Multilingual Transformers for Hate Speech Detection [11.306581296760864]
We leverage state-of-the-art Transformer language models to identify hate speech in a multilingual setting.
With a pre-trained multilingual Transformer-based text encoder at the base, we are able to successfully identify and classify hate speech from multiple languages.
arXiv Detail & Related papers (2021-01-08T20:23:50Z)
- NLP-CIC at SemEval-2020 Task 9: Analysing sentiment in code-switching language using a simple deep-learning classifier [63.137661897716555]
Code-switching is a phenomenon in which two or more languages are used in the same message.
We use a standard convolutional neural network model to predict the sentiment of tweets in a blend of Spanish and English languages.
arXiv Detail & Related papers (2020-09-07T19:57:09Z)
- Demoting Racial Bias in Hate Speech Detection [39.376886409461775]
In current hate speech datasets, there exists a correlation between annotators' perceptions of toxicity and signals of African American English (AAE).
In this paper, we use adversarial training to mitigate this bias, introducing a hate speech classifier that learns to detect toxic sentences while demoting confounds corresponding to AAE texts.
Experimental results on a hate speech dataset and an AAE dataset suggest that our method is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.
arXiv Detail & Related papers (2020-05-25T17:43:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.