K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online
News Comment
- URL: http://arxiv.org/abs/2208.10684v1
- Date: Tue, 23 Aug 2022 02:10:53 GMT
- Title: K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online
News Comment
- Authors: Jean Lee, Taejun Lim, Heejun Lee, Bogeun Jo, Yangsok Kim, Heegeun Yoon
and Soyeon Caren Han
- Abstract summary: We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns.
The dataset consists of 109k utterances from news comments, each annotated with 1 to 4 labels.
KR-BERT with a sub-character tokenizer performs best, recognising decomposed characters in each hate speech class.
- Score: 3.428320237347854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online hate speech detection has become important with the growth of
digital devices, but resources in languages other than English are extremely limited.
We introduce K-MHaS, a new multi-label dataset for hate speech detection that
effectively handles Korean language patterns. The dataset consists of 109k
utterances from news comments and provides multi-label classification with 1 to
4 labels per instance, handling subjectivity and intersectionality. We evaluate
strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer performs best,
recognising decomposed characters in each hate speech class.
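The sub-character view that KR-BERT's tokenizer exploits can be sketched in pure Python from Hangul's Unicode layout: each precomposed syllable decomposes arithmetically into lead consonant, vowel, and optional tail consonant (jamo). The function and constant names below are illustrative, not from the paper.

```python
# Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo.
# A sketch of the sub-character representation used by tokenizers such as
# KR-BERT's; names here are illustrative, not the paper's implementation.

CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")            # 19 lead consonants
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")        # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 tails (incl. none)

def to_jamo(text: str) -> str:
    """Replace each Hangul syllable with its lead/vowel/tail jamo sequence."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 19 * 21 * 28:          # within the precomposed block
            lead, rest = divmod(code, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out += [CHOSEONG[lead], JUNGSEONG[vowel], JONGSEONG[tail]]
        else:
            out.append(ch)                    # pass non-Hangul through unchanged
    return "".join(out)

print(to_jamo("한글"))  # → ㅎㅏㄴㄱㅡㄹ
```

Operating on jamo rather than whole syllables lets a model match the shared sub-characters that the abstract says distinguish hate speech classes.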
Related papers
- BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla [0.0]
We introduce BanTH, the first multi-label transliterated Bangla hate speech dataset comprising 37.3k samples.
The samples are sourced from YouTube comments, where each instance is labeled with one or more target groups.
Experiments reveal that our further pre-trained encoders achieve state-of-the-art performance on the BanTH dataset.
arXiv Detail & Related papers (2024-10-17T07:15:15Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- K-HATERS: A Hate Speech Detection Corpus in Korean with Target-Specific Ratings [6.902524826065157]
K-HATERS is a new corpus for hate speech detection in Korean, comprising approximately 192K news comments with target-specific offensiveness ratings.
This study contributes to the NLP research on hate speech detection and resource construction.
arXiv Detail & Related papers (2023-10-24T01:20:05Z)
- Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis [44.17106903728264]
Most hate speech datasets neglect the cultural diversity within a single language.
To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset.
Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate of 26%.
arXiv Detail & Related papers (2023-08-31T13:14:47Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study the output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
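A standard mitigation for the imbalance described above is inverse-frequency class weighting in the training loss. The sketch below uses scikit-learn's "balanced" weighting rule as an illustration of the general idea; it is a common remedy, not the specific method of the paper.

```python
# Inverse-frequency class weights for an imbalanced hate-speech dataset.
# Illustrative sketch of a common mitigation, not the paper's exact method.
from collections import Counter

def class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c]) ('balanced' rule)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in sorted(counts)}

# 90% non-hate (0) vs 10% hate (1): the minority class is upweighted 9x
# relative to the majority class (5.0 vs ~0.556).
labels = [0] * 90 + [1] * 10
print(class_weights(labels))
```

These per-class weights would typically be passed to the loss function (e.g., a weighted cross-entropy) so that hate examples contribute proportionally more to the gradient.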
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Leveraging Multilingual Transformers for Hate Speech Detection [11.306581296760864]
We leverage state-of-the-art Transformer language models to identify hate speech in a multilingual setting.
With a pre-trained multilingual Transformer-based text encoder at the base, we are able to successfully identify and classify hate speech from multiple languages.
arXiv Detail & Related papers (2021-01-08T20:23:50Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.