BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection
- URL: http://arxiv.org/abs/2005.12503v1
- Date: Tue, 26 May 2020 03:34:01 GMT
- Title: BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection
- Authors: Jihyung Moon, Won Ik Cho, Junbum Lee
- Abstract summary: We first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech.
The comments are annotated regarding social bias and hate speech since both aspects are correlated.
We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Toxic comments in online platforms are an unavoidable social issue under the
cloak of anonymity. Hate speech detection has been actively studied for
languages such as English, German, and Italian, for which manually labeled
corpora have been released. In this work, we first present 9.4K manually labeled entertainment
news comments for identifying Korean toxic speech, collected from a widely used
online news platform in Korea. The comments are annotated regarding social bias
and hate speech since both aspects are correlated. The inter-annotator
agreement Krippendorff's alpha score is 0.492 and 0.496, respectively. We
provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the
highest score on all tasks. The models generally display better performance on
bias identification, since hate speech detection is a more subjective task.
Additionally, when BERT is trained with bias labels for hate speech detection,
the prediction score increases, implying that bias and hate are
intertwined. We make our dataset publicly available and open competitions with
the corpus and benchmarks.
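The abstract reports inter-annotator agreement as Krippendorff's alpha (0.492 and 0.496). As a rough illustration of how that statistic is computed, here is a minimal pure-Python sketch for nominal labels (the function name and toy labels are our own; a full implementation would also handle missing annotations and non-nominal distance metrics):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of label lists, one per annotated item;
    items with fewer than two labels are skipped.
    """
    # Coincidence matrix: each ordered pair of labels within a unit
    # contributes 1/(m-1), where m is that unit's number of labels.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    n = sum(coincidences.values())   # total pairable values
    marginals = Counter()            # per-label totals
    for (a, _), w in coincidences.items():
        marginals[a] += w

    # Observed disagreement: coincidence mass off the diagonal.
    d_o = sum(w for (a, b), w in coincidences.items() if a != b) / n
    # Expected disagreement under chance pairing of the marginals.
    d_e = sum(marginals[a] * marginals[b]
              for a in marginals for b in marginals if a != b) / (n * (n - 1))
    return 1 - d_o / d_e
```

Values around 0.49, as in the paper, sit well below the 0.8 often cited as a reliability threshold, which is consistent with the authors' observation that hate speech annotation is subjective.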
Related papers
- Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts [0.0]
We first introduce the ViTHSD - a targeted hate speech detection dataset for Vietnamese Social Media Texts.
The dataset contains 10K comments, each labeled with specific targets at three levels: clean, offensive, and hate.
The inter-annotator agreement obtained on the dataset is 0.45 by Cohen's Kappa, which indicates moderate agreement.
arXiv Detail & Related papers (2024-04-30T04:16:55Z) - K-HATERS: A Hate Speech Detection Corpus in Korean with Target-Specific
Ratings [6.902524826065157]
K-HATERS is a new corpus for hate speech detection in Korean, comprising approximately 192K news comments with target-specific offensiveness ratings.
This study contributes to the NLP research on hate speech detection and resource construction.
arXiv Detail & Related papers (2023-10-24T01:20:05Z) - SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z) - CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a
Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements ranging from 1.24% to 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z) - BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric [66.73705349465207]
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics.
We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
arXiv Detail & Related papers (2022-12-16T14:00:26Z) - Analyzing the Intensity of Complaints on Social Media [55.140613801802886]
We present the first study in computational linguistics of measuring the intensity of complaints from text.
We create the first Chinese dataset containing 3,103 posts about complaints from Weibo, a popular Chinese social media platform.
We show that complaints intensity can be accurately estimated by computational models with the best mean square error achieving 0.11.
arXiv Detail & Related papers (2022-04-20T10:15:44Z) - APEACH: Attacking Pejorative Expressions with Analysis on
Crowd-Generated Hate Speech Evaluation Datasets [4.034948808542701]
APEACH is a method that allows the collection of hate speech generated by unspecified users.
By controlling the crowd-generation of hate speech and adding only a minimum post-labeling, we create a corpus that enables the generalizable and fair evaluation of hate speech detection.
arXiv Detail & Related papers (2022-02-25T02:04:38Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Leveraging Transformers for Hate Speech Detection in Conversational
Code-Mixed Tweets [36.29939722039909]
This paper describes the system proposed by team MIDAS-IIITD for HASOC 2021 subtask 2.
It is one of the first shared tasks focusing on detecting hate speech from Hindi-English code-mixed conversations on Twitter.
Our best performing system, a hard voting ensemble of Indic-BERT, XLM-RoBERTa, and Multilingual BERT, achieved a macro F1 score of 0.7253.
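The MIDAS-IIITD system combines Indic-BERT, XLM-RoBERTa, and Multilingual BERT by hard voting and reports macro F1. A minimal sketch of both pieces, assuming each model's predictions are already available as label lists (the helper names and tie-breaking rule are our own, not from the paper):

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote across models; ties go to the earliest model's label.

    `predictions` is a list of per-model label lists over the same items.
    """
    votes = []
    for per_model in zip(*predictions):
        counts = Counter(per_model)
        top = max(counts.values())
        votes.append(next(p for p in per_model if counts[p] == top))
    return votes

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in set(gold) | set(pred):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```

Macro F1 weights every class equally, which matters for hate speech tasks where the hateful class is typically a small minority.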
arXiv Detail & Related papers (2021-12-18T19:27:33Z) - Hate speech detection using static BERT embeddings [0.9176056742068814]
Hate speech, which expresses abuse targeting specific group characteristics, is emerging as a major concern.
In this paper, we analyze the performance of hate speech detection by replacing or integrating the word embeddings.
In comparison to fine-tuned BERT, one metric that significantly improved is specificity.
arXiv Detail & Related papers (2021-06-29T16:17:10Z) - Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media
during the COVID-19 Crisis [51.39895377836919]
COVID-19 has sparked racism and hate on social media targeted towards Asian communities.
We study the evolution and spread of anti-Asian hate speech through the lens of Twitter.
We create COVID-HATE, the largest dataset of anti-Asian hate and counterspeech spanning 14 months.
arXiv Detail & Related papers (2020-05-25T21:58:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.