KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased
Speech in Real-World Online Services
- URL: http://arxiv.org/abs/2310.04313v2
- Date: Sun, 12 Nov 2023 17:10:32 GMT
- Title: KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased
Speech in Real-World Online Services
- Authors: Dasol Choi, Jooyoung Song, Eunsun Lee, Jinwoo Seo, Heejune Park,
Dongbin Na
- Abstract summary: "KoMultiText" is a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform.
Our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics.
Our work can provide solutions for real-world hate speech and bias mitigation, contributing directly to the improvement of online community health.
- Score: 5.03606775899383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growth of online services, the need for advanced text classification
algorithms, such as sentiment analysis and biased text detection, has become
increasingly evident. The anonymous nature of online services often leads to
the presence of biased and harmful language, posing challenges to maintaining
the health of online communities. This phenomenon is especially relevant in
South Korea, where large-scale hate speech detection algorithms have not yet
been broadly explored. In this paper, we introduce "KoMultiText", a new
comprehensive, large-scale dataset collected from a well-known South Korean SNS
platform. Our proposed dataset provides annotations including (1) Preferences,
(2) Profanities, and (3) Nine types of Bias for the text samples, enabling
multi-task learning for simultaneous classification of user-generated texts.
Leveraging state-of-the-art BERT-based language models, our approach surpasses
human-level accuracy across diverse classification tasks, as measured by
various metrics. Beyond academic contributions, our work can provide practical
solutions for real-world hate speech and bias mitigation, contributing directly
to the improvement of online community health. Our work provides a robust
foundation for future research aiming to improve the quality of online
discourse and foster societal well-being. All source codes and datasets are
publicly accessible at https://github.com/Dasol-Choi/KoMultiText.
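The annotation scheme described above (one preference label, one profanity label, and nine bias types) lends itself to a multi-task, multi-label classification setup. Below is a minimal sketch, in plain Python, of how per-label sigmoid probabilities from such a model could be decoded into the three annotation groups. The label names, the 11-slot layout, and the 0.5 threshold are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
# Decode multi-task sigmoid outputs into the three annotation groups
# described by KoMultiText: (1) Preference, (2) Profanity, (3) nine bias types.
# Label names and the 0.5 threshold are illustrative assumptions.

from typing import Dict, List

# Hypothetical label layout: index 0 = preference, index 1 = profanity,
# indices 2..10 = the nine bias-type labels (names are placeholders).
BIAS_LABELS: List[str] = [f"bias_type_{i}" for i in range(1, 10)]


def decode_predictions(probs: List[float], threshold: float = 0.5) -> Dict[str, object]:
    """Turn 11 per-label probabilities into structured multi-task predictions."""
    if len(probs) != 2 + len(BIAS_LABELS):
        raise ValueError("expected 11 probabilities: preference, profanity, 9 biases")
    return {
        "preference": probs[0] >= threshold,
        "profanity": probs[1] >= threshold,
        # Multi-label: keep every bias type whose probability clears the threshold.
        "biases": [name for name, p in zip(BIAS_LABELS, probs[2:]) if p >= threshold],
    }


# Example: a comment predicted as profane, with two bias types above threshold.
example = decode_predictions([0.1, 0.9, 0.8, 0.2, 0.1, 0.05, 0.7, 0.3, 0.1, 0.2, 0.4])
print(example)
```

A shared BERT-style encoder with one sigmoid head per label group would produce exactly this kind of probability vector, allowing all three tasks to be predicted in a single forward pass.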
Related papers
- A New Korean Text Classification Benchmark for Recognizing the Political
Intents in Online Newspapers [6.633601941627045]
We present a novel Korean text classification dataset for recognizing political intents in online newspapers.
Our dataset contains 12,000 news articles that may contain political intentions, collected from the politics sections of six of the most representative newspaper organizations in South Korea.
To the best of our knowledge, ours is the largest-scale Korean news dataset that contains long text and addresses multi-task classification problems.
arXiv Detail & Related papers (2023-11-03T04:59:55Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z) - A New Generation of Perspective API: Efficient Multilingual
Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z) - Fine-Tuning Approach for Arabic Offensive Language Detection System:
BERT-Based Model [0.0]
This study investigates the effects of fine-tuning across several Arabic offensive language datasets.
We develop multiple classifiers that use four datasets individually and in combination to gain knowledge about online Arabic offensive content.
arXiv Detail & Related papers (2022-02-07T17:26:35Z) - Whose Language Counts as High Quality? Measuring Language Ideologies in
Text Data Selection [83.3580786484122]
We find that school newspapers from larger schools, located in wealthier, more educated, and urban ZIP codes, are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language
Generation [42.34923623457615]
The Bias in Open-Ended Language Generation (BOLD) dataset consists of 23,679 English text generation prompts.
An examination of text generated from three popular language models reveals that the majority of these models exhibit a larger social bias than human-written Wikipedia text.
arXiv Detail & Related papers (2021-01-27T22:07:03Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z) - Improving Yorùbá Diacritic Restoration [3.301896537513352]
Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics.
Diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as general education on proper usage.
All pre-trained models, datasets, and source code have been released as an open-source project to advance efforts on Yorùbá language technology.
arXiv Detail & Related papers (2020-03-23T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.