A New Korean Text Classification Benchmark for Recognizing the Political
Intents in Online Newspapers
- URL: http://arxiv.org/abs/2311.01712v1
- Date: Fri, 3 Nov 2023 04:59:55 GMT
- Title: A New Korean Text Classification Benchmark for Recognizing the Political
Intents in Online Newspapers
- Authors: Beomjune Kim, Eunsun Lee, Dongbin Na
- Abstract summary: We present a novel Korean text classification dataset that contains various articles.
Our dataset contains 12,000 news articles that may carry political intent, collected from the politics sections of six of the most representative newspaper organizations in South Korea.
To the best of our knowledge, ours is the largest-scale Korean news dataset that contains long texts and addresses multi-task classification problems.
- Score: 6.633601941627045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many users reading online articles in various magazines may have
considerable difficulty in distinguishing the implicit intents in texts. In
this work, we focus on automatically recognizing the political intents of a
given online newspaper by understanding the context of the text. To solve this
task, we present a novel Korean text classification dataset that contains
various articles. We also provide deep-learning-based text classification
baseline models trained on the proposed dataset. Our dataset contains 12,000
news articles that may contain political intentions, from the politics section
of six of the most representative newspaper organizations in South Korea. All
the text samples are simultaneously labeled in two aspects: (1) the level of
political orientation and (2) the level of pro-government stance. To the best of
our knowledge, ours is the largest-scale Korean news dataset that contains long
texts and addresses multi-task classification problems. We also train
recent state-of-the-art (SOTA) language models that are based on transformer
architectures and demonstrate that the trained models show decent text
classification performance. All the code, datasets, and trained models are
available at https://github.com/Kdavid2355/KoPolitic-Benchmark-Dataset.
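To make the two-aspect, multi-task setup described above concrete, the sketch below shows one way such labels could be predicted with a shared transformer encoder and two classification heads. It is a minimal illustration only, not the authors' released implementation (see the repository above for that): the encoder checkpoint (klue/roberta-base), the number of classes per head, and all identifiers are assumptions.

```python
# Minimal multi-task text classification sketch (illustrative, not the authors' code).
# Assumptions: a shared Korean encoder checkpoint and two heads, one per labeling
# aspect (political orientation, pro-government stance); class counts are invented.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TwoAspectClassifier(nn.Module):
    def __init__(self, encoder_name="klue/roberta-base",
                 n_orientation=5, n_pro_government=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.orientation_head = nn.Linear(hidden, n_orientation)
        self.pro_gov_head = nn.Linear(hidden, n_pro_government)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # first-token representation as pooled summary
        return self.orientation_head(cls), self.pro_gov_head(cls)

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
model = TwoAspectClassifier()
batch = tokenizer(["예시 정치 기사 본문 ..."], return_tensors="pt",
                  truncation=True, max_length=512)
orientation_logits, pro_gov_logits = model(batch["input_ids"], batch["attention_mask"])
# Training would sum two cross-entropy losses, one per labeling aspect:
# loss = ce(orientation_logits, y_orientation) + ce(pro_gov_logits, y_pro_gov)
```

At inference time, each head's argmax would give the predicted level for its aspect.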
Related papers
- Newswire: A Large-Scale Structured Database of a Century of Historical News [3.562368079040469]
Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world.
We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers.
The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977.
arXiv Detail & Related papers (2024-06-13T16:20:05Z)
- KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services [5.03606775899383]
"KoMultiText" is a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform.
Our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics.
Our work can provide solutions for real-world hate speech and bias mitigation, contributing directly to the improvement of online community health.
arXiv Detail & Related papers (2023-10-06T15:19:39Z)
- American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers [7.161822501147275]
This study develops a novel deep learning pipeline for extracting full article texts from newspaper images.
The pipeline is applied to the nearly 20 million scans in the Library of Congress's public domain Chronicling America collection.
The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes.
arXiv Detail & Related papers (2023-08-24T00:24:42Z)
- Like a Good Nearest Neighbor: Practical Content Moderation and Text Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters the input text with information from its nearest neighbor (see the sketch after this list).
LaGoNN is effective at flagging undesirable content and text classification, and improves the performance of SetFit.
arXiv Detail & Related papers (2023-02-17T15:43:29Z)
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, more educated, and more urban ZIP codes, are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z)
- Hierarchical Text Classification of Urdu News using Deep Neural Network [0.0]
This paper proposes a deep learning model for hierarchical text classification of news in the Urdu language.
The dataset consists of 51,325 sentences from 8 online news websites belonging to the following genres: Sports, Technology, and Entertainment.
arXiv Detail & Related papers (2021-07-07T11:06:11Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- An Amharic News Text classification Dataset [0.0]
We introduce an Amharic text classification dataset that consists of more than 50k news articles categorized into 6 classes.
The dataset is released with simple baseline results to encourage further studies and better-performing models.
arXiv Detail & Related papers (2021-03-10T16:36:39Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Deep Learning for Text Style Transfer: A Survey [71.8870854396927]
Text style transfer is an important task in natural language generation, which aims to control certain attributes in the generated text.
We present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017.
We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data.
arXiv Detail & Related papers (2020-11-01T04:04:43Z)
- LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification [70.1903083747775]
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
arXiv Detail & Related papers (2020-08-11T16:14:47Z)
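As a rough illustration of the nearest-neighbor idea referenced in the LaGoNN entry above, the sketch below appends the most similar labeled training example to each input before classification. It is a hedged approximation under assumed details: the encoder checkpoint, the "[SEP] label: text" augmentation format, and the toy examples are not taken from the LaGoNN paper.

```python
# Hedged sketch of nearest-neighbor input augmentation (not the LaGoNN implementation).
# Assumptions: a generic sentence encoder and a "[SEP] label: text" augmentation format.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")

train_texts = ["free crypto, click now", "thanks for the helpful answer"]
train_labels = ["undesirable", "benign"]
train_emb = encoder.encode(train_texts, normalize_embeddings=True)

def augment_with_neighbor(text: str) -> str:
    """Append the nearest labeled training example to the input text.
    No new learnable parameters are introduced; only the input text changes."""
    query_emb = encoder.encode([text], normalize_embeddings=True)
    nearest = int(np.argmax(train_emb @ query_emb.T))  # cosine similarity via dot product
    return f"{text} [SEP] {train_labels[nearest]}: {train_texts[nearest]}"

print(augment_with_neighbor("win a prize, click this link"))
# The augmented text would then be fed to a SetFit-style few-shot classifier.
```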
This list is automatically generated from the titles and abstracts of the papers on this site.