A Dataset and Strong Baselines for Classification of Czech News Texts
- URL: http://arxiv.org/abs/2307.10666v1
- Date: Thu, 20 Jul 2023 07:47:08 GMT
- Title: A Dataset and Strong Baselines for Classification of Czech News Texts
- Authors: Hynek Kydlíček, Jindřich Libovický
- Abstract summary: We present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets.
We define four classification tasks: news source, news category, inferred author's gender, and day of the week.
We show that language-specific pre-trained encoders outperform selected commercially available large-scale generative language models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained models for Czech Natural Language Processing are often evaluated
on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple
classification tasks such as sentiment classification or article classification
from a single news source. As an alternative, we present
CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech
classification datasets, composed of news articles from various sources
spanning over twenty years, which allows a more rigorous evaluation of such
models. We define four classification tasks: news source, news category,
inferred author's gender, and day of the week. To verify the task difficulty,
we conducted a human evaluation, which revealed that human performance lags
behind strong machine-learning baselines built upon pre-trained transformer
models. Furthermore, we show that language-specific pre-trained encoders
outperform selected commercially available large-scale generative language
models.
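
As a concrete illustration of the baselines mentioned above, the following is a minimal sketch of fine-tuning a Czech pre-trained encoder for a CZE-NEC-style classification task. It assumes the Hugging Face transformers library and the public ufal/robeczech-base checkpoint; the label set, data, and hyperparameters are illustrative, not the paper's exact setup.

```python
# Minimal sketch: fine-tune a Czech pre-trained encoder for news-category
# classification. Model, labels, and data are illustrative placeholders;
# this is not the paper's exact training configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "ufal/robeczech-base"  # assumption: any Czech encoder would do
LABELS = ["domestic", "foreign", "sport", "culture"]  # hypothetical categories

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=len(LABELS))

# Toy (text, label-index) pairs; real CZE-NEC articles would replace these.
examples = [
    ("Vláda schválila nový rozpočet.", 0),
    ("Domácí tým vyhrál ligové derby.", 2),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for text, label in examples:
    batch = tokenizer(text, truncation=True, max_length=512,
                      return_tensors="pt")
    loss = model(**batch, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```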
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? (2024-03-26)
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts), which makes them suitable for text classification in domains with limited amounts of annotated instances.
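
As a rough illustration of the prompting setup this entry refers to, a zero-shot classification prompt might be built as follows; the template and label set are hypothetical, not taken from the paper.

```python
# Hypothetical zero-shot prompt template for classification with a
# generative LM; not the prompt used in the cited paper.
LABELS = ["sports", "politics", "culture"]

def build_prompt(text: str) -> str:
    return (
        "Classify the following news article into exactly one of these "
        f"categories: {', '.join(LABELS)}.\n\n"
        f"Article: {text}\n"
        "Category:"
    )

print(build_prompt("The national team won the qualifier 2-0."))
```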
- Benchmarking Multilabel Topic Classification in the Kyrgyz Language (2023-08-30)
We present a new public benchmark for topic classification in Kyrgyz based on collected and annotated data from the news site 24.KG.
We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and proposing directions for future work.
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification (2023-06-08)
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
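
For reference, the "translate-and-test" pipeline the paper revisits can be sketched with two off-the-shelf models: translate the input into English, then classify the translation. The checkpoints below are public Hugging Face stand-ins, not the models used in T3L.

```python
# Sketch of translate-and-test: machine-translate, then classify.
# Both checkpoints are assumptions (public stand-ins), not T3L's models.
from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-cs-en")
classify = pipeline("zero-shot-classification",
                    model="facebook/bart-large-mnli")

czech_text = "Vláda dnes schválila nový zákon o daních."
english = translate(czech_text)[0]["translation_text"]
result = classify(english, candidate_labels=["politics", "sports", "economy"])
print(result["labels"][0])  # top predicted label
```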
- Using Language Models on Low-end Hardware (2023-05-03)
This paper evaluates the viability of using fixed language models for training text classification networks on low-end hardware.
We combine language models with a CNN architecture and put together a comprehensive benchmark with 8 datasets covering single-label and multi-label classification of topic, sentiment, and genre.
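
The fixed-language-model idea can be sketched as follows: the encoder is frozen and used only as a feature extractor, so that only a small CNN head is trained. Model choice and dimensions are illustrative, not the paper's configuration.

```python
# Sketch: frozen language model as feature extractor + trainable CNN head.
# Encoder choice and layer sizes are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NAME = "distilbert-base-uncased"  # assumption: any small encoder works
tokenizer = AutoTokenizer.from_pretrained(NAME)
encoder = AutoModel.from_pretrained(NAME)
encoder.requires_grad_(False)  # frozen: no gradients through the LM

class CNNHead(nn.Module):
    def __init__(self, hidden=768, classes=4):
        super().__init__()
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.fc = nn.Linear(128, classes)

    def forward(self, states):  # states: (batch, seq, hidden)
        x = self.conv(states.transpose(1, 2)).relu()  # (batch, 128, seq)
        return self.fc(x.max(dim=2).values)  # max-pool over time

head = CNNHead()
batch = tokenizer(["a short example text"], return_tensors="pt")
with torch.no_grad():
    states = encoder(**batch).last_hidden_state
logits = head(states)  # only CNNHead parameters would be trained
```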
- Text classification dataset and analysis for Uzbek language (2023-02-28)
We first present a newly obtained dataset for Uzbek text classification, which was collected from 10 different news and press websites.
We also present a comprehensive evaluation of different models, ranging from traditional bag-of-words models to deep learning architectures.
Our experiments show that the Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) based models outperform the rule-based models.
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese (2022-05-21)
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
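
In spirit, the mBERTu recipe is continued masked-language-model pre-training of multilingual BERT on target-language text. Below is a minimal sketch, assuming the transformers Trainer API; the corpus file and hyperparameters are placeholders.

```python
# Sketch: continue MLM pre-training of multilingual BERT on new monolingual
# data (mBERTu-style). Corpus path and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

NAME = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForMaskedLM.from_pretrained(NAME)

corpus = load_dataset("text", data_files={"train": "maltese_corpus.txt"})
tokenized = corpus["train"].map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbertu-sketch", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(
        tokenizer, mlm_probability=0.15),
)
trainer.train()
```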
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages (2022-01-27)
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models (2021-05-29)
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model (2020-11-23)
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning (2020-05-01)
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
This list is automatically generated from the titles and abstracts of the papers on this site.