Benchmarking Multilabel Topic Classification in the Kyrgyz Language
- URL: http://arxiv.org/abs/2308.15952v1
- Date: Wed, 30 Aug 2023 11:02:26 GMT
- Title: Benchmarking Multilabel Topic Classification in the Kyrgyz Language
- Authors: Anton Alekseev, Sergey I. Nikolenko, Gulnara Kabaeva
- Abstract summary: We present a new public benchmark for topic classification in Kyrgyz based on collected and annotated data from the news site 24.KG.
We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and proposing directions for future work.
- Score: 6.15353988889181
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Kyrgyz is a very underrepresented language in terms of modern natural
language processing resources. In this work, we present a new public benchmark
for topic classification in Kyrgyz, introducing a dataset based on collected
and annotated data from the news site 24.KG and presenting several baseline
models for news classification in the multilabel setting. We train and evaluate
both classical statistical and neural models, reporting the scores, discussing
the results, and proposing directions for future work.
Related papers
- A Dataset and Strong Baselines for Classification of Czech News Texts [0.0]
We present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets.
We define four classification tasks: news source, news category, inferred author's gender, and day of the week.
We show that language-specific pre-trained encoder analysis outperforms selected commercially available large-scale generative language models.
arXiv Detail & Related papers (2023-07-20T07:47:08Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Text classification dataset and analysis for Uzbek language [0.0]
We first present a newly obtained dataset for Uzbek text classification, which was collected from 10 different news and press websites.
We also present a comprehensive evaluation of different models, ranging from traditional bag-of-words models to deep learning architectures.
Our experiments show that the Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) based models outperform the rule-based models.
arXiv Detail & Related papers (2023-02-28T11:21:24Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- Simplifying Multilingual News Clustering Through Projection From a Shared Space [0.39560040546164016]
The task of organizing and clustering multilingual news articles for media monitoring is essential to follow news stories in real time.
Most approaches to this task focus on high-resource languages (mostly English), with low-resource languages being disregarded.
We present a much simpler online system that is able to cluster an incoming stream of documents without depending on language-specific features.
arXiv Detail & Related papers (2022-04-28T11:32:49Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z)
- KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi [18.01565807026177]
We introduce two news datasets for classification of news articles in Kinyarwanda and Kirundi, two low-resource African languages.
We provide statistics, guidelines for preprocessing, and monolingual and cross-lingual baseline models.
Our experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi.
arXiv Detail & Related papers (2020-10-23T05:37:42Z)
- Deep Learning Based Text Classification: A Comprehensive Review [75.8403533775179]
We provide a review of more than 150 deep learning based models for text classification developed in recent years.
We also provide a summary of more than 40 popular datasets widely used for text classification.
arXiv Detail & Related papers (2020-04-06T02:00:30Z)
- Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We also create a news topic classification task.
We investigate a data augmentation approach that is better suited to low-resource languages.
arXiv Detail & Related papers (2020-02-18T13:58:06Z)