KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for
Kinyarwanda and Kirundi
- URL: http://arxiv.org/abs/2010.12174v1
- Date: Fri, 23 Oct 2020 05:37:42 GMT
- Title: KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for
Kinyarwanda and Kirundi
- Authors: Rubungo Andre Niyongabo and Hong Qu and Julia Kreutzer and Li Huang
- Abstract summary: We introduce two news datasets for classification of news articles in Kinyarwanda and Kirundi, two low-resource African languages.
We provide statistics, guidelines for preprocessing, and monolingual and cross-lingual baseline models.
Our experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi.
- Score: 18.01565807026177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in text classification has been focused on high-resource
languages such as English and Chinese. For low-resource languages, amongst them
most African languages, the lack of well-annotated data and effective
preprocessing, is hindering the progress and the transfer of successful
methods. In this paper, we introduce two news datasets (KINNEWS and KIRNEWS)
for multi-class classification of news articles in Kinyarwanda and Kirundi, two
low-resource African languages. The two languages are mutually intelligible,
but while Kinyarwanda has been studied in Natural Language Processing (NLP) to
some extent, this work constitutes the first study on Kirundi. Along with the
datasets, we provide statistics, guidelines for preprocessing, and monolingual
and cross-lingual baseline models. Our experiments show that training
embeddings on the relatively higher-resourced Kinyarwanda yields successful
cross-lingual transfer to Kirundi. In addition, the design of the created
datasets allows for wider use in NLP beyond text classification in future
studies, such as representation learning, cross-lingual learning with more
distant languages, or as a basis for new annotations for tasks such as parsing,
POS tagging, and NER. The datasets, stopwords, and pre-trained embeddings are
publicly available at https://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpus .
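The cross-lingual baseline behind the last result lends itself to a compact illustration. Below is a minimal sketch, assuming the KINNEWS and KIRNEWS files are CSVs with "text" and "label" columns and using gensim Word2Vec with average-pooled document vectors and a logistic-regression classifier; the file names, column layout, and model choices are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the cross-lingual baseline: train word embeddings on the
# higher-resourced Kinyarwanda (KINNEWS), then classify Kirundi (KIRNEWS)
# articles with those same embeddings. File names/columns are assumptions.
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

kin = pd.read_csv("kinnews_cleaned.csv")  # assumed columns: "text", "label"
kir = pd.read_csv("kirnews_cleaned.csv")

kin_tokens = [t.lower().split() for t in kin["text"]]
kir_tokens = [t.lower().split() for t in kir["text"]]

# 1) Embeddings trained only on the higher-resourced Kinyarwanda corpus.
w2v = Word2Vec(sentences=kin_tokens, vector_size=100, window=5, min_count=2)

def doc_vector(tokens, wv):
    """Average the embeddings of in-vocabulary tokens; zeros if none match."""
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

X_kin = np.stack([doc_vector(t, w2v.wv) for t in kin_tokens])
X_kir = np.stack([doc_vector(t, w2v.wv) for t in kir_tokens])

# 2) Train on Kinyarwanda labels, evaluate zero-shot on Kirundi.
clf = LogisticRegression(max_iter=1000).fit(X_kin, kin["label"])
print("Kirundi accuracy:", accuracy_score(kir["label"], clf.predict(X_kir)))
```

The key design point is that the embedding vocabulary comes only from Kinyarwanda: because the two languages are mutually intelligible and share much lexical material, those vectors still cover enough Kirundi tokens for zero-shot transfer.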
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
To date, there has been no publicly available NLI corpus for Romanian.
We introduce the first Romanian NLI corpus (RoNLI), comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Benchmarking Multilabel Topic Classification in the Kyrgyz Language [6.15353988889181]
We present a new public benchmark for topic classification in Kyrgyz based on collected and annotated data from the news site 24.KG.
We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and proposing directions for future work.
arXiv Detail & Related papers (2023-08-30T11:02:26Z)
- Izindaba-Tindzaba: Machine learning news categorisation for Long and Short Text for isiZulu and Siswati [1.666378501554705]
Local/Native South African languages are classified as low-resource languages.
In this work, the focus is on creating annotated news datasets for the isiZulu and Siswati native languages.
arXiv Detail & Related papers (2023-06-12T21:02:12Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages (a generic sketch of this pipeline appears after this list).
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Geographical Distance Is The New Hyperparameter: A Case Study Of Finding The Optimal Pre-trained Language For English-isiZulu Machine Translation [0.0]
This study explores the potential benefits of transfer learning in an English-isiZulu translation framework.
We gathered results from 8 different language corpora, including one multilingual corpus, and found that isiXa-isiZulu outperformed all other languages.
We also derived a new coefficient, Nasir's Geographical Distance Coefficient (NGDC) which provides an easy selection of languages for the pre-trained models.
arXiv Detail & Related papers (2022-05-17T20:41:25Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Low resource language dataset creation, curation and classification: Setswana and Sepedi -- Extended Abstract [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We propose baselines for classification and investigate a data augmentation approach better suited to low-resource languages.
arXiv Detail & Related papers (2020-03-30T18:03:15Z)
- Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We also create a news topic classification task.
We investigate a data augmentation approach better suited to low-resource languages.
arXiv Detail & Related papers (2020-02-18T13:58:06Z)
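For readers unfamiliar with the "translate-and-test" pipeline revisited by the T3L paper above, here is a minimal generic sketch of the classic two-stage version, not T3L's actual system: source-language text is machine-translated into a high-resource pivot language, then classified by a model trained in that pivot language. The Hugging Face checkpoints named below are illustrative assumptions only.

```python
# Generic translate-and-test sketch (illustrative checkpoints, not T3L itself):
# Stage 1 translates low-resource text into English; Stage 2 classifies it.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

def translate_and_test(texts):
    """Classic two-stage pipeline: MT first, then a pivot-language classifier."""
    translated = [out["translation_text"] for out in translator(texts)]
    return classifier(translated)

# Example: a French sentence classified by an English sentiment model.
print(translate_and_test(["Le film était excellent."]))
```

Keeping the two stages separate, as in this sketch, lets each component be swapped or improved independently; per its abstract, T3L's contribution is to revisit and strengthen exactly this decomposition.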