MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African
Languages
- URL: http://arxiv.org/abs/2305.13989v1
- Date: Tue, 23 May 2023 12:15:33 GMT
- Title: MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African
Languages
- Authors: Cheikh M. Bamba Dione, David Adelani, Peter Nabende, Jesujoba Alabi,
Thapelo Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye
Emezue, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye,
Jonathan Mukiibi, Blessing Sibanda, Bonaventure F. P. Dossou, Andiswa Bukula,
Rooweither Mabuya, Allahsera Auguste Tapo, Edwin Munkoh-Buabeng, Victoire
Memdjokam Koagne, Fatoumata Ouoba Kabore, Amelia Taylor, Godson Kalipe,
Tebogo Macucwa, Vukosi Marivate, Tajuddeen Gwadabe, Mboning Tchiaze Elvis,
Ikechukwu Onyenwe, Gratien Atindogbe, Tolulope Adelani, Idris Akinade,
Olanrewaju Samuel, Marien Nahimana, Théogène Musabeyezu, Emile
Niyomutabazi, Ester Chimhenga, Kudzai Gotosa, Patrick Mizha, Apelete Agbolo,
Seydou Traore, Chinedu Uchechukwu, Aliyu Yusuf, Muhammad Abdullahi and
Dietrich Klakow
- Abstract summary: We present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages.
We discuss the challenges in annotating POS for these languages using the UD (universal dependencies) guidelines.
- Score: 7.86385861664505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present MasakhaPOS, the largest part-of-speech (POS)
dataset for 20 typologically diverse African languages. We discuss the
challenges in annotating POS for these languages using the UD (universal
dependencies) guidelines. We conducted extensive POS baseline experiments using
conditional random field and several multilingual pre-trained language models.
We applied various cross-lingual transfer models trained with data available in
UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best
transfer language(s) in both single-source and multi-source setups greatly
improves the POS tagging performance of the target languages, in particular
when combined with cross-lingual parameter-efficient fine-tuning methods.
Crucially, transferring knowledge from a language that matches the language
family and morphosyntactic properties seems more effective for POS tagging in
unseen languages.
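The baselines above pair a conditional random field with multilingual pre-trained language models fine-tuned for token classification, optionally combined with cross-lingual parameter-efficient fine-tuning. As a rough illustration of the pre-trained-model baseline only, the sketch below fine-tunes an XLM-R checkpoint for UPOS tagging with the Hugging Face transformers library; the checkpoint, label-alignment scheme, and toy example are assumptions for illustration, not the paper's actual configuration.

```python
# Minimal sketch of a multilingual POS-tagging baseline (token
# classification). Checkpoint, data, and hyperparameters are
# placeholders, not the MasakhaPOS experimental setup.
from transformers import AutoModelForTokenClassification, AutoTokenizer

# The 17 UD universal POS tags as the label inventory.
UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]
label2id = {tag: i for i, tag in enumerate(UPOS)}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(UPOS),
    id2label={i: tag for tag, i in label2id.items()},
    label2id=label2id,
)

def encode(words, tags):
    """Align word-level POS tags with subword tokens: only the first
    subword of each word keeps its label, the rest are masked with -100."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == previous:
            labels.append(-100)  # special token or continuation subword
        else:
            labels.append(label2id[tags[word_id]])
        previous = word_id
    enc["labels"] = labels
    return enc

# Toy sentence; a real run would encode the MasakhaPOS train/dev/test splits.
features = encode(["She", "sings", "."], ["PRON", "VERB", "PUNCT"])
```

A parameter-efficient variant would keep this token-classification head but train only adapter or prompt modules instead of the full encoder, roughly the kind of cross-lingual fine-tuning the abstract refers to.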
Related papers
- Exploring transfer learning for Deep NLP systems on rarely annotated languages [0.0]
This thesis investigates the application of transfer learning for Part-of-Speech (POS) tagging between Hindi and Nepali.
We assess whether multitask learning in Hindi, with auxiliary tasks such as gender and singular/plural tagging, can contribute to improved POS tagging accuracy.
arXiv Detail & Related papers (2024-10-15T13:33:54Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models [26.72394783468532]
We propose an efficient method to study the influence of a transfer language on zero-shot performance in another target language.
Our findings suggest that some languages do not largely affect others while some languages, especially ones unseen during pre-training, can be extremely beneficial or detrimental for different target languages.
arXiv Detail & Related papers (2024-03-29T09:52:18Z)
- A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets [1.1647644386277962]
Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP.
We propose assessing linguistic diversity of a data set against a reference language sample.
arXiv Detail & Related papers (2024-03-06T18:14:22Z)
- GradSim: Gradient-Based Language Grouping for Effective Multilingual Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains.
Besides linguistic features, the topics of the datasets play an important role for language grouping.
arXiv Detail & Related papers (2023-10-23T18:13:37Z)
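GradSim, summarized above, groups languages by the similarity of their training gradients. The sketch below illustrates that idea only in outline and is not the authors' implementation: it compares per-language loss gradients on a shared model by cosine similarity; the model, loss function, and `per_language_batches` are hypothetical placeholders.

```python
# Sketch of gradient-similarity language grouping: compute one loss
# gradient per language on a shared model and compare languages by
# cosine similarity. All inputs here are hypothetical placeholders.
import torch

def language_gradient(model, loss_fn, batch):
    """Flattened gradient of the loss on one language's batch."""
    model.zero_grad()
    loss = loss_fn(model(batch["inputs"]), batch["labels"])
    loss.backward()
    return torch.cat([p.grad.flatten()
                      for p in model.parameters() if p.grad is not None])

def pairwise_gradient_similarity(model, loss_fn, per_language_batches):
    """Cosine similarity between the gradients of every language pair;
    languages with high mutual similarity are candidates for one group."""
    grads = {lang: language_gradient(model, loss_fn, batch)
             for lang, batch in per_language_batches.items()}
    return {(a, b): torch.nn.functional.cosine_similarity(ga, gb, dim=0).item()
            for a, ga in grads.items() for b, gb in grads.items()}
```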
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
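The FILTER entry above adds a KL-divergence self-teaching loss computed against auto-generated soft pseudo-labels for the translated text. The snippet below is only a generic sketch of such a term, not FILTER's exact formulation; the temperature and the choice to detach the pseudo-label logits are assumptions.

```python
# Generic KL-divergence self-teaching term: student predictions are
# pulled toward soft pseudo-labels produced on the translated text.
# Temperature and scaling are placeholder choices, not FILTER's values.
import torch.nn.functional as F

def self_teaching_kl(student_logits, pseudo_label_logits, temperature=2.0):
    """KL(soft pseudo-labels || student predictions), averaged per batch."""
    teacher = F.softmax(pseudo_label_logits.detach() / temperature, dim=-1)
    student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2
```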
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)