Ìtàkúròso: Exploiting Cross-Lingual Transferability for Natural Language Generation of Dialogues in Low-Resource, African Languages
- URL: http://arxiv.org/abs/2204.08083v1
- Date: Sun, 17 Apr 2022 20:23:04 GMT
- Title: Ìtàkúròso: Exploiting Cross-Lingual Transferability for Natural Language Generation of Dialogues in Low-Resource, African Languages
- Authors: Tosin Adewumi, Mofetoluwa Adeyemi, Aremu Anuoluwapo, Bukola Peters,
Happy Buzaaba, Oyerinde Samuel, Amina Mardiyyah Rufai, Benjamin Ajibade,
Tajudeen Gwadabe, Mory Moussou Koulibaly Traore, Tunde Ajayi, Shamsuddeen
Muhammad, Ahmed Baruwa, Paul Owoicho, Tolulope Ogunremi, Phylis Ngigi,
Orevaoghene Ahia, Ruqayya Nasir, Foteini Liwicki and Marcus Liwicki
- Abstract summary: We investigate the possibility of cross-lingual transfer from a state-of-the-art (SoTA) deep monolingual model to 6 African languages.
The languages are Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá.
The results show that the hypothesis that deep monolingual models learn some abstractions that generalise across languages holds.
- Score: 0.9511471519043974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the possibility of cross-lingual transfer from a
state-of-the-art (SoTA) deep monolingual model (DialoGPT) to 6 African
languages and compare with 2 baselines (BlenderBot 90M, another SoTA, and a
simple Seq2Seq). The languages are Swahili, Wolof, Hausa, Nigerian Pidgin
English, Kinyarwanda & Yorùbá. Generation of dialogues is known to be a
challenging task for many reasons. It becomes more challenging for African
languages which are low-resource in terms of data. Therefore, we translate a
small portion of the English multi-domain MultiWOZ dataset for each target
language. Besides intrinsic evaluation (i.e. perplexity), we conduct human
evaluation of single-turn conversations by using majority votes and measure
inter-annotator agreement (IAA). The results show that the hypothesis that deep
monolingual models learn some abstractions that generalise across languages
holds. We observe human-like conversations in 5 of the 6 languages, though
transfer applies to different degrees across languages, as expected. The
language with the most transferable properties is Nigerian Pidgin English,
with a human-likeness score of 78.1%, of which 34.4% are unanimous.
unanimous. The main contributions of this paper include the representation
(through the provision of high-quality dialogue data) of under-represented
African languages and demonstrating the cross-lingual transferability
hypothesis for dialogue systems. We also provide the datasets and host the
model checkpoints/demos on the HuggingFace hub for public access.
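For concreteness, the following is a minimal sketch of the transfer recipe the abstract describes: fine-tune the English monolingual DialoGPT on a small translated dialogue file, then report perplexity as the intrinsic metric. The file names, data format, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

def read_pairs(path):
    # Hypothetical format: one "context<TAB>response" pair per line.
    with open(path, encoding="utf-8") as f:
        for line in f:
            ctx, resp = line.rstrip("\n").split("\t")
            yield ctx + tok.eos_token + resp + tok.eos_token

# Fine-tune the English model on the translated target-language dialogues.
model.train()
for epoch in range(3):
    for text in read_pairs("multiwoz_sw_train.tsv"):  # hypothetical file
        ids = tok(text, return_tensors="pt", truncation=True, max_length=256).input_ids
        loss = model(input_ids=ids, labels=ids).loss
        loss.backward()
        optim.step()
        optim.zero_grad()

# Intrinsic evaluation: perplexity = exp(mean cross-entropy on held-out dialogues).
model.eval()
losses = []
with torch.no_grad():
    for text in read_pairs("multiwoz_sw_dev.tsv"):  # hypothetical file
        ids = tok(text, return_tensors="pt", truncation=True, max_length=256).input_ids
        losses.append(model(input_ids=ids, labels=ids).loss.item())
print("perplexity:", math.exp(sum(losses) / len(losses)))
```

The checkpoints the authors host on the HuggingFace hub can be loaded the same way by swapping in their model identifiers.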
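The human evaluation described above reduces per-response annotator votes to a majority label and summarizes agreement with an IAA statistic. Below is a small self-contained sketch with made-up binary ratings; Fleiss' kappa is one common IAA choice and an assumption here, since the abstract does not name the statistic used.

```python
from collections import Counter

ratings = [        # rows: generated responses; columns: annotator votes (1 = human-like)
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]

# Majority vote per item, plus the share of unanimously human-like items.
majority = [Counter(r).most_common(1)[0][0] for r in ratings]
human_like = 100 * sum(majority) / len(majority)
unanimous = 100 * sum(len(set(r)) == 1 and r[0] == 1 for r in ratings) / len(ratings)
print(f"human-likeness: {human_like:.1f}% (unanimous: {unanimous:.1f}%)")

def fleiss_kappa(rows, categories=(0, 1)):
    """Fleiss' kappa for a fixed number of raters per item."""
    n = len(rows[0])  # raters per item
    N = len(rows)     # number of items
    counts = [[row.count(c) for c in categories] for row in rows]
    # Per-item observed agreement P_i, then its mean P_bar.
    P_i = [(sum(c * c for c in cs) - n) / (n * (n - 1)) for cs in counts]
    P_bar = sum(P_i) / N
    # Expected agreement from overall category proportions.
    p_j = [sum(cs[j] for cs in counts) / (N * n) for j in range(len(categories))]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

print("Fleiss' kappa:", round(fleiss_kappa(ratings), 3))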
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer [4.554080966463776]
Multi-lingual language models (LM) have been remarkably successful in enabling natural language tasks in low-resource languages.
We try to better understand how such models, specifically mT5, transfer *any* linguistic and semantic knowledge across languages.
A key finding of this work is that similarity of syntax, morphology and phonology are good predictors of cross-lingual transfer.
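To make the "similarity predicts transfer" finding concrete, here is a rough sketch using the URIEL typological database via the lang2vec package. This pairing is an assumption (the paper computes its predictors differently); cosine similarity between syntax feature vectors serves as a simple syntactic-similarity predictor.

```python
import numpy as np
import lang2vec.lang2vec as l2v

# ISO 639-3 codes: English, Swahili, Yoruba, Hausa.
feats = l2v.get_features("eng swh yor hau", "syntax_knn")  # knn-imputed, fully numeric

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for lang in ["swh", "yor", "hau"]:
    print(f"eng vs {lang}: syntactic similarity = {cosine(feats['eng'], feats[lang]):.3f}")
```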
arXiv Detail & Related papers (2022-12-04T07:22:21Z)
- Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages [19.067718464786463]
We perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent.
To further specialize the multilingual PLM, we remove vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT.
Our approach is competitive with applying LAFT on individual languages while requiring significantly less disk space.
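A minimal sketch of that vocabulary-trimming step, assuming XLM-R as the multilingual PLM and a hypothetical corpus sample; a full implementation would also rebuild the tokenizer so inputs use the remapped ids.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# Collect the token ids actually used by a sample of African-language text.
keep = set(tok.all_special_ids)
with open("african_corpus_sample.txt", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        keep.update(tok(line.strip(), add_special_tokens=False).input_ids)

keep = sorted(keep)
remap = {old: new for new, old in enumerate(keep)}  # old id -> new id

# Slice the input embedding matrix down to the kept rows.
old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(keep), old_emb.size(1),
                             padding_idx=remap.get(tok.pad_token_id))
new_emb.weight.data.copy_(old_emb[keep])
model.set_input_embeddings(new_emb)
print(f"embedding rows: {old_emb.size(0)} -> {len(keep)}")
# Inputs must be encoded with the remapped ids before running MAFT
# on the trimmed model.
```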
arXiv Detail & Related papers (2022-04-13T16:13:49Z)
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while composition is more crucial to the success of cross-lingual transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
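A hedged illustration of that pipeline: mean-pooled sentence representations from a multilingual encoder, clustered with k-means into candidate "representation sprachbünde". The model choice, sample sentences, and number of clusters are illustrative, not the paper's setup.

```python
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base").eval()

samples = {                      # made-up parallel greetings
    "en": "Good morning, how are you?",
    "sw": "Habari ya asubuhi, hujambo?",
    "yo": "Ẹ káàárọ̀, báwo ni?",
    "ha": "Ina kwana, yaya kake?",
}

reps = []
with torch.no_grad():
    for text in samples.values():
        out = model(**tok(text, return_tensors="pt")).last_hidden_state
        reps.append(out.mean(dim=1).squeeze(0).numpy())  # mean-pooled sentence vector

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reps)
for lang, cluster in zip(samples, labels):
    print(lang, "-> representation group", cluster)
```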
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and well-documented language among sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short of translation-based transfer.
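XCOPA is distributed via the Hugging Face hub; below is a minimal look at its Swahili validation split, with the config and field names taken from the public dataset card (verify against the card before relying on them).

```python
from datasets import load_dataset

xcopa_sw = load_dataset("xcopa", "sw", split="validation")
ex = xcopa_sw[0]
print("premise: ", ex["premise"])
print("question:", ex["question"])  # "cause" or "effect"
print("choice1: ", ex["choice1"])
print("choice2: ", ex["choice2"])
print("label:   ", ex["label"])     # index of the plausible alternative
```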
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi [2.76240219662896]
We study the ability of multilingual language models to process an unseen dialect.
We take user-generated North African Arabic, written in Latin script (Arabizi), as our case study.
We show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect.
arXiv Detail & Related papers (2020-05-01T11:29:23Z)
- XPersona: Evaluating Multilingual Personalized Chatbot [76.00426517401894]
We propose a multi-lingual extension of Persona-Chat, namely XPersona.
Our dataset includes persona conversations in six different languages other than English for building and evaluating multilingual personalized agents.
arXiv Detail & Related papers (2020-03-17T07:52:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.