Adapting to the Low-Resource Double-Bind: Investigating Low-Compute
Methods on Low-Resource African Languages
- URL: http://arxiv.org/abs/2303.16985v1
- Date: Wed, 29 Mar 2023 19:25:43 GMT
- Title: Adapting to the Low-Resource Double-Bind: Investigating Low-Compute
Methods on Low-Resource African Languages
- Authors: Colin Leong, Herumb Shandilya, Bonaventure F. P. Dossou, Atnafu
Lambebo Tonja, Joel Mathew, Abdul-Hakeem Omotayo, Oreen Yousuf, Zainab
Akinjobi, Chris Chinenye Emezue, Shamsudeen Muhammad, Steven Kolawole,
Younwoo Choi, Tosin Adewumi
- Abstract summary: Limited access to high computational resources, added to the data scarcity of African languages, constitutes a real barrier to research on these languages.
We evaluate language adapters as cost-effective approaches to low-resource African NLP.
This opens the door to further experimentation and exploration of the full extent of language adapters' capacities.
- Score: 0.6833698896122186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many natural language processing (NLP) tasks make use of massively
pre-trained language models, which are computationally expensive. However,
limited access to high computational resources, added to the issue of data scarcity of
African languages, constitutes a real barrier to research experiments on these
languages. In this work, we explore the applicability of low-compute approaches
such as language adapters in the context of this low-resource double-bind. We
intend to answer the following question: do language adapters allow those who
are doubly bound by data and compute to practically build useful models?
Through fine-tuning experiments on African languages, we evaluate their
effectiveness as cost-effective approaches to low-resource African NLP. Using
solely free compute resources, our results show that language adapters achieve
performance comparable to massive pre-trained language models, which are heavy
on computational resources. This opens the door to further experimentation and
exploration of the full extent of language adapters' capacities.
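To make the low-compute idea concrete, the following is a minimal PyTorch sketch of the general bottleneck-adapter recipe: small trainable modules are inserted into a frozen pre-trained model, so only a tiny fraction of parameters is updated during fine-tuning. The class names, hidden and bottleneck sizes, and the stand-in "pretrained" block are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a bottleneck adapter wrapped around a frozen layer.
# Names and dimensions are illustrative, not the paper's configuration.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, plus a residual connection."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual keeps the frozen model's representation intact at initialization.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


class AdaptedLayer(nn.Module):
    """Wraps a frozen pre-trained layer and applies an adapter to its output."""

    def __init__(self, base_layer: nn.Module, hidden_dim: int):
        super().__init__()
        self.base_layer = base_layer
        self.adapter = BottleneckAdapter(hidden_dim)
        # Freeze the pre-trained weights; only the adapter is trained.
        for p in self.base_layer.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.base_layer(x))


if __name__ == "__main__":
    hidden_dim = 768
    # Stand-in for one pre-trained transformer block (an assumption for the demo).
    frozen_block = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU())
    layer = AdaptedLayer(frozen_block, hidden_dim)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"trainable parameters: {trainable} / {total}")
```

In practice the adapters would be attached to a real pre-trained language model, for example via an adapter library such as AdapterHub's adapter-transformers; the key point is that only the adapter weights require gradients, which keeps memory and compute within reach of free-tier resources.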
Related papers
- Language Portability Strategies for Open-domain Dialogue with Pre-trained Language Models from High to Low Resource Languages [1.7436854281619139]
We propose a study of linguistic portability strategies of large pre-trained language models (PLMs) used for open-domain dialogue systems.
In particular, the target low-resource language (L_T) will be simulated with French, as it lacks task-specific resources.
arXiv Detail & Related papers (2024-07-01T14:20:54Z)
- LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages [1.149936119867417]
Low-resource languages face significant barriers in AI development due to limited linguistic resources and expertise for data labeling.
We propose leveraging the potential of LLMs in the active learning loop for data annotation.
Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements.
arXiv Detail & Related papers (2024-04-02T19:34:22Z)
- High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models [5.632410663467911]
We explore the extent to which pretrained large language models (LLMs) can bridge the performance gap for under-resourced languages.
We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins.
For all our languages, human evaluation shows performance on a par with humans for our best systems, but BLEU scores collapse compared to English.
arXiv Detail & Related papers (2024-02-19T16:29:40Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
However, their performance in most languages still lags behind that of a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- A Survey on Low-Resource Neural Machine Translation [106.51056217748388]
We classify related works into three categories according to the auxiliary data they used.
We hope that our survey can help researchers to better understand this field and inspire them to design better algorithms.
arXiv Detail & Related papers (2021-07-09T06:26:38Z)
- Low-Resource Language Modelling of South African Languages [6.805575417034369]
We evaluate the performance of open-vocabulary language models on low-resource South African languages.
We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs) and Transformers on small-scale datasets.
Overall, well-regularized RNNs give the best performance across the two isiZulu datasets and the one Sepedi dataset.
arXiv Detail & Related papers (2021-04-01T21:27:27Z)
- Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources [4.119597443825115]
We conduct an empirical study of unsupervised neural machine translation (NMT) for truly low-resource languages.
We show how adding comparable data mined using a bilingual dictionary, along with modest additional compute resources to train the model, can significantly improve its performance (see the sketch after this list).
Our work is the first to quantitatively showcase the impact of different amounts of modest compute resources in low-resource NMT.
arXiv Detail & Related papers (2021-03-24T15:40:28Z)
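As referenced in the last entry above, dictionary-based mining of comparable data can be sketched in a few lines: score each candidate sentence pair by the fraction of source tokens whose dictionary translation appears in the target sentence, and keep only high-scoring pairs. The scoring rule, the 0.5 threshold, and the toy dictionary entries below are assumptions made for illustration, not the procedure or data used in that paper.

```python
# Illustrative sketch of mining comparable sentence pairs with a bilingual
# dictionary: keep pairs where enough source words have a dictionary
# translation present in the candidate target sentence. The dictionary,
# threshold, and example sentences are made up for the demo.

def dictionary_overlap(src: str, tgt: str, bi_dict: dict) -> float:
    """Fraction of source tokens whose dictionary translation occurs in tgt."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if bi_dict.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)


def mine_comparable_pairs(src_sents, tgt_sents, bi_dict, threshold=0.5):
    """Greedily pair each source sentence with its best-scoring target sentence."""
    pairs = []
    for src in src_sents:
        best_tgt, best_score = None, 0.0
        for tgt in tgt_sents:
            score = dictionary_overlap(src, tgt, bi_dict)
            if score > best_score:
                best_tgt, best_score = tgt, score
        if best_tgt is not None and best_score >= threshold:
            pairs.append((src, best_tgt, best_score))
    return pairs


if __name__ == "__main__":
    # Toy English-to-target dictionary (hypothetical entries).
    bi_dict = {"water": {"amanzi"}, "good": {"kuhle"}, "house": {"indlu"}}
    src_sents = ["water is good", "the house is big"]
    tgt_sents = ["amanzi kuhle", "indlu enkulu"]
    for src, tgt, score in mine_comparable_pairs(src_sents, tgt_sents, bi_dict):
        print(f"{score:.2f}  {src!r} <-> {tgt!r}")
```

Pairs retained this way would simply be added to the NMT training data, which is the augmentation that the entry above reports as improving performance.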