Adapting to the Low-Resource Double-Bind: Investigating Low-Compute
Methods on Low-Resource African Languages
- URL: http://arxiv.org/abs/2303.16985v1
- Date: Wed, 29 Mar 2023 19:25:43 GMT
- Title: Adapting to the Low-Resource Double-Bind: Investigating Low-Compute
Methods on Low-Resource African Languages
- Authors: Colin Leong, Herumb Shandilya, Bonaventure F. P. Dossou, Atnafu
Lambebo Tonja, Joel Mathew, Abdul-Hakeem Omotayo, Oreen Yousuf, Zainab
Akinjobi, Chris Chinenye Emezue, Shamsudeen Muhammad, Steven Kolawole,
Younwoo Choi, Tosin Adewumi
- Abstract summary: Limited access to high computational resources, added to the data scarcity of African languages, is a real barrier to research on these languages.
We evaluate language adapters as cost-effective approaches to low-resource African NLP.
This opens the door to further experimentation and exploration of the full extent of language adapters' capacities.
- Score: 0.6833698896122186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many natural language processing (NLP) tasks make use of massively
pre-trained language models, which are computationally expensive. However, the
need for high computational resources, added to the data scarcity of African
languages, constitutes a real barrier to research experiments on these
languages. In this work, we explore the applicability of low-compute approaches
such as language adapters in the context of this low-resource double-bind. We
intend to answer the following question: do language adapters allow those who
are doubly bound by data and compute to practically build useful models?
Through fine-tuning experiments on African languages, we evaluate their
effectiveness as cost-effective approaches to low-resource African NLP. Using
solely free compute resources, our results show that language adapters achieve
performance comparable to that of massive pre-trained language models that are
heavy on computational resources. This opens the door to further
experimentation and exploration of the full extent of language adapters'
capacities.
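To make the adapter approach concrete, below is a minimal sketch of bottleneck language-adapter fine-tuning in the spirit of the paper. It assumes the AdapterHub `adapters` package and `xlm-roberta-base` as the backbone; the model name, adapter name, and default adapter configuration are illustrative assumptions, not the authors' reported setup.

```python
# Minimal sketch: add a bottleneck "language adapter" to a frozen multilingual
# backbone and check how few parameters actually need training.
# Assumptions: the AdapterHub `adapters` package (pip install adapters) and
# xlm-roberta-base; adapter name and config are illustrative, not the paper's exact setup.
import adapters
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
adapters.init(model)  # retrofit adapter support onto a standard HF model

# Insert small bottleneck modules into each transformer layer and freeze the
# backbone, so only the adapter weights are updated during fine-tuning on
# in-language text.
model.add_adapter("hausa_lang_adapter")
model.train_adapter("hausa_lang_adapter")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```

Training would then proceed with the package's `AdapterTrainer` or an ordinary PyTorch loop over monolingual text in the target language; because the backbone stays frozen, the memory and compute footprint stays within what free-tier GPUs can handle, which is the point of the low-compute setting.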
Related papers
- Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages [5.376127198656944]
We compare three dataset creation strategies: (1) LLM-assisted dataset generation, (2) machine translation, and (3) human-written data by native speakers, to build a culturally nuanced story comprehension dataset.
Our findings indicate that LLM-assisted data creation outperforms machine translation.
arXiv Detail & Related papers (2025-02-18T15:14:58Z)
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
- Efficient Continual Pre-training of LLMs for Low-resource Languages [45.44796295841526]
We develop a new algorithm to select a subset of texts from a larger corpus.
In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary.
arXiv Detail & Related papers (2024-12-13T16:13:35Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
Their performance in most languages still lags behind that of a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- A Survey on Low-Resource Neural Machine Translation [106.51056217748388]
We classify related works into three categories according to the auxiliary data they used.
We hope that our survey can help researchers to better understand this field and inspire them to design better algorithms.
arXiv Detail & Related papers (2021-07-09T06:26:38Z)
- Low-Resource Language Modelling of South African Languages [6.805575417034369]
We evaluate the performance of open-vocabulary language models on low-resource South African languages.
We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs) and Transformers on small-scale datasets.
Overall, well-regularized RNNs give the best performance across two isiZulu datasets and one Sepedi dataset.
arXiv Detail & Related papers (2021-04-01T21:27:27Z)
- Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources [4.119597443825115]
We conduct an empirical study of unsupervised neural machine translation (NMT) for truly low-resource languages.
We show how adding comparable data mined using a bilingual dictionary, along with modest additional compute resources for training, can significantly improve the model's performance.
Our work is the first to quantitatively showcase the impact of different amounts of modest compute resources in low-resource NMT.
arXiv Detail & Related papers (2021-03-24T15:40:28Z)