Conversations in Galician: a Large Language Model for an
Underrepresented Language
- URL: http://arxiv.org/abs/2311.03812v1
- Date: Tue, 7 Nov 2023 08:52:28 GMT
- Title: Conversations in Galician: a Large Language Model for an
Underrepresented Language
- Authors: Eliseo Bao, Anxo Pérez and Javier Parapar
- Abstract summary: This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language.
We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations.
As a demonstration of the dataset's utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician, a language not originally supported by the model.
- Score: 2.433983268807517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent proliferation of Large Conversation Language Models has
highlighted the economic significance of widespread access to this type of AI
technology in the current information age. Nevertheless, prevailing models
have primarily been trained on corpora consisting of documents written in
popular languages. The dearth of such cutting-edge tools for low-resource
languages further exacerbates their underrepresentation in the current economic
landscape, thereby impacting their native speakers. This paper introduces two
novel resources designed to enhance Natural Language Processing (NLP) for the
Galician language. We present a Galician adaptation of the Alpaca dataset,
comprising 52,000 instructions and demonstrations. This dataset proves
invaluable for enhancing language models by fine-tuning them to more accurately
adhere to provided instructions. Additionally, as a demonstration of the
dataset's utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician,
a language not originally supported by the model, by following the Alpaca
format. This work contributes to the research on multilingual models tailored
for low-resource settings, a crucial endeavor in ensuring the inclusion of all
linguistic communities in the development of Large Language Models. Another
noteworthy aspect of this research is the exploration of how knowledge of a
closely related language, in this case, Portuguese, can assist in generating
coherent text when training resources are scarce. Both the Galician Alpaca
dataset and Cabuxa-7B are publicly accessible on our Huggingface Hub, and we
have made the source code available to facilitate replication of this
experiment and encourage further advancements for underrepresented languages.
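As a concrete illustration of the Alpaca format the abstract refers to, here is a minimal sketch of how one might load an Alpaca-style instruction dataset from the Huggingface Hub and query a fine-tuned model such as Cabuxa-7B. This is not the authors' released code: the repository identifiers below are placeholders, and the prompt template is the standard Stanford Alpaca one, which the abstract states the fine-tuning followed; consult the authors' Huggingface Hub for the actual names.

```python
# Minimal sketch, not the authors' released code. The Hub repo ids below
# are placeholders; replace them with the identifiers published on the
# authors' Huggingface Hub.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

DATASET_ID = "your-org/galician-alpaca"  # placeholder repo id
MODEL_ID = "your-org/cabuxa-7b"          # placeholder repo id

# Standard Stanford Alpaca prompt template (no-input variant); the paper
# reports fine-tuning in this format.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

# Inspect one of the 52,000 instruction/demonstration pairs.
dataset = load_dataset(DATASET_ID, split="train")
print(dataset[0])

# Query the fine-tuned model with an Alpaca-formatted Galician instruction.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = PROMPT_NO_INPUT.format(instruction="Describe a cidade da Coruña.")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

A record in the Alpaca format carries instruction, input, and output fields; the template above covers the case where the input field is empty.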
Related papers
- Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model [14.39119862985503]
We aim to create a multilingual ALT system with available datasets.
Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario.
We evaluate the performance of the multilingual model in comparison to its monolingual counterparts.
arXiv Detail & Related papers (2024-06-25T15:02:32Z)
- Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language [7.289015788793582]
This work focuses on increasing technological participation for the Sámi language.
We draw the attention of the ML community towards the language modeling problem of Ultra Low Resource (ULR) languages.
We have compiled the available Sámi language resources from the web to create a clean dataset for training language models.
arXiv Detail & Related papers (2024-05-09T13:54:22Z)
- CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend (CMULAB), an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
- LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language [7.214355350362308]
The LLaMA (Large Language Model Meta AI) family represents a novel advancement in the field of natural language processing.
This study contributes to Language Adaptation strategies for the Italian language by introducing LLaMAntino, a novel family of Italian LLMs.
arXiv Detail & Related papers (2023-12-15T18:06:22Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model to multiple languages.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank achieves performance close to that of the same architecture pre-trained on large multilingual and monolingual corpora.
arXiv Detail & Related papers (2021-10-26T14:59:16Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages, with the bias expressed as a prior distribution over model parameters.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding [24.149299722716155]
We introduce xSID, a new benchmark for cross-lingual Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect.
We propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer.
Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.
arXiv Detail & Related papers (2021-05-15T23:51:11Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.