Pipeline Analysis for Developing Instruct LLMs in Low-Resource Languages: A Case Study on Basque
- URL: http://arxiv.org/abs/2412.13922v1
- Date: Wed, 18 Dec 2024 15:05:59 GMT
- Title: Pipeline Analysis for Developing Instruct LLMs in Low-Resource Languages: A Case Study on Basque
- Authors: Ander Corral, Ixak Sarasua, Xabier Saralegi
- Abstract summary: Large language models (LLMs) are typically optimized for resource-rich languages like English, exacerbating the gap between high-resource and underrepresented languages. This work presents a detailed analysis of strategies for developing a model capable of following instructions in a low-resource language, specifically Basque, by focusing on three key stages: pre-training, instruction tuning, and alignment with human preferences.
- Score: 2.867517731896504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are typically optimized for resource-rich languages like English, exacerbating the gap between high-resource and underrepresented languages. This work presents a detailed analysis of strategies for developing a model capable of following instructions in a low-resource language, specifically Basque, by focusing on three key stages: pre-training, instruction tuning, and alignment with human preferences. Our findings demonstrate that continual pre-training with a high-quality Basque corpus of around 600 million words improves natural language understanding (NLU) of the foundational model by over 12 points. Moreover, instruction tuning and human preference alignment using automatically translated datasets proved highly effective, resulting in a 24-point improvement in instruction-following performance. The resulting models, Llama-eus-8B and Llama-eus-8B-instruct, establish a new state-of-the-art for Basque in the sub-10B parameter category.
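The three stages named in the abstract map onto standard open-source tooling. Below is a minimal sketch of such a pipeline using Hugging Face transformers and trl; it is not the authors' released training code, and the dataset files, model identifier, and hyperparameters are placeholders.

```python
# Minimal sketch of the three-stage pipeline; dataset files, model name, and
# hyperparameters are placeholders, not the paper's actual configuration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base = "meta-llama/Llama-3.1-8B"  # assumed foundational model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Stage 1: continual pre-training on a high-quality Basque corpus (the paper
# uses ~600M words; SFTTrainer on raw text runs a causal-LM objective).
corpus = load_dataset("text", data_files="basque_corpus.txt")["train"]
cpt = SFTTrainer(
    model=model,
    train_dataset=corpus,
    args=SFTConfig(output_dir="llama-eus-8b"),
    processing_class=tokenizer,  # named tokenizer= in older trl versions
)
cpt.train()

# Stage 2: instruction tuning on automatically translated instruction data.
instructions = load_dataset("json", data_files="instructions_eu.jsonl")["train"]
sft = SFTTrainer(
    model=cpt.model,
    train_dataset=instructions,  # prompt/completion pairs
    args=SFTConfig(output_dir="llama-eus-8b-sft"),
    processing_class=tokenizer,
)
sft.train()

# Stage 3: alignment with human preferences on translated preference data
# (DPO is one common choice; the paper's exact algorithm may differ).
prefs = load_dataset("json", data_files="preferences_eu.jsonl")["train"]
dpo = DPOTrainer(
    model=sft.model,
    train_dataset=prefs,  # prompt/chosen/rejected triples
    args=DPOConfig(output_dir="llama-eus-8b-instruct", beta=0.1),
    processing_class=tokenizer,
)
dpo.train()
```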
Related papers
- UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings [0.7874708385247353]
This paper introduces UrduLLaMA 1.0, a model derived from the open-source Llama-3.1-8B-Instruct architecture.
We leverage Low-Rank Adaptation (LoRA) to fine-tune the model on 41,000 Urdu instructions and approximately 50,000 English-Urdu translation pairs.
arXiv Detail & Related papers (2025-02-24T08:38:21Z)
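For reference, a LoRA setup of the kind the UrduLLaMA entry describes takes only a few lines with peft; the rank, scaling factor, and target modules below are illustrative defaults, not the paper's reported configuration.

```python
# Generic LoRA fine-tuning setup with peft (illustrative hyperparameters,
# not the UrduLLaMA paper's actual configuration).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
# The adapted model can then be trained on instruction and translation pairs
# with any standard Trainer loop.
```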
- SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment [78.4550589538805]
We propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism.
Experimental results show that our method, SLAM, tunes only the feed-forward sub-layers of 6 layers, amounting to 6.5-8% of all parameters in 7B and 13B LLMs.
arXiv Detail & Related papers (2025-01-07T10:29:43Z)
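SLAM-style selective tuning can be approximated by freezing everything except the feed-forward sub-layers of a chosen subset of transformer layers; which layers to select is the method's actual contribution, so the indices below are placeholders.

```python
# Sketch of selective layer tuning in the spirit of SLAM: freeze all
# parameters, then unfreeze only the feed-forward (MLP) sub-layers of a
# chosen subset of layers. The layer indices here are placeholders;
# identifying the multilingual layers is the method's contribution.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
selected_layers = {24, 25, 26, 27, 28, 29}  # hypothetical choice of 6 layers

for param in model.parameters():
    param.requires_grad = False              # freeze everything

for idx, layer in enumerate(model.model.layers):
    if idx in selected_layers:
        for param in layer.mlp.parameters():  # feed-forward sub-layer only
            param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"tuning {trainable / total:.1%} of parameters")
```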
- SnakModel: Lessons Learned from Training an Open Danish Large Language Model [18.6197271443555]
We present SnakModel, a Danish large language model (LLM) based on Llama2-7B. We continuously pre-train on 13.6B Danish words, and further tune on 3.7M Danish instructions.
arXiv Detail & Related papers (2024-12-17T14:38:21Z)
- DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.
We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.
Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z)
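The curation operations the DCLM testbed benchmarks (deduplication, filtering, mixing) reduce to a few primitives. The toy versions below illustrate the shape of such a pipeline; DCLM's real recipes use far more sophisticated machinery (e.g., fuzzy deduplication and model-based quality filters).

```python
# Toy versions of the curation primitives the DCLM benchmark exercises.
# Exact-hash dedup and a length filter stand in for DCLM's real pipelines.
import hashlib

def deduplicate(docs):
    """Drop exact duplicates by content hash."""
    seen, out = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(doc)
    return out

def length_filter(docs, min_words=50):
    """Keep documents with at least min_words words."""
    return [d for d in docs if len(d.split()) >= min_words]

def mix(sources, weights):
    """Combine data sources according to mixing weights (simplified:
    take a proportional slice of each source)."""
    total = sum(weights)
    return [d for src, w in zip(sources, weights)
            for d in src[: int(len(src) * w / total)]]
```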
- Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z)
- Cross-lingual Transfer Learning for Javanese Dependency Parsing [0.20537467311538835]
This study focuses on assessing the efficacy of transfer learning in enhancing dependency parsing for Javanese.
We utilize the Universal Dependencies dataset consisting of dependency treebanks from more than 100 languages, including Javanese.
arXiv Detail & Related papers (2024-01-22T16:13:45Z)
- PLUG: Leveraging Pivot Language in Cross-Lingual Instruction Tuning [46.153828074152436]
We propose a pivot language guided generation approach to enhance instruction tuning in lower-resource languages.
It trains the model to first process instructions in the pivot language, and then produce responses in the target language.
Our approach demonstrates a significant improvement in the instruction-following abilities of LLMs by 29% on average.
arXiv Detail & Related papers (2023-11-15T05:28:07Z)
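The PLUG recipe amounts to a training-example format in which the response passes through the pivot language first. The template below is an illustrative reconstruction, not the paper's exact format, with Basque as the target language.

```python
# Illustrative PLUG-style training example: target-language instruction in,
# pivot-language (English) response followed by the target-language response
# out. The template is a reconstruction, not the paper's exact format.
def plug_example(instruction_eu, response_en, response_eu):
    prompt = f"### Instruction:\n{instruction_eu}\n\n### Response:\n"
    completion = f"[English] {response_en}\n[Basque] {response_eu}"
    return {"prompt": prompt, "completion": completion}

example = plug_example(
    "Azaldu zergatik den zerua urdina.",  # "Explain why the sky is blue."
    "The sky looks blue because air scatters short wavelengths most.",
    "Zerua urdin ikusten da aireak uhin-luzera laburrak gehien sakabanatzen dituelako.",
)
print(example["prompt"] + example["completion"])
```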
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
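Assembling linguistically-diverse exemplars into a single translation prompt might look like the following sketch; the exemplars are invented for illustration.

```python
# Sketch: linguistically-diverse few-shot exemplars (invented here) prompt an
# LLM to translate an unseen low-resource language into English.
exemplars = [
    ("French", "Le chat dort.", "The cat is sleeping."),
    ("Spanish", "Hace mucho frío.", "It is very cold."),
    ("German", "Wo ist der Bahnhof?", "Where is the train station?"),
]

def build_prompt(source_sentence):
    lines = [f"{lang}: {src}\nEnglish: {tgt}" for lang, src, tgt in exemplars]
    lines.append(f"Source: {source_sentence}\nEnglish:")
    return "\n\n".join(lines)

print(build_prompt("Eguzkia atera da."))  # Basque: "The sun has come out."
```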
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Geographical Distance Is The New Hyperparameter: A Case Study Of Finding The Optimal Pre-trained Language For English-isiZulu Machine Translation [0.0]
This study explores the potential benefits of transfer learning in an English-isiZulu translation framework.
We gathered results from 8 different language corpora, including one multilingual corpus, and found that isiXa-isiZulu outperformed all other languages.
We also derived a new coefficient, Nasir's Geographical Distance Coefficient (NGDC) which provides an easy selection of languages for the pre-trained models.
arXiv Detail & Related papers (2022-05-17T20:41:25Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis, taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- Learning to Learn Morphological Inflection for Resource-Poor Languages [105.11499402984482]
We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem.
Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters.
Experiments with two model architectures on 29 target languages from 3 families show that our suggested approach outperforms all baselines.
arXiv Detail & Related papers (2020-04-28T05:13:17Z)
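Treating each language as a task and learning an initialization from high-resource languages is the MAML pattern. Below is a compact sketch with a generic PyTorch classifier standing in for the paper's inflection model; the architecture and hyperparameters are placeholders.

```python
# MAML-style meta-learning sketch: each high-resource language is one task,
# and the meta-learned initialization is later fine-tuned on the
# resource-poor target language. A generic classifier stands in for the
# paper's actual seq2seq inflection model.
import torch
from torch import nn
from torch.func import functional_call

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
outer_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 1e-2

def task_loss(params, batch):
    x, y = batch
    return nn.functional.cross_entropy(functional_call(model, params, (x,)), y)

def maml_step(tasks):
    """One meta-update over (support, query) sets, one pair per language."""
    outer_opt.zero_grad()
    for support, query in tasks:
        params = dict(model.named_parameters())
        # Inner loop: one adaptation step on the language's support set.
        grads = torch.autograd.grad(task_loss(params, support),
                                    list(params.values()), create_graph=True)
        adapted = {name: p - inner_lr * g
                   for (name, p), g in zip(params.items(), grads)}
        # Outer loss: evaluate the adapted parameters on the query set.
        task_loss(adapted, query).backward()
    outer_opt.step()
```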
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.