Small Languages, Big Models: A Study of Continual Training on Languages of Norway
- URL: http://arxiv.org/abs/2412.06484v2
- Date: Sun, 02 Feb 2025 23:58:48 GMT
- Title: Small Languages, Big Models: A Study of Continual Training on Languages of Norway
- Authors: David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov, Stephan Oepen
- Abstract summary: Training large language models requires vast amounts of data.
We present a novel three-stage continual training approach that substantially improves the downstream performance.
We release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
- Abstract: Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
Related papers
- LLMic: Romanian Foundation Language Model [76.09455151754062]
We present LLMic, a foundation language model designed specifically for the Romanian Language.
We show that fine-tuning LLMic for language translation after the initial pretraining phase outperforms existing solutions in English-to-Romanian translation tasks.
arXiv Detail & Related papers (2025-01-13T22:14:45Z) - NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian [4.062031248854444]
Norwegian, spoken by only about 5 million people, is under-represented in the most impressive recent breakthroughs in NLP.
To fill this gap, we compiled existing Norwegian datasets and pre-trained four Norwegian Open Language Models.
We find that the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian context.
arXiv Detail & Related papers (2023-12-03T08:09:45Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - NoCoLA: The Norwegian Corpus of Linguistic Acceptability [2.538209532048867]
We present two new Norwegian datasets for evaluating language models.
NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences.
NoCoLA_zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner.
arXiv Detail & Related papers (2023-06-13T14:11:19Z) - BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting [50.24676567971536]
The BLOOM model is a large publicly available multilingual language model, but its pretraining was limited to 46 languages.
We apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages.
We conclude that, with sufficient training data, language adaptation can generalize well to diverse languages.
arXiv Detail & Related papers (2022-12-19T15:24:45Z) - The Importance of Context in Very Low Resource Language Modeling [3.734153902687548]
In very low resource scenarios, statistical n-gram language models outperform state-of-the-art neural models.
We introduce three methods to improve a neural model's performance in the low-resource setting.
arXiv Detail & Related papers (2022-05-10T11:19:56Z) - AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z) - Large-Scale Contextualised Language Modelling for Norwegian [7.5722195869569]
This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks.
In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian.
arXiv Detail & Related papers (2021-04-13T23:18:04Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages [48.28540903568198]
We show that multilinguality is critical to making unsupervised systems practical for low-resource settings.
We present a single model that translates between English and five low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish) in both directions.
We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU.
arXiv Detail & Related papers (2020-09-23T15:07:33Z) - From English To Foreign Languages: Transferring Pre-trained Language Models [0.12691047660244334]
Pre-trained models have demonstrated their effectiveness in many downstream natural language processing (NLP) tasks.
The availability of multilingual pre-trained models enables zero-shot transfer of NLP tasks from high resource languages to low resource ones.
We tackle the problem of transferring an existing pre-trained model from English to other languages under a limited computational budget.
arXiv Detail & Related papers (2020-02-18T00:22:54Z)