Small Languages, Big Models: A Study of Continual Training on Languages of Norway
- URL: http://arxiv.org/abs/2412.06484v2
- Date: Sun, 02 Feb 2025 23:58:48 GMT
- Title: Small Languages, Big Models: A Study of Continual Training on Languages of Norway
- Authors: David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov, Stephan Oepen
- Abstract summary: Training large language models requires vast amounts of data.
We present a novel three-stage continual training approach that substantially improves the downstream performance.
We release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
- Abstract: Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
Related papers
- LLMic: Romanian Foundation Language Model [76.09455151754062]
We present LLMic, a foundation language model designed specifically for the Romanian Language.
We show that fine-tuning LLMic for language translation after the initial pretraining phase outperforms existing solutions in English-to-Romanian translation tasks.
arXiv Detail & Related papers (2025-01-13T22:14:45Z) - NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian [4.062031248854444]
Norwegian, spoken by only about 5 million people, is under-represented in the most impressive recent breakthroughs in NLP.
To fill this gap, we compiled existing Norwegian datasets and pre-trained four Norwegian Open Language Models.
We find that the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian context.
arXiv Detail & Related papers (2023-12-03T08:09:45Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - NoCoLA: The Norwegian Corpus of Linguistic Acceptability [2.538209532048867]
We present two new Norwegian datasets for evaluating language models.
NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences.
NoCoLA_zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner.
arXiv Detail & Related papers (2023-06-13T14:11:19Z) - BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting [50.24676567971536]
The BLOOM model is a large publicly available multilingual language model, but its pretraining was limited to 46 languages.
We apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages.
We conclude that, with sufficient training data, language adaptation can generalize well to diverse languages.
arXiv Detail & Related papers (2022-12-19T15:24:45Z) - The Importance of Context in Very Low Resource Language Modeling [3.734153902687548]
In very low resource scenarios, statistical n-gram language models outperform state-of-the-art neural models.
We introduce three methods to improve a neural model's performance in the low-resource setting.
arXiv Detail & Related papers (2022-05-10T11:19:56Z) - AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z) - Large-Scale Contextualised Language Modelling for Norwegian [7.5722195869569]
This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks.
In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian.
arXiv Detail & Related papers (2021-04-13T23:18:04Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages [48.28540903568198]
We show that multilinguality is critical to making unsupervised systems practical for low-resource settings.
We present a single model that translates between English and five low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish) in both directions.
We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU.
arXiv Detail & Related papers (2020-09-23T15:07:33Z) - From English To Foreign Languages: Transferring Pre-trained Language Models [0.12691047660244334]
Pre-trained models have demonstrated their effectiveness in many downstream natural language processing (NLP) tasks.
The availability of multilingual pre-trained models enables zero-shot transfer of NLP tasks from high resource languages to low resource ones.
We tackle the problem of transferring an existing pre-trained model from English to other languages under a limited computational budget.
arXiv Detail & Related papers (2020-02-18T00:22:54Z)