Efficient Language Adaptive Pre-training: Extending State-of-the-Art
Large Language Models for Polish
- URL: http://arxiv.org/abs/2402.09759v1
- Date: Thu, 15 Feb 2024 07:17:10 GMT
- Title: Efficient Language Adaptive Pre-training: Extending State-of-the-Art
Large Language Models for Polish
- Authors: Szymon Ruciński
- Abstract summary: This study explores the potential of fine-tuning foundational English Large Language Models (LLMs) for generating Polish text.
The first step involves Language Adaptive Pre-training (LAPT) on a high-quality dataset of 3.11 GB, consisting of 276 million Polish tokens.
Our trained model Curie-7B-v1 not only generates Polish text with the lowest perplexity of 3.02 among decoder-based Polish models but also closely rivals the performance of the best Polish encoder-decoder models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This study explores the potential of fine-tuning foundational English Large
Language Models (LLMs) for generating Polish text. The first step involves
Language Adaptive Pre-training (LAPT) on a high-quality dataset of 3.11 GB,
consisting of 276 million Polish tokens. The LAPT is followed by additional
fine-tuning aimed at solving nine KLEJ challenges. Our trained model
Curie-7B-v1 not only generates Polish text with the lowest perplexity of 3.02
among decoder-based Polish models but also closely rivals the performance of
the best Polish encoder-decoder models with a less than 2% gap on 8 out of 9
tasks. Curie-7B-v1 used approximately 2-3% of a typical dataset size to learn
Polish. The LAPT was completed in less than five days using a consumer GPU,
highlighting the method's efficiency. The proficiency of the model in Polish
was significantly enhanced, demonstrating the viability of this approach for
adding new languages to existing LLMs by training just 1.2% of its parameters.
To contribute to the community's collaborative progress, the model has been
released as open-source.
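The abstract notes that only about 1.2% of the model's parameters were trained during LAPT, which points to a parameter-efficient fine-tuning setup. Below is a minimal sketch of how such language-adaptive pre-training could be wired up with LoRA adapters using Hugging Face's transformers and peft libraries; the base checkpoint, LoRA rank, hyperparameters, and the polish_corpus.txt file are illustrative assumptions, not the authors' exact configuration.
```python
# Sketch of Language Adaptive Pre-training (LAPT) with a parameter-efficient
# setup. A LoRA adapter is one common way to train only ~1-2% of a 7B model's
# parameters; whether the paper used exactly this recipe is an assumption here.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "mistralai/Mistral-7B-v0.1"  # assumed English 7B base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# LoRA adapters on the attention projections; a small rank keeps the trainable
# share of parameters in the low single-digit percent range.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the trainable-parameter share

# Hypothetical Polish corpus, one document per line in a plain-text file.
dataset = load_dataset("text", data_files={"train": "polish_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="curie-lapt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
Evaluating the adapted model's perplexity on held-out Polish text would then yield the kind of comparison the abstract reports (3.02 for Curie-7B-v1).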
Related papers
- Bielik v3 Small: Technical Report [0.0]
We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing.
These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts.
arXiv Detail & Related papers (2025-05-05T10:39:51Z)
- Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation [0.0]
Bielik 7B v0.1 is a generative text model for Polish language processing.
It addresses key challenges in language model development through innovative techniques.
It demonstrates significant improvements, achieving a 9 percentage point increase in average score compared to Mistral-7B-v0.1 on the RAG Reader task.
It also excels in the Polish MT-Bench, particularly in Reasoning (6.15/10) and Role-playing (7.83/10) categories.
arXiv Detail & Related papers (2024-10-24T09:16:09Z)
- Open Generative Large Language Models for Galician [1.3049334790726996]
Large language models (LLMs) have transformed natural language processing.
Yet, their predominantly English-centric training has led to biases and performance disparities across languages.
This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician.
We present the first two generative LLMs focused on Galician to bridge this gap.
arXiv Detail & Related papers (2024-06-19T23:49:56Z)
- Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks [0.9786690381850356]
This study presents an in-depth examination of 7 prominent Large Language Models (LLMs) across 17 tasks, using 22 datasets and 13.8 hours of speech in a zero-shot setting, and compares their performance against state-of-the-art (SOTA) models.
Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, in several tasks.
arXiv Detail & Related papers (2024-05-24T11:30:37Z)
- Evaluation of Few-Shot Learning for Classification Tasks in the Polish Language [0.1534667887016089]
We introduce a few-shot benchmark consisting of 7 different classification tasks native to the Polish language.
We conducted an empirical comparison with 0 and 16 shots between fine-tuning, linear probing, SetFit, and in-context learning (ICL) using various pre-trained commercial and open-source models.
ICL achieves the best performance overall, with commercial models like GPT-3.5 and GPT-4 performing strongest.
arXiv Detail & Related papers (2024-04-27T08:53:58Z)
- CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens.
We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio.
To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half of the flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
- DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We find that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z)
- Multilingual Speech Translation with Efficient Finetuning of Pretrained Models [82.22294901727933]
A minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer ability.
Our approach demonstrates strong zero-shot performance in a many-to-many multilingual model.
arXiv Detail & Related papers (2020-10-24T08:15:08Z)
- Pre-training Polish Transformer-based Language Models at Scale [1.0312968200748118]
We present two language models for Polish based on the popular BERT architecture.
We describe our methodology for collecting the data, preparing the corpus, and pre-training the model.
We then evaluate our models on thirteen Polish linguistic tasks, and demonstrate improvements in eleven of them.
arXiv Detail & Related papers (2020-06-07T18:48:58Z)