FinGPT: Large Generative Models for a Small Language
- URL: http://arxiv.org/abs/2311.05640v1
- Date: Fri, 3 Nov 2023 08:05:04 GMT
- Title: FinGPT: Large Generative Models for a Small Language
- Authors: Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna
Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas
Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Le Scao,
Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen,
Aija Vahtola, Samuel Antao, Sampo Pyysalo
- Abstract summary: We create large language models (LLMs) for Finnish, a language spoken by less than 0.1% of the world population.
We train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT.
We continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
- Score: 48.46240937758779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) excel in many tasks in NLP and beyond, but most
open models have very limited coverage of smaller languages and LLM work tends
to focus on languages where nearly unlimited data is available for pretraining.
In this work, we study the challenges of creating LLMs for Finnish, a language
spoken by less than 0.1% of the world population. We compile an extensive
dataset of Finnish combining web crawls, news, social media and eBooks. We
pursue two approaches to pretrain models: 1) we train seven monolingual models
from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the
pretraining of the multilingual BLOOM model on a mix of its original training
data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
For model evaluation, we introduce FIN-bench, a version of BIG-bench with
Finnish tasks. We also assess other model qualities such as toxicity and bias.
Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.
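As a rough, hedged illustration of the second approach described in the abstract (continuing the pretraining of BLOOM on a mix of its original data and Finnish), the sketch below interleaves two corpora at fixed sampling probabilities using the Hugging Face datasets library. The 0.7/0.3 ratio and the toy corpora are assumptions for illustration only; the paper's actual data pipeline and mixing proportions may differ.

```python
# Sketch only: mix "original" pretraining data with Finnish text at fixed
# sampling probabilities before feeding it to a causal-LM training loop.
# The 0.7/0.3 split is an assumed ratio, not a value from the paper.
from datasets import Dataset, interleave_datasets

# Toy stand-ins for the original multilingual data and the Finnish corpus.
original = Dataset.from_dict({"text": ["english document", "french document", "code snippet"]})
finnish = Dataset.from_dict({"text": ["suomenkielinen dokumentti 1", "suomenkielinen dokumentti 2"]})

mixed = interleave_datasets(
    [original, finnish],
    probabilities=[0.7, 0.3],           # assumed mixing ratio
    seed=0,
    stopping_strategy="all_exhausted",  # keep sampling until both sources are used up
)

for example in mixed:
    print(example["text"])
```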
Related papers
- Generative Model for Less-Resourced Language with 1 billion parameters [0.0]
GaMS 1B (Generative Model for Slovene with 1 billion parameters) was created by continuing the pretraining of the existing English OPT model.
We develop a new tokenizer adapted to the Slovene, Croatian, and English languages.
We evaluate our models on several classification datasets from the Slovene benchmark suite and on the generative sentence simplification task SENTA.
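A trilingual tokenizer of the kind mentioned above could, for example, be trained along the following lines with the Hugging Face tokenizers library. This is a minimal sketch under the assumption of a BPE tokenizer; the actual tokenizer type, vocabulary size, and training corpus used for GaMS are not specified here.

```python
# Minimal sketch: train a BPE tokenizer on a mixed Slovene/Croatian/English
# corpus. The BPE choice, vocabulary size, and toy corpus are assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "To je slovenski stavek.",       # Slovene
    "Ovo je hrvatska rečenica.",     # Croatian
    "This is an English sentence.",  # English
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "<s>", "</s>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("Slovenski in English.").tokens)
```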
arXiv Detail & Related papers (2024-10-09T13:59:34Z)
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
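As a hedged sketch of the general idea, weight-space merging can be as simple as interpolating the parameters of two models with identical architectures, shown below with PyTorch. The interpolation weight and toy tensors are illustrative assumptions; the paper's actual merging recipe may differ.

```python
# Sketch only: linear interpolation of two state dicts, the simplest form of
# model merging. alpha and the toy tensors are illustrative assumptions.
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Return a new state dict with alpha * A + (1 - alpha) * B per parameter."""
    # Assumes both models share the same architecture and parameter names.
    return {name: alpha * sd_a[name] + (1.0 - alpha) * sd_b[name] for name in sd_a}

# Tiny demonstration with toy "models" (dicts of tensors).
model_a = {"lm_head.weight": torch.ones(2, 4)}
model_b = {"lm_head.weight": torch.zeros(2, 4)}
print(merge_state_dicts(model_a, model_b, alpha=0.25)["lm_head.weight"])
```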
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
- Poro 34B and the Blessing of Multilinguality [3.270981284471548]
Poro 34B is a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages.
We show that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish.
arXiv Detail & Related papers (2024-04-02T11:34:12Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
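The curriculum described above can be pictured as a sampling schedule that raises the share of non-English data over training. The two-stage schedule below follows the 30%/60% figures from the summary; the stage boundary and the way batches are drawn are assumptions for illustration only.

```python
# Sketch only: a two-stage curriculum that samples non-English data with
# probability 0.30 early in pretraining and 0.60 in the final stage. The
# boundary at 50% of training is an assumption for illustration.
import random

def non_english_fraction(step: int, total_steps: int, boundary: float = 0.5) -> float:
    """Return the probability of drawing a non-English batch at this step."""
    return 0.30 if step < boundary * total_steps else 0.60

def sample_source(step: int, total_steps: int, rng: random.Random) -> str:
    """Pick which corpus the next batch comes from."""
    return "non_english" if rng.random() < non_english_fraction(step, total_steps) else "english"

rng = random.Random(0)
total_steps = 1000
for step in (0, 499, 500, 999):
    print(step, non_english_fraction(step, total_steps), sample_source(step, total_steps, rng))
```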
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
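For reference, the macro-averaged F1 quoted above weights every language equally, so performance on rare languages counts as much as on common ones. The toy labels below are illustrative, and scikit-learn is used only as an example implementation.

```python
# Sketch only: macro-averaged F1 computes F1 per class (here, per language)
# and averages the per-class scores with equal weight. Toy labels only.
from sklearn.metrics import f1_score

y_true = ["fin", "fin", "eng", "swe", "swe", "eng"]
y_pred = ["fin", "eng", "eng", "swe", "swe", "eng"]

print(f1_score(y_true, y_pred, average="macro"))
```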
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- SERENGETI: Massively Multilingual Language Models for Africa [5.945320097465418]
We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages.
arXiv Detail & Related papers (2022-12-21T05:54:14Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model to the multilingual setting.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages [0.0]
We train a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian.
We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy.
arXiv Detail & Related papers (2021-12-20T14:26:40Z)
- Towards Fully Bilingual Deep Language Modeling [1.3455090151301572]
We consider whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance in either language.
We create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models.
Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks.
arXiv Detail & Related papers (2020-10-22T12:22:50Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.