Language Resources for Dutch Large Language Modelling
- URL: http://arxiv.org/abs/2312.12852v1
- Date: Wed, 20 Dec 2023 09:06:06 GMT
- Title: Language Resources for Dutch Large Language Modelling
- Authors: Bram Vanroy
- Abstract summary: We introduce two fine-tuned variants of the Llama 2 13B model.
We provide a leaderboard to keep track of the performance of (Dutch) models on a number of generation tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the rapid expansion in the variety of large language models, there remains a notable gap in models specifically designed for Dutch. The gap is not only a shortage of pretrained Dutch models but also of data, benchmarks and leaderboards. This work takes a small step towards improving the situation. First, we introduce two fine-tuned variants of the Llama 2 13B model. We first fine-tuned Llama 2 on Dutch-specific web-crawled data and subsequently refined this model on multiple synthetic instruction and chat datasets. These datasets, as well as the model weights, are made available. In addition, we provide a leaderboard to track the performance of (Dutch) models on a number of generation tasks, and we include results for a number of state-of-the-art models, including our own. Finally, we provide a critical conclusion on what we believe is needed to push forward Dutch language models and the whole ecosystem around them.
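The two-stage recipe described in the abstract (continued pretraining on Dutch web-crawled text, followed by instruction/chat fine-tuning) can be sketched with the Hugging Face transformers stack. This is a minimal illustration under stated assumptions only: the dataset file, hyperparameters, and output names below are placeholders and do not reflect the authors' actual configuration or released code.

```python
# Sketch of continued pretraining of Llama 2 on Dutch text (stage 1 of two).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "meta-llama/Llama-2-13b-hf"  # gated repo; requires an accepted license

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Placeholder for a cleaned Dutch web corpus, one document per line.
dutch_web = load_dataset("text", data_files={"train": "dutch_web_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dutch_web.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-13b-dutch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()

# Stage 2 (not shown) reuses the same causal-LM loop, starting from the stage-1
# checkpoint and training on synthetic instruction/chat data rendered into
# prompt-response text sequences.
```

In practice a 13B model would also need multi-GPU or parameter-efficient training; those details are omitted here and documented in the paper and the released artifacts.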
Related papers
- Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining [4.38070902806635]
We set up a benchmark for the languages Croatian, Serbian, Bosnian and Montenegrin.
We show that comparable performance to dedicated from-scratch models can be obtained by additionally pretraining available multilingual models.
We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.
arXiv Detail & Related papers (2024-04-08T11:55:44Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- FinGPT: Large Generative Models for a Small Language [48.46240937758779]
We create large language models (LLMs) for Finnish, a language spoken by less than 0.1% of the world population.
We train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT.
We continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
arXiv Detail & Related papers (2023-11-03T08:05:04Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art fine-tuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- Confidence-based Ensembles of End-to-End Speech Recognition Models [71.65982591023581]
We show that a confidence-based ensemble of 5 monolingual models outperforms a system where model selection is performed via a dedicated language identification block.
We also demonstrate that it is possible to combine base and adapted models to achieve strong results on both original and target data.
arXiv Detail & Related papers (2023-06-27T23:13:43Z)
- DUMB: A Benchmark for Smart Evaluation of Dutch Models [23.811515104842826]
We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a diverse set of datasets for low-, medium- and high-resource tasks.
Relative Error Reduction (RER) compares the DUMB performance of language models to a strong baseline (see the sketch after this list).
The highest performance is achieved by DeBERTaV3 (large), XLM-R (large) and mDeBERTaV3 (base).
arXiv Detail & Related papers (2023-05-22T13:27:37Z)
- Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)
- Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets [19.855120632909124]
We introduce different semantic models for Amharic.
Models are built using word2vec embeddings, a distributional thesaurus (DT), contextual embeddings, and DT embeddings.
We find that newly trained models perform better than pre-trained multilingual models.
arXiv Detail & Related papers (2020-11-02T17:48:25Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z)
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It achieves state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
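As a companion to the DUMB entry above, here is a minimal sketch of a Relative Error Reduction computation, assuming RER is defined over error rates (1 - score) relative to a strong baseline model; the exact task scores and baseline choice used by DUMB may differ from this illustration.

```python
def relative_error_reduction(model_score: float, baseline_score: float) -> float:
    """RER in percent, computed from error rates (1 - score) against a baseline.

    Positive values mean the model removes part of the baseline's remaining
    error; negative values mean it performs worse than the baseline.
    """
    model_error = 1.0 - model_score
    baseline_error = 1.0 - baseline_score
    return 100.0 * (baseline_error - model_error) / baseline_error


# Example: scoring 0.92 against a 0.90 baseline removes 20% of the remaining error.
print(relative_error_reduction(0.92, 0.90))  # 20.0
```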