The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling
- URL: http://arxiv.org/abs/2303.17183v1
- Date: Thu, 30 Mar 2023 06:42:22 GMT
- Title: The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling
- Authors: Joey Öhman, Severine Verlinden, Ariel Ekgren, Amaru Cuba Gyllensten,
Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Magnus Sahlgren
- Abstract summary: We curate a high-quality dataset consisting of 1.2TB of text in all of the major North Germanic languages.
This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.
- Score: 5.687459576800633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training Large Language Models (LLMs) requires massive amounts of text
data, and the performance of LLMs typically correlates with the scale and
quality of the datasets. This means that it may be challenging to build LLMs
for smaller languages such as the Nordic ones, where the availability of text
corpora is limited. In order to facilitate the development of LLMs in the
Nordic languages, we curate a high-quality dataset consisting of 1.2TB of text
in all of the major North Germanic languages (Danish, Icelandic, Norwegian, and
Swedish), as well as some high-quality English data. This paper details our
considerations and processes for collecting, cleaning, and filtering the
dataset.
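The cleaning and filtering steps are the most hands-on part of the process summarized above. As a rough illustration only, the sketch below shows what a per-document filter for such a corpus can look like: language identification restricted to the target languages, a minimum-length cutoff, and a simple noise-ratio heuristic. The thresholds, helper names, and the choice of fastText's lid.176 model are assumptions made for this example, not the filters actually used for The Nordic Pile.

```python
# Illustrative only: a minimal per-document filter in the spirit of the
# "collecting, cleaning, and filtering" the abstract mentions. Thresholds and
# the use of fastText language ID are assumptions, not the paper's pipeline.
import fasttext  # pip install fasttext

# Pre-trained language-ID model, downloadable from
# https://fasttext.cc/docs/en/language-identification.html
LID_MODEL = fasttext.load_model("lid.176.bin")

# Danish, Icelandic, Norwegian (Bokmål/Nynorsk), Swedish, English
TARGET_LANGS = {"da", "is", "no", "nn", "sv", "en"}
MIN_CHARS = 200            # hypothetical minimum document length
MAX_NON_ALPHA_RATIO = 0.3  # hypothetical noise threshold

def keep_document(text: str) -> bool:
    """Return True if a raw document passes these illustrative filters."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return False
    # Drop documents dominated by markup, digits, or other non-letter noise.
    non_alpha = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    if non_alpha / len(text) > MAX_NON_ALPHA_RATIO:
        return False
    # Keep only the target North Germanic languages plus English.
    labels, _ = LID_MODEL.predict(text.replace("\n", " "))
    return labels[0].removeprefix("__label__") in TARGET_LANGS

docs = [
    "Dette er et lille dansk dokument om sprogmodeller og datakvalitet. " * 5,
    "1234 5678 <div></div> %%%%",
]
kept = [d for d in docs if keep_document(d)]  # only the Danish document survives
```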
Related papers
- EuroLLM: Multilingual Language Models for Europe [76.89545643715368]
We introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs.
We outline the progress made to date, detailing our data collection and filtering process.
We report our performance on multilingual general benchmarks and machine translation.
arXiv Detail & Related papers (2024-09-24T16:51:36Z)
- LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback [61.23008372927665]
We introduce xLLMs-100, which scales the multilingual capabilities of LLaMA and BLOOM to 100 languages.
We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks.
arXiv Detail & Related papers (2024-06-03T20:25:12Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian [4.062031248854444]
Norwegian, spoken by only about 5 million people, is under-represented in the most impressive breakthroughs in NLP tasks.
To fill this gap, we compiled the existing Norwegian dataset and pre-trained 4 Norwegian Open Language Models.
We find that the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian context.
arXiv Detail & Related papers (2023-12-03T08:09:45Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
We aim to create a text classification dataset encompassing a large number of languages.
We leverage parallel translations of the Bible to construct such a dataset.
By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages (a minimal sketch of this projection step follows the list below).
arXiv Detail & Related papers (2023-05-15T09:43:32Z)
- ScandEval: A Benchmark for Scandinavian Natural Language Processing [0.0]
This paper introduces a Scandinavian benchmarking platform, ScandEval, which can benchmark any pretrained model on four different tasks in the Scandinavian languages.
The datasets used in two of the tasks, linguistic acceptability and question answering, are new.
We develop and release a Python package and command-line interface, scandeval, which can benchmark any model that has been uploaded to the Hugging Face Hub, with reproducible results.
arXiv Detail & Related papers (2023-04-03T11:51:46Z)
- Large-Scale Contextualised Language Modelling for Norwegian [7.5722195869569]
This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks.
In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian.
arXiv Detail & Related papers (2021-04-13T23:18:04Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
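The Taxi1500 entry above builds its multilingual data by annotation projection: gold labels assigned to English verses are copied to the corresponding verses in every other translation via the shared verse alignment. The following is a minimal sketch of that idea; the verse IDs, labels, and example sentences are made up for illustration and are not the dataset's actual schema.

```python
# Illustrative annotation projection over verse-aligned parallel text, in the
# spirit of the Taxi1500 summary above. All data below is invented.

# Gold topic labels assigned on the English side, keyed by verse ID.
english_labels: dict[str, str] = {
    "GEN_1_1": "creation",
    "GEN_1_2": "creation",
    "PSA_23_1": "trust",
}

# A target-language translation of the same verses, keyed by the same IDs.
swedish_verses: dict[str, str] = {
    "GEN_1_1": "I begynnelsen skapade Gud himmel och jord.",
    "PSA_23_1": "Herren är min herde, mig skall intet fattas.",
}

def project_labels(labels: dict[str, str], target: dict[str, str]) -> list[tuple[str, str]]:
    """Pair each target-language verse with the label of its aligned English verse."""
    return [(text, labels[vid]) for vid, text in target.items() if vid in labels]

swedish_dataset = project_labels(english_labels, swedish_verses)
# [("I begynnelsen skapade Gud himmel och jord.", "creation"),
#  ("Herren är min herde, mig skall intet fattas.", "trust")]
```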