The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
- URL: http://arxiv.org/abs/2303.03915v1
- Date: Tue, 7 Mar 2023 14:25:44 GMT
- Title: The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
- Authors: Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki,
Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou,
Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško,
Quentin Lhoest, Angelina McMillan-Major, Gerard Dupont, Stella Biderman,
Anna Rogers, Loubna Ben allal, Francesco De Toni, Giada Pistilli,
Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la
Rosa, Paulo Villegas, Tristan Thrush, Shayne Longpre, Sebastian Nagel, Leon
Weber, Manuel Muñoz, Jian Zhu, Daniel Van Strien, Zaid Alyafeai, Khalid
Almubarak, Minh Chien Vu, Itziar Gonzalez-Dios, Aitor Soroa, Kyle Lo, Manan
Dey, Pedro Ortiz Suarez, Aaron Gokaslan, Shamik Bose, David Adelani, Long
Phan, Hieu Tran, Ian Yu, Suhas Pai, Jenny Chim, Violette Lepercq, Suzana
Ilic, Margaret Mitchell, Sasha Alexandra Luccioni, Yacine Jernite
- Abstract summary: The BigScience workshop was formed with the goal of researching and training large language models as a values-driven undertaking.
This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus.
- Score: 36.98035382552118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As language models grow ever larger, the need for large-scale high-quality
text datasets has never been more pressing, especially in multilingual
settings. The BigScience workshop, a 1-year international and multidisciplinary
initiative, was formed with the goal of researching and training large language
models as a values-driven undertaking, putting issues of ethics, harm, and
governance in the foreground. This paper documents the data creation and
curation efforts undertaken by BigScience to assemble the Responsible
Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset
spanning 59 languages that was used to train the 176-billion-parameter
BigScience Large Open-science Open-access Multilingual (BLOOM) language model.
We further release a large initial subset of the corpus and analyses thereof,
and hope to empower large-scale monolingual and multilingual modeling projects
with both the data and the processing tools, as well as stimulate research
around this large multilingual corpus.
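Since the abstract notes that a large initial subset of the corpus is released together with processing tools, the sketch below shows one way such a subset might be loaded with the Hugging Face datasets library. This is a minimal illustration, not the authors' pipeline: the dataset identifier bigscience-data/roots_en_wikipedia and the "text" field are assumptions about how the released components are published, and access to the ROOTS subsets may be gated on the Hub.

```python
# Minimal sketch: stream one assumed ROOTS component from the Hugging Face Hub.
# The dataset name below is an assumption; check the bigscience-data
# organization on the Hub for actual identifiers and access requirements.
from datasets import load_dataset

ds = load_dataset(
    "bigscience-data/roots_en_wikipedia",  # hypothetical component name
    split="train",
    streaming=True,  # iterate without downloading the full component
)

# Inspect a few documents; each record is assumed to carry a "text" field
# alongside provenance metadata.
for i, record in enumerate(ds):
    print(record.get("text", "")[:200])
    if i >= 2:
        break
```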
Related papers
- EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models [50.459861376459656]
EMMA-500 is a large-scale multilingual language model obtained by continued pre-training on texts across 546 languages.
Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity.
arXiv Detail & Related papers (2024-09-26T14:40:45Z)
- Tele-FLM Technical Report [96.19923831660266]
We introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model.
It features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities.
It is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B.
arXiv Detail & Related papers (2024-04-25T14:34:47Z)
- A New Massive Multilingual Dataset for High-Performance Language Technologies [14.375854322321997]
The HPLT language resources are a new massive multilingual dataset including both monolingual and bilingual corpora.
Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of 5.6 trillion word tokens de-duplicated on the document level.
Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens.
arXiv Detail & Related papers (2024-03-20T22:14:39Z)
- X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment [4.571088742209442]
We create a 91K English-Korean-Chinese multilingual, multimodal training dataset.
We develop a bilingual multimodal model that exhibits excellent performance in both Korean and English.
arXiv Detail & Related papers (2024-03-18T01:14:47Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark [7.888702613862612]
This work presents the most extensive open massively multilingual corpus of datasets for training sentiment models.
The corpus consists of 79 manually selected datasets from over 350 datasets reported in the scientific literature.
We present a multi-faceted sentiment classification benchmark summarizing hundreds of experiments conducted on different base models, training objectives, dataset collections, and fine-tuning strategies.
arXiv Detail & Related papers (2023-06-13T16:54:13Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems from WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.