Documenting Geographically and Contextually Diverse Data Sources: The
BigScience Catalogue of Language Data and Resources
- URL: http://arxiv.org/abs/2201.10066v1
- Date: Tue, 25 Jan 2022 03:05:23 GMT
- Authors: Angelina McMillan-Major and Zaid Alyafeai and Stella Biderman and
Kimbo Chen and Francesco De Toni and Gérard Dupont and Hady Elsahar and
Chris Emezue and Alham Fikri Aji and Suzana Ilić and Nurulaqilla Khamis and
Colin Leong and Maraim Masoud and Aitor Soroa and Pedro Ortiz Suarez and
Zeerak Talat and Daniel van Strien and Yacine Jernite
- Abstract summary: We present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative.
We identify a geographically diverse set of target language groups for which to collect metadata on potential data sources.
To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, large-scale data collection efforts have prioritized the
amount of data collected in order to improve the modeling capabilities of large
language models. This prioritization, however, has resulted in concerns with
respect to the rights of data subjects represented in data collections,
particularly when considering the difficulty in interrogating these collections
due to insufficient documentation and tools for analysis. Mindful of these
pitfalls, we present our methodology for a documentation-first, human-centered
data collection project as part of the BigScience initiative. We identified a
geographically diverse set of target language groups (Arabic, Basque, Chinese,
Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages,
Portuguese, Spanish, and Vietnamese, as well as programming languages) for
which to collect metadata on potential data sources. To structure this effort,
we developed our online catalogue as a supporting tool for gathering metadata
through organized public hackathons. We present our development process;
analyses of the resulting resource metadata, including distributions over
languages, regions, and resource types; and our lessons learned in this
endeavor.
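To make the documentation-first idea concrete, here is a minimal sketch of what one catalogue record and the paper's distribution analyses might look like. All field names and example entries are hypothetical illustrations, not the BigScience catalogue's actual schema.

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class CatalogueEntry:
    """One record in a documentation-first catalogue of language data sources.

    Field names are illustrative; the real catalogue's schema is richer
    (e.g. data custodian, licensing, and processing details).
    """
    name: str
    resource_type: str          # e.g. "primary source" or "processed dataset"
    languages: list = field(default_factory=list)
    regions: list = field(default_factory=list)
    license: str = "unknown"

# Two made-up entries, standing in for hackathon submissions:
entries = [
    CatalogueEntry("Example News Archive", "primary source",
                   languages=["Arabic"], regions=["North Africa"]),
    CatalogueEntry("Example Web Corpus", "processed dataset",
                   languages=["Catalan", "Spanish"], regions=["Europe"]),
]

# Distributions over languages, as in the paper's resource-metadata analyses:
language_counts = Counter(lang for e in entries for lang in e.languages)
print(language_counts)
```

A structured record like this is what makes the catalogue queryable by language, region, and resource type after the fact, rather than requiring post-hoc interrogation of an undocumented crawl.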
Related papers
- Open the Data! Chuvash Datasets
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages
This work introduces an expansive suite of resources specifically designed for the development of Indic LLMs.
Our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data.
For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models.
arXiv Detail & Related papers (2024-03-11T00:46:56Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
Existing instruction datasets are almost all in English.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures.
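The bootstrap-then-retrieve idea can be sketched in a few lines. Everything here is a toy stand-in: the `expand_seed` function replaces an actual LLM call, and the corpus is three hard-coded sentences rather than a public crawl.

```python
# Toy sketch of seed bootstrapping followed by corpus retrieval.

def expand_seed(seed: str) -> list:
    """Stand-in for the LLM step that bootstraps related query terms."""
    # A real system would prompt a large language model here.
    canned = {"geometry": ["triangle", "angle", "proof"]}
    return [seed] + canned.get(seed, [])

def retrieve(queries: list, corpus: list) -> list:
    """Keyword retrieval of documents matching any bootstrapped query."""
    hits = []
    for doc in corpus:
        if any(q in doc.lower() for q in queries):
            hits.append(doc)
    return hits

corpus = [
    "A triangle has three angles summing to 180 degrees.",
    "Stock markets closed higher on Friday.",
    "Euclid's proof of the infinitude of primes.",
]
docs = retrieve(expand_seed("geometry"), corpus)
```

The expanded queries pull in the proof document even though it never mentions "geometry", which is the sense in which bootstrapping can surface data carrying reasoning procedures beyond the literal seed term.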
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- Soft Prompt Decoding for Multilingual Dense Retrieval
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance.
This is due to the heterogeneous and imbalanced nature of multilingual collections.
We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
arXiv Detail & Related papers (2023-05-15T21:17:17Z)
- Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources
We examine the characteristics of 156 publicly available NLP datasets.
We survey language-proficient NLP researchers and crowd workers per language.
We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
arXiv Detail & Related papers (2022-11-28T18:54:33Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation
The Cross-lingual Outline-based Dialogue dataset (COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems
We introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset from an English ToD dataset.
Our method is based on translating dialogue templates and filling them with local entities in the target-language countries.
We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
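The template-translation-plus-local-entity step can be sketched as follows. The translated template and the entity table here are made up for illustration; GlobalWoZ's actual pipeline uses machine translation and curated per-country entity databases.

```python
# Toy sketch of filling a translated dialogue template with local entities.

def fill_template(template: str, entities: dict) -> str:
    """Replace placeholder slots like [restaurant] with localized values."""
    for slot, value in entities.items():
        template = template.replace("[" + slot + "]", value)
    return template

# Hypothetical Spanish translation of an English template, slots preserved:
template_es = "Quiero reservar una mesa en [restaurant] en [city]."
local_entities = {"restaurant": "Casa Botín", "city": "Madrid"}

utterance = fill_template(template_es, local_entities)
print(utterance)  # "Quiero reservar una mesa en Casa Botín en Madrid."
```

Keeping the slots intact through translation and only then substituting local entities is what lets one English ToD dataset seed many target-language variants without re-annotating dialogue structure.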
arXiv Detail & Related papers (2021-10-14T19:33:04Z)
- Low resource language dataset creation, curation and classification: Setswana and Sepedi -- Extended Abstract
We create datasets that are focused on news headlines for Setswana and Sepedi.
We propose classification baselines and investigate a data augmentation approach better suited to low-resourced languages.
arXiv Detail & Related papers (2020-03-30T18:03:15Z)
- Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi
We create datasets that are focused on news headlines for Setswana and Sepedi.
We also create a news topic classification task.
We investigate an approach to data augmentation better suited to low-resource languages.
arXiv Detail & Related papers (2020-02-18T13:58:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.