Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning
- URL: http://arxiv.org/abs/2402.06619v1
- Date: Fri, 9 Feb 2024 18:51:49 GMT
- Title: Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning
- Authors: Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson,
Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas
Mataciunas, Laura O'Mahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson,
Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem
Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu
Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian
Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün,
Marzieh Fadaee, Sara Hooker
- Abstract summary: Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
- Score: 49.79783940841352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Datasets are foundational to many breakthroughs in modern artificial
intelligence. Many recent achievements in the space of natural language
processing (NLP) can be attributed to the finetuning of pre-trained models on a
diverse set of tasks that enables a large language model (LLM) to respond to
instructions. Instruction fine-tuning (IFT) requires specifically constructed
and annotated datasets. However, existing datasets are almost all in the
English language. In this work, our primary goal is to bridge the language gap
by building a human-curated instruction-following dataset spanning 65
languages. We worked with fluent speakers of languages from around the world to
collect natural instances of instructions and completions. Furthermore, we
create the most extensive multilingual collection to date, comprising 513
million instances through templating and translating existing datasets across
114 languages. In total, we contribute four key resources: we develop and
open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection,
and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case
study in participatory research, involving collaborators from 119 countries. We
see this as a valuable framework for future research collaborations that aim to
bridge gaps in resources.
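To make the templating step concrete, here is a minimal Python sketch, not the authors' pipeline: the template wording, field names, and the commented-out Hugging Face dataset ID are illustrative assumptions, but the overall pattern of rendering an existing labelled example into an instruction-completion pair is what the Aya Collection is described as doing at scale.

```python
# Minimal sketch, not the authors' pipeline: illustrates how a raw labelled
# example can be rendered into an instruction-completion pair via a prompt
# template, in the spirit of the Aya Collection. Template text and field
# names here are assumptions for illustration only.

TEMPLATES = {
    # Hypothetical templates; the real collection uses many variants per
    # task and per language.
    "qa": "Answer the following question.\n\nQuestion: {question}\nAnswer:",
    "summarization": "Summarize the following text.\n\nText: {text}\nSummary:",
}


def to_instruction_pair(example: dict, task: str, target_field: str = "answer") -> dict:
    """Render one raw task example into an (inputs, targets) instruction instance."""
    prompt = TEMPLATES[task].format(**example)
    return {"inputs": prompt, "targets": example[target_field]}


if __name__ == "__main__":
    raw = {"question": "What is the capital of Kenya?", "answer": "Nairobi"}
    print(to_instruction_pair(raw, "qa"))

    # The released human-annotated data can presumably be loaded with the
    # Hugging Face `datasets` library; the dataset ID below is an assumption:
    # from datasets import load_dataset
    # aya = load_dataset("CohereForAI/aya_dataset", split="train")
```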
Related papers
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model [33.87586041774359]
Aya is a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced.
We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages.
We conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models.
arXiv Detail & Related papers (2024-02-12T17:34:13Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages- Hindi and Marathi.
Despite Hindi being the third most spoken language worldwide and Marathi the eleventh, both languages have limited resources for building efficient Question Answering systems.
We release the largest Question-Answering datasets available for these languages, each containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z)
- Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
We aim to create a text classification dataset encompassing a large number of languages.
We leverage parallel translations of the Bible to construct such a dataset.
By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
arXiv Detail & Related papers (2023-05-15T09:43:32Z)
- Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources [38.814057529254846]
We examine the characteristics of 156 publicly available NLP datasets.
We survey language-proficient NLP researchers and crowd workers per language.
We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
arXiv Detail & Related papers (2022-11-28T18:54:33Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems [66.92182084456809]
We introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset from an English ToD dataset.
Our method is based on translating dialogue templates and filling them with local entities in the target-language countries.
We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
arXiv Detail & Related papers (2021-10-14T19:33:04Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with much broader coverage of the world's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)