Beyond Counting Datasets: A Survey of Multilingual Dataset Construction
and Necessary Resources
- URL: http://arxiv.org/abs/2211.15649v1
- Date: Mon, 28 Nov 2022 18:54:33 GMT
- Title: Beyond Counting Datasets: A Survey of Multilingual Dataset Construction
and Necessary Resources
- Authors: Xinyan Velocity Yu, Akari Asai, Trina Chatterjee, Junjie Hu and Eunsol
Choi
- Abstract summary: We examine the characteristics of 156 publicly available NLP datasets.
We survey the availability of language-proficient NLP researchers and crowd workers per language.
We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
- Score: 38.814057529254846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the NLP community is generally aware of resource disparities among
languages, we lack research that quantifies the extent and types of such
disparity. Prior surveys estimating the availability of resources based on the
number of datasets can be misleading as dataset quality varies: many datasets
are automatically induced or translated from English data. To provide a more
comprehensive picture of language resources, we examine the characteristics of
156 publicly available NLP datasets. We manually annotate how they are created,
including input text and label sources and tools used to build them, and what
they study, tasks they address and motivations for their creation. After
quantifying the qualitative NLP resource gap across languages, we discuss how
to improve data collection in low-resource languages. We survey the
availability of language-proficient NLP researchers and crowd workers per
language, finding that their estimated availability correlates with dataset
availability. Through
crowdsourcing experiments, we identify strategies for collecting high-quality
multilingual data on the Mechanical Turk platform. We conclude by making macro
and micro-level suggestions to the NLP community and individual researchers for
future multilingual data development.
Related papers
- INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages [26.13077589552484]
Indic-QA is the largest publicly available context-grounded question-answering dataset for 11 major Indian languages from two language families.
We generate a synthetic dataset using the Gemini model to create question-answer pairs given a passage, which is then manually verified for quality assurance.
We evaluate various multilingual Large Language Models and their instruction-fine-tuned variants on the benchmark and observe that their performance is subpar, particularly for low-resource languages.
arXiv Detail & Related papers (2024-07-18T13:57:16Z)
- mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans [27.84922167294656]
It is challenging to curate a dataset for language-specific knowledge and common sense.
Most current multilingual datasets are created through translation, which cannot evaluate such language-specific aspects.
We propose Multilingual CommonsenseQA (mCSQA) based on the construction process of CSQA but leveraging language models for a more efficient construction.
arXiv Detail & Related papers (2024-06-06T16:14:54Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models [4.168157981135698]
We show how knowledge can be distilled from Large Language Models (LLMs) to improve upon learned metrics without requiring human annotators.
We show that the performance of a BLEURT-like model on lower resource languages can be improved in this way.
arXiv Detail & Related papers (2023-02-07T14:35:35Z)
- Dataset Geography: Mapping Language Data to Language Users [17.30955185832338]
We study the geographical representativeness of NLP datasets, aiming to quantify if, and by how much, NLP datasets match the expected needs of the language speakers.
In doing so, we use entity recognition and linking systems, also making important observations about their cross-lingual consistency.
Last, we explore some geographical and economic factors that may explain the observed dataset distributions.
arXiv Detail & Related papers (2021-12-07T05:13:50Z)
- Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We also create a news topic classification task.
We investigate a data augmentation approach better suited to low-resource languages.
arXiv Detail & Related papers (2020-02-18T13:58:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.