QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering
and Reading Comprehension
- URL: http://arxiv.org/abs/2107.12708v1
- Date: Tue, 27 Jul 2021 10:09:13 GMT
- Title: QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering
and Reading Comprehension
- Authors: Anna Rogers, Matt Gardner, and Isabelle Augenstein
- Abstract summary: This study is the largest survey of the field to date.
We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work.
We also discuss the implications of over-focusing on English, and survey the current monolingual resources for other languages and multilingual resources.
- Score: 41.6087902739702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Alongside huge volumes of research on deep learning models in NLP in
recent years, there has also been much work on benchmark datasets needed to
track modeling progress. Question answering and reading comprehension have been
particularly prolific in this regard, with over 80 new datasets appearing in
the past two years. This study is the largest survey of the field to date. We
provide an overview of the various formats and domains of the current
resources, highlighting the current lacunae for future work. We further discuss
the current classifications of "reasoning types" in question answering and
propose a new taxonomy. We also discuss the implications of over-focusing on
English, and survey the current monolingual resources for other languages and
multilingual resources. The study is aimed at both practitioners looking for
pointers to the wealth of existing data, and at researchers working on new
resources.
Related papers
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z)
- A Study on Scaling Up Multilingual News Framing Analysis [23.80807884935475]
This study explores the possibility of dataset creation through crowdsourcing.
We first extend framing analysis beyond English news to a multilingual context.
We also present a novel benchmark in Bengali and Portuguese on the immigration and same-sex marriage domains.
arXiv Detail & Related papers (2024-04-01T21:02:18Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Beyond Counting Datasets: A Survey of Multilingual Dataset Construction
and Necessary Resources [38.814057529254846]
We examine the characteristics of 156 publicly available NLP datasets.
We survey language-proficient NLP researchers and crowd workers per language.
We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
arXiv Detail & Related papers (2022-11-28T18:54:33Z)
- A Transfer Learning Pipeline for Educational Resource Discovery with
Application in Leading Paragraph Generation [71.92338855383238]
We propose a pipeline that automates web resource discovery for novel domains.
The pipeline achieves F1 scores of 0.94 and 0.82 when evaluated on two similar but novel target domains.
This is the first study that considers various web resources for survey generation.
arXiv Detail & Related papers (2022-01-07T03:35:40Z)
- Studying Taxonomy Enrichment on Diachronic WordNet Versions [70.27072729280528]
We explore the possibilities of taxonomy extension in a resource-poor setting and present methods which are applicable to a large number of languages.
We create novel English and Russian datasets for training and evaluating taxonomy enrichment models and describe a technique of creating such datasets for other languages.
arXiv Detail & Related papers (2020-11-23T16:49:37Z)
- Low resource language dataset creation, curation and classification:
Setswana and Sepedi -- Extended Abstract [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We propose baselines for classification and investigate an approach to data augmentation better suited to low-resource languages.
arXiv Detail & Related papers (2020-03-30T18:03:15Z)
- Investigating an approach for low resource language dataset creation,
curation and classification: Setswana and Sepedi [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We also create a news topic classification task.
We investigate an approach to data augmentation better suited to low-resource languages.
arXiv Detail & Related papers (2020-02-18T13:58:06Z)