Related papers: QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension

QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension

URL: http://arxiv.org/abs/2107.12708v1
Date: Tue, 27 Jul 2021 10:09:13 GMT
Title: QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension
Authors: Anna Rogers, Matt Gardner, and Isabelle Augenstein
Abstract summary: This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We also discuss the implications of over-focusing on English, and survey the current monolingual resources for other languages and multilingual resources.
Score: 41.6087902739702
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Alongside huge volumes of research on deep learning models in NLP in the recent years, there has been also much work on benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with over 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of ``reasoning types" in question answering and propose a new taxonomy. We also discuss the implications of over-focusing on English, and survey the current monolingual resources for other languages and multilingual resources. The study is aimed at both practitioners looking for pointers to the wealth of existing data, and at researchers working on new resources.

Related papers

Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA) We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature. We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z)
A Study on Scaling Up Multilingual News Framing Analysis [23.80807884935475]
This study explores the possibility of dataset creation through crowdsourcing. We first extend framing analysis beyond English news to a multilingual context. We also present a novel benchmark in Bengali and Portuguese on the immigration and same-sex marriage domains.
arXiv Detail & Related papers (2024-04-01T21:02:18Z)
A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset. Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive. Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources [38.814057529254846]
We examine the characteristics of 156 publicly available NLP datasets. We survey language-proficient NLP researchers and crowd workers per language. We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
arXiv Detail & Related papers (2022-11-28T18:54:33Z)
A Transfer Learning Pipeline for Educational Resource Discovery with Application in Leading Paragraph Generation [71.92338855383238]
We propose a pipeline that automates web resource discovery for novel domains. The pipeline achieves F1 scores of 0.94 and 0.82 when evaluated on two similar but novel target domains. This is the first study that considers various web resources for survey generation.
arXiv Detail & Related papers (2022-01-07T03:35:40Z)
Studying Taxonomy Enrichment on Diachronic WordNet Versions [70.27072729280528]
We explore the possibilities of taxonomy extension in a resource-poor setting and present methods which are applicable to a large number of languages. We create novel English and Russian datasets for training and evaluating taxonomy enrichment models and describe a technique of creating such datasets for other languages.
arXiv Detail & Related papers (2020-11-23T16:49:37Z)
Low resource language dataset creation, curation and classification: Setswana and Sepedi -- Extended Abstract [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi. We propose baselines for classification, and investigate an approach on data augmentation better suited to low-resourced languages.
arXiv Detail & Related papers (2020-03-30T18:03:15Z)
Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi. We also create a news topic classification task. We investigate an approach on data augmentation, better suited to low resource languages.
arXiv Detail & Related papers (2020-02-18T13:58:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.