LIQUID: A Framework for List Question Answering Dataset Generation
- URL: http://arxiv.org/abs/2302.01691v2
- Date: Mon, 6 Feb 2023 08:04:56 GMT
- Title: LIQUID: A Framework for List Question Answering Dataset Generation
- Authors: Seongyun Lee, Hyunjae Kim, Jaewoo Kang
- Abstract summary: We propose LIQUID, an automated framework for generating list QA datasets from unlabeled corpora.
We first convert a passage from Wikipedia or PubMed into a summary and extract named entities from the summarized text as candidate answers.
We then create questions using an off-the-shelf question generator with the extracted entities and original passage.
Using our synthetic data, we significantly improve the performance of the previous best list QA models by exact-match F1 scores of 5.0 on MultiSpanQA, 1.9 on Quoref, and 2.8 averaged across three BioASQ benchmarks.
- Score: 17.86721740779611
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Question answering (QA) models often rely on large-scale training datasets,
which necessitates the development of a data generation framework to reduce the
cost of manual annotations. Although several recent studies have aimed to
generate synthetic questions with single-span answers, no study has been
conducted on the creation of list questions with multiple, non-contiguous spans
as answers. To address this gap, we propose LIQUID, an automated framework for
generating list QA datasets from unlabeled corpora. We first convert a passage
from Wikipedia or PubMed into a summary and extract named entities from the
summarized text as candidate answers. This allows us to select answers that are
semantically correlated in context and are, therefore, suitable for constructing
list questions. We then create questions using an off-the-shelf question
generator with the extracted entities and original passage. Finally, iterative
filtering and answer expansion are performed to ensure the accuracy and
completeness of the answers. Using our synthetic data, we significantly improve
the performance of the previous best list QA models by exact-match F1 scores of
5.0 on MultiSpanQA, 1.9 on Quoref, and 2.8 averaged across three BioASQ
benchmarks.
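The four stages described in the abstract (summarization, entity extraction, question generation, then iterative filtering and answer expansion) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `generate_list_qa`, its callable parameters, the `max_rounds` limit, and the simple "keep what a reader model agrees with, add what it additionally finds" rule are all assumptions made for this example, since the abstract does not specify how filtering or expansion is carried out.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical component signatures; any real models could be plugged in.
Summarizer = Callable[[str], str]
EntityExtractor = Callable[[str], List[str]]
QuestionGenerator = Callable[[str, List[str]], str]  # (passage, answers) -> question
Reader = Callable[[str, str], List[str]]             # (question, passage) -> answers


def generate_list_qa(
    passage: str,
    summarize: Summarizer,
    extract_entities: EntityExtractor,
    generate_question: QuestionGenerator,
    read: Reader,
    max_rounds: int = 3,
) -> Optional[Tuple[str, List[str]]]:
    """Return (question, answers) for one passage, or None if no list QA pair results."""
    # Stage 1: summarize, so that extracted entities tend to be related to each other.
    summary = summarize(passage)

    # Stage 2: entities from the summary become candidate answers.
    # Deduplicate while keeping order, and keep only spans that occur in the passage.
    candidates = [e for e in dict.fromkeys(extract_entities(summary)) if e in passage]
    if len(candidates) < 2:
        return None  # a list question needs at least two answers

    # Stage 3: generate a question from the candidates and the original passage.
    question = generate_question(passage, candidates)

    # Stage 4 (assumed rule): iterate filtering and expansion until the set is stable.
    answers = candidates
    for _ in range(max_rounds):
        predicted = read(question, passage)
        kept = [a for a in answers if a in predicted]                      # filtering
        added = [p for p in predicted if p not in kept and p in passage]   # expansion
        updated = kept + added
        if updated == answers:
            break
        answers = updated

    return (question, answers) if len(answers) >= 2 else None


if __name__ == "__main__":
    # Toy stand-ins for the real models, used only to show the call shape.
    text = "Aspirin, ibuprofen and naproxen are NSAIDs. Paracetamol is not."
    print(
        generate_list_qa(
            text,
            summarize=lambda p: p.split(". ")[0] + ".",
            extract_entities=lambda s: ["Aspirin", "ibuprofen"],
            generate_question=lambda p, a: "Which drugs are NSAIDs?",
            read=lambda q, p: ["Aspirin", "ibuprofen", "naproxen"],
        )
    )
```

Passing the models in as callables keeps the sketch independent of any particular summarizer, NER tagger, question generator, or reader; in this toy run the reader supplies an answer ("naproxen") that entity extraction missed, which is the kind of case answer expansion is meant to cover.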
Related papers
- PCoQA: Persian Conversational Question Answering Dataset [12.07607688189035]
The PCoQA dataset is a resource comprising information-seeking dialogs encompassing a total of 9,026 contextually-driven questions.
PCoQA is designed to present novel challenges compared to previous question answering datasets.
This paper not only presents the comprehensive PCoQA dataset but also reports the performance of various benchmark models.
arXiv Detail & Related papers (2023-12-07T15:29:34Z)
- A Lightweight Method to Generate Unanswerable Questions in English [18.323248259867356]
We examine a simpler data augmentation method for unanswerable question generation in English.
We perform antonym and entity swaps on answerable questions.
Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models.
arXiv Detail & Related papers (2023-10-30T10:14:52Z)
- Improving Question Generation with Multi-level Content Planning [70.37285816596527]
This paper addresses the problem of generating questions from a given context and an answer, specifically focusing on questions that require multi-hop reasoning across an extended context.
We propose MultiFactor, a novel QG framework based on multi-level content planning. Specifically, MultiFactor includes two components: FA-model, which simultaneously selects key phrases and generates full answers, and Q-model, which takes the generated full answer as an additional input to generate questions.
arXiv Detail & Related papers (2023-10-20T13:57:01Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Activity report analysis with automatic single or multispan answer extraction [0.21485350418225244]
We create a new smart home environment dataset comprised of questions paired with single-span or multi-span answers depending on the question and context queried.
Our experiments show that the proposed model outperforms state-of-the-art QA models on our dataset.
arXiv Detail & Related papers (2022-09-09T06:33:29Z)
- ListReader: Extracting List-form Answers for Opinion Questions [18.50111430378249]
ListReader is a neural extractive QA model for list-form answers.
In addition to learning the alignment between the question and content, we introduce a heterogeneous graph neural network.
Our model adopts a co-extraction setting that can extract either span- or sentence-level answers.
arXiv Detail & Related papers (2021-10-22T10:33:08Z)
- GooAQ: Open Question Answering with Diverse Answer Types [63.06454855313667]
We present GooAQ, a large-scale dataset with a variety of answer types.
This dataset contains over 5 million questions and 3 million answers collected from Google.
arXiv Detail & Related papers (2021-04-18T05:40:39Z)
- FeTaQA: Free-form Table Question Answering [33.018256483762386]
We introduce FeTaQA, a new dataset with 10K Wikipedia-based (table, question, free-form answer, supporting table cells) pairs.
FeTaQA yields a more challenging table question answering setting because it requires generating free-form text answers after retrieval, inference, and integration of multiple discontinuous facts from a structured knowledge source.
arXiv Detail & Related papers (2021-04-01T09:59:40Z)
- Open Question Answering over Tables and Text [55.8412170633547]
In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question.
Most open QA systems have considered only retrieving information from unstructured text.
We present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task.
arXiv Detail & Related papers (2020-10-20T16:48:14Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.