QASnowball: An Iterative Bootstrapping Framework for High-Quality
Question-Answering Data Generation
- URL: http://arxiv.org/abs/2309.10326v2
- Date: Wed, 20 Sep 2023 01:57:10 GMT
- Title: QASnowball: An Iterative Bootstrapping Framework for High-Quality
Question-Answering Data Generation
- Authors: Kunlun Zhu, Shihao Liang, Xu Han, Zhi Zheng, Guoyang Zeng, Zhiyuan
Liu, Maosong Sun
- Abstract summary: We introduce an iterative bootstrapping framework for QA data augmentation (named QASnowball)
QASnowball can iteratively generate large-scale high-quality QA data based on a seed set of supervised examples.
We conduct experiments in the high-resource English scenario and the medium-resource Chinese scenario, and the experimental results show that the data generated by QASnowball can facilitate QA models.
- Score: 67.27999343730224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed the success of question answering (QA),
especially its potential to be a foundation paradigm for tackling diverse NLP
tasks. However, obtaining sufficient data to build an effective and stable QA
system still remains an open problem. For this problem, we introduce an
iterative bootstrapping framework for QA data augmentation (named QASnowball),
which can iteratively generate large-scale high-quality QA data based on a seed
set of supervised examples. Specifically, QASnowball consists of three modules,
an answer extractor to extract core phrases in unlabeled documents as candidate
answers, a question generator to generate questions based on documents and
candidate answers, and a QA data filter to filter out high-quality QA data.
Moreover, QASnowball can be self-enhanced by reseeding the seed set to
fine-tune itself in different iterations, leading to continual improvements in
the generation quality. We conduct experiments in the high-resource English
scenario and the medium-resource Chinese scenario, and the experimental results
show that the data generated by QASnowball can facilitate QA models: (1)
training models on the generated data achieves comparable results to using
supervised data, and (2) pre-training on the generated data and fine-tuning on
supervised data can achieve better performance. Our code and generated data
will be released to advance further work.
Related papers
- A Lightweight Method to Generate Unanswerable Questions in English [18.323248259867356]
We examine a simpler data augmentation method for unanswerable question generation in English.
We perform antonym and entity swaps on answerable questions.
Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models.
arXiv Detail & Related papers (2023-10-30T10:14:52Z) - An Empirical Comparison of LM-based Question and Answer Generation
Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z) - PAXQA: Generating Cross-lingual Question Answering Examples at Training
Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z) - Improving Question Answering with Generation of NQ-like Questions [12.276281998447079]
Question Answering (QA) systems require a large amount of annotated data which is costly and time-consuming to gather.
We propose an algorithm to automatically generate shorter questions resembling day-to-day human communication in the Natural Questions (NQ) dataset from longer trivia questions in Quizbowl (QB) dataset.
arXiv Detail & Related papers (2022-10-12T21:36:20Z) - Improving Unsupervised Question Answering via Summarization-Informed
Question Generation [47.96911338198302]
Question Generation (QG) is the task of generating a plausible question for a passage, answer> pair.
We make use of freely available news summary data, transforming declarative sentences into appropriate questions using dependency parsing, named entity recognition and semantic role labeling.
The resulting questions are then combined with the original news articles to train an end-to-end neural QG model.
arXiv Detail & Related papers (2021-09-16T13:08:43Z) - Generating Diverse and Consistent QA pairs from Contexts with
Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z) - Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA)
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named as RefQA)
Second, we take advantage of the QA model to extract more appropriate answers, which iteratively refines data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z) - Template-Based Question Generation from Retrieved Sentences for Improved
Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.