CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training
- URL: http://arxiv.org/abs/2110.07731v1
- Date: Thu, 14 Oct 2021 21:23:01 GMT
- Title: CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training
- Authors: Patrick Huber, Armen Aghajanyan, Barlas Oğuz, Dmytro Okhonko,
Wen-tau Yih, Sonal Gupta, Xilun Chen
- Abstract summary: We propose a novel question-answering dataset based on the Common Crawl project in this paper.
We extract around 130 million multilingual question-answer pairs, including about 60 million English data-points.
With this previously unseen number of natural QA pairs, we pre-train popular language models to show the potential of large-scale in-domain pre-training for the task of question-answering.
- Score: 21.07506671340319
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rise of large-scale pre-trained language models, open-domain
question-answering (ODQA) has become an important research topic in NLP. Based
on the popular pre-training fine-tuning approach, we posit that an additional
in-domain pre-training stage using a large-scale, natural, and diverse
question-answering (QA) dataset can be beneficial for ODQA. Consequently, we
propose a novel QA dataset based on the Common Crawl project in this paper.
Using the readily available schema.org annotation, we extract around 130
million multilingual question-answer pairs, including about 60 million English
data-points. With this previously unseen number of natural QA pairs, we
pre-train popular language models to show the potential of large-scale
in-domain pre-training for the task of question-answering. In our experiments,
we find that pre-training question-answering models on our Common Crawl
Question Answering dataset (CCQA) achieves promising results in zero-shot, low
resource and fine-tuned settings across multiple tasks, models and benchmarks.
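The extraction pipeline itself is not described in this abstract, but the core idea of mining schema.org QA annotations can be sketched in a few lines. The snippet below is a minimal illustration (names such as extract_qa_pairs and JSONLD_RE are illustrative, not from the paper): it pulls Question/acceptedAnswer pairs out of JSON-LD markup in a single HTML page, whereas the actual CCQA pipeline runs over Common Crawl WARC archives and handles the broader schema.org vocabulary.

import json
import re

# JSON-LD script blocks are a common place for schema.org QA markup to live.
JSONLD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_qa_pairs(html: str):
    """Return (question, answer) text pairs found in schema.org Question objects."""
    pairs = []
    for block in JSONLD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed markup is common in web data; skip it
        stack = data if isinstance(data, list) else [data]
        while stack:
            node = stack.pop()
            if not isinstance(node, dict):
                continue
            if node.get("@type") == "Question":
                question = node.get("name") or node.get("text") or ""
                answer = node.get("acceptedAnswer") or {}
                answer_text = answer.get("text", "") if isinstance(answer, dict) else ""
                if question and answer_text:
                    pairs.append((question.strip(), answer_text.strip()))
            # FAQPage/QAPage documents nest their Question objects under mainEntity.
            children = node.get("mainEntity", [])
            stack.extend(children if isinstance(children, list) else [children])
    return pairs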
Related papers
- QASnowball: An Iterative Bootstrapping Framework for High-Quality
Question-Answering Data Generation [67.27999343730224]
We introduce an iterative bootstrapping framework for QA data augmentation (named QASnowball).
QASnowball can iteratively generate large-scale, high-quality QA data from a seed set of supervised examples.
We conduct experiments in the high-resource English scenario and the medium-resource Chinese scenario, and the results show that the data generated by QASnowball can improve QA models.
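The generate-then-filter loop that QASnowball describes can be sketched generically. In the sketch below, snowball, generate, and score are placeholder callables standing in for the paper's seeded generator and quality filter, not its actual components.

from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)

def snowball(seed: List[QAPair],
             generate: Callable[[List[QAPair]], List[QAPair]],
             score: Callable[[QAPair], float],
             rounds: int = 3,
             threshold: float = 0.8) -> List[QAPair]:
    """Grow a QA dataset by alternating candidate generation and quality filtering."""
    pool = list(seed)
    for _ in range(rounds):
        candidates = generate(pool)  # propose new pairs conditioned on the current pool
        accepted = [p for p in candidates if score(p) >= threshold]  # keep high-quality candidates
        if not accepted:
            break  # stop early if nothing passes the filter
        pool.extend(accepted)
    return pool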
arXiv Detail & Related papers (2023-09-19T05:20:36Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models in a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
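QAmeleon itself prompt-tunes a PLM; as a rough stand-in, the sketch below formats a handful of per-language demonstrations as an in-context prompt for an arbitrary generation function. The prompt layout and generate_fn are assumptions for illustration, not the paper's setup.

from typing import Callable, List, Tuple

def build_prompt(examples: List[Tuple[str, str, str]], passage: str) -> str:
    """Format (passage, question, answer) demonstrations plus a new passage into one prompt."""
    parts = [f"Passage: {c}\nQuestion: {q}\nAnswer: {a}\n" for c, q, a in examples]
    parts.append(f"Passage: {passage}\nQuestion:")
    return "\n".join(parts)

def synthesize_qa(generate_fn: Callable[[str], str],
                  examples: List[Tuple[str, str, str]],
                  passages: List[str]) -> List[Tuple[str, str]]:
    """Ask the model to complete each prompt with a question and an answer."""
    synthetic = []
    for passage in passages:
        completion = generate_fn(build_prompt(examples, passage))
        if "Answer:" in completion:
            question, answer = completion.split("Answer:", 1)
            synthetic.append((question.strip(), answer.strip()))
    return synthetic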
arXiv Detail & Related papers (2022-11-15T16:14:39Z)
- Few-shot Multi-hop Question Answering over Knowledge Base [0.0]
This paper proposes an efficient pipeline method equipped with a pre-trained language model and a strategy to construct artificial training samples.
We evaluate our model on the CCKS 2019 Complex Question Answering via Knowledge Base task and achieve an F1-score of 62.55% on the test dataset.
arXiv Detail & Related papers (2021-12-14T00:56:54Z)
- Few-Shot Question Answering by Pretraining Span Selection [58.31911597824848]
We explore the more realistic few-shot setting, where only a few hundred training examples are available.
We show that standard span selection models perform poorly, highlighting the fact that current pretraining objectives are far removed from question answering.
Our findings indicate that careful design of pretraining schemes and model architecture can have a dramatic effect on performance in the few-shot setting.
arXiv Detail & Related papers (2021-01-02T11:58:44Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, iteratively refining the data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)
- MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering Models [25.398047573530985]
Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus.
This paper presents MultiReQA, a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets.
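Sentence-level retrieval QA of the kind MultiReQA evaluates is commonly approached with embedding-based retrievers. The sketch below shows only the generic scoring step, with embed_question and embed_sentence as placeholder encoders rather than anything from the benchmark.

from typing import Callable, List, Tuple
import numpy as np

def retrieve_answers(question: str,
                     corpus: List[str],
                     embed_question: Callable[[str], np.ndarray],
                     embed_sentence: Callable[[str], np.ndarray],
                     top_k: int = 5) -> List[Tuple[str, float]]:
    """Rank candidate sentences by cosine similarity to the question embedding."""
    q = embed_question(question)
    q = q / np.linalg.norm(q)
    scored = []
    for sentence in corpus:
        s = embed_sentence(sentence)
        scored.append((sentence, float(q @ (s / np.linalg.norm(s)))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]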
arXiv Detail & Related papers (2020-05-05T21:30:16Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
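The template idea can be illustrated directly: replace the chosen answer span in a retrieved sentence with a wh-word selected by a coarse answer type. The single template and type mapping below are illustrative only; the paper's templates differ.

# Coarse answer types mapped to wh-words (illustrative, not the paper's mapping).
WH_BY_TYPE = {"PERSON": "who", "DATE": "when", "LOC": "where", "DEFAULT": "what"}

def make_question(sentence: str, answer: str, answer_type: str = "DEFAULT") -> str:
    """Turn a sentence containing `answer` into a pseudo question for QA training."""
    wh = WH_BY_TYPE.get(answer_type, WH_BY_TYPE["DEFAULT"])
    return sentence.replace(answer, wh, 1).rstrip(" .") + "?"

# e.g. make_question("Marie Curie discovered polonium in 1898.", "Marie Curie", "PERSON")
# -> "who discovered polonium in 1898?"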
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.