A Pipeline for Generating, Annotating and Employing Synthetic Data for
Real World Question Answering
- URL: http://arxiv.org/abs/2211.16971v1
- Date: Wed, 30 Nov 2022 13:24:30 GMT
- Title: A Pipeline for Generating, Annotating and Employing Synthetic Data for
Real World Question Answering
- Authors: Matthew Maufe, James Ravenscroft, Rob Procter, Maria Liakata
- Abstract summary: Question Answering (QA) is a growing area of research, often used to facilitate the extraction of information from within documents.
We demonstrate that synthetic domain-specific datasets can be generated easily using domain-general models.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Question Answering (QA) is a growing area of research, often used to
facilitate the extraction of information from within documents.
State-of-the-art QA models are usually pre-trained on domain-general corpora
like Wikipedia and thus tend to struggle on out-of-domain documents without
fine-tuning. We demonstrate that synthetic domain-specific datasets can be
generated easily using domain-general models, while still providing significant
improvements to QA performance. We present two new tools for this task: a
flexible pipeline for validating the synthetic QA data and training downstream
models on it, and an online interface to facilitate human annotation of this
generated data. Using this interface, crowdworkers labelled 1117 synthetic QA
pairs, which we then used to fine-tune downstream models and improve
domain-specific QA performance by 8.75 F1.
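The reported 8.75 F1 gain refers to the token-overlap F1 score standard in extractive QA evaluation (as in the SQuAD evaluation protocol). A minimal sketch of how that metric is computed; this is an illustration of the metric, not the authors' pipeline code:

```python
from collections import Counter

def qa_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span,
    as used in SQuAD-style extractive QA evaluation."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection of tokens shared by prediction and gold.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(qa_f1("30 November 2022", "November 2022"))  # -> 0.8
```

Full benchmark scores average this per-question F1 over the dataset (usually taking the maximum over multiple gold answers).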
Related papers
- Graph Guided Question Answer Generation for Procedural Question-Answering [29.169773816553153]
We introduce a method for generating exhaustive and high-quality training data for task-specific question answering (QA) models.
The key technological enabler is a novel mechanism for automatic question-answer generation from procedural text.
We show that small models trained with our data achieve excellent performance on the target QA task, even exceeding that of GPT-3 and ChatGPT.
arXiv Detail & Related papers (2024-01-24T17:01:42Z)
- Building Interpretable and Reliable Open Information Retriever for New Domains Overnight [67.03842581848299]
Information retrieval is a critical component for many downstream tasks such as open-domain question answering (QA).
We propose an information retrieval pipeline that uses an entity/event linking model and a query decomposition model to focus more accurately on different information units of the query.
We show that, while being more interpretable and reliable, our proposed pipeline significantly improves passage coverages and denotation accuracies across five IR and QA benchmarks.
arXiv Detail & Related papers (2023-08-09T07:47:17Z)
- Long-Tailed Question Answering in an Open World [46.67715607552547]
We define Open Long-Tailed QA (OLTQA) as learning from data with a long-tailed distribution.
We propose an OLTQA model that encourages knowledge sharing between head, tail and unseen tasks.
On a large-scale OLTQA dataset, our model consistently outperforms the state-of-the-art.
arXiv Detail & Related papers (2023-05-11T04:28:58Z)
- Chain-of-Skills: A Configurable Model for Open-domain Question Answering [79.8644260578301]
The retrieval model is an indispensable component for real-world knowledge-intensive tasks.
Recent work focuses on customized methods, limiting the model transferability and scalability.
We propose a modular retriever where individual modules correspond to key skills that can be reused across datasets.
arXiv Detail & Related papers (2023-05-04T20:19:39Z)
- Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks [54.306234256074255]
We identify the issue of tokenization inconsistency that is commonly neglected in training generative models.
This issue damages the extractive nature of these tasks when the input and output are tokenized inconsistently.
We show that, with consistent tokenization, the model performs better in both in-domain and out-of-domain datasets.
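The inconsistency described in this entry can be illustrated with a toy whitespace tokenizer; the tokenizer and example sentence below are illustrative assumptions, not the paper's actual setup:

```python
def tokenize(text):
    """Toy whitespace tokenizer: punctuation stays attached to words."""
    return text.split()

def contains_subsequence(seq, sub):
    """True if `sub` occurs as a contiguous run inside `seq`."""
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))

context = "The dataset was released in 2022."
answer = "2022"

ctx_tokens = tokenize(context)  # [..., 'in', '2022.']
ans_tokens = tokenize(answer)   # ['2022']

# Inconsistent: tokenized on its own, the answer ('2022') never appears
# as a token of the context ('2022.'), so a generative model trained to
# emit ans_tokens can no longer simply copy them from its input.
assert not contains_subsequence(ctx_tokens, ans_tokens)

# Consistent: take the answer's tokens from the context tokenization
# itself, so input and output agree.
consistent_ans_tokens = [t for t in ctx_tokens if t.startswith(answer)]
assert contains_subsequence(ctx_tokens, consistent_ans_tokens)
```

Real subword tokenizers exhibit the same mismatch whenever an answer span is tokenized independently of its surrounding context.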
arXiv Detail & Related papers (2022-12-19T23:33:21Z)
- One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers [96.51828911883456]
Unsupervised sim-to-real domain adaptation (UDA) for semantic segmentation aims to improve the real-world test performance of a model trained on simulated data.
Traditional UDA often assumes that there are abundant unlabeled real-world data samples available during training for the adaptation.
We explore the one-shot unsupervised sim-to-real domain adaptation (OSUDA) and generalization problem, where only one real-world data sample is available.
arXiv Detail & Related papers (2022-12-14T15:54:15Z)
- Contrastive Domain Adaptation for Question Answering using Limited Text Corpora [20.116147632481983]
We propose a novel framework for domain adaptation called contrastive domain adaptation for QA.
Specifically, CAQA combines techniques from question generation and domain-invariant learning to answer out-of-domain questions in settings with limited text corpora.
arXiv Detail & Related papers (2021-08-31T14:05:55Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
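The template-based question generation described in the last entry can be sketched in a few lines; the patterns and question wordings below are invented for illustration and are not the paper's actual templates:

```python
import re

# Toy templates: each maps a sentence pattern to a question wording.
# Patterns and phrasings are illustrative, not taken from the paper.
TEMPLATES = [
    # "X was born in Y." -> "Where was X born?", answer Y
    (re.compile(r"^(?P<subj>.+?) was born in (?P<ans>.+?)\.$"),
     "Where was {subj} born?"),
    # "X was founded in Y." -> "When was X founded?", answer Y
    (re.compile(r"^(?P<subj>.+?) was founded in (?P<ans>.+?)\.$"),
     "When was {subj} founded?"),
]

def generate_qa(sentence):
    """Return a (question, answer) pair from the first matching template,
    or None if no template applies to the sentence."""
    for pattern, question in TEMPLATES:
        m = pattern.match(sentence)
        if m:
            return question.format(subj=m.group("subj")), m.group("ans")
    return None

print(generate_qa("Marie Curie was born in Warsaw."))
# -> ('Where was Marie Curie born?', 'Warsaw')
```

Pseudo-training pairs produced this way can then be filtered and used to fine-tune a downstream QA model, as in the pipeline described in this paper's abstract.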
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.