A Pipeline for Generating, Annotating and Employing Synthetic Data for
Real World Question Answering
- URL: http://arxiv.org/abs/2211.16971v1
- Date: Wed, 30 Nov 2022 13:24:30 GMT
- Title: A Pipeline for Generating, Annotating and Employing Synthetic Data for
Real World Question Answering
- Authors: Matthew Maufe, James Ravenscroft, Rob Procter, Maria Liakata
- Abstract summary: Question Answering (QA) is a growing area of research, often used to facilitate the extraction of information from within documents.
We demonstrate that synthetic domain-specific datasets can be generated easily using domain-general models.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Question Answering (QA) is a growing area of research, often used to
facilitate the extraction of information from within documents.
State-of-the-art QA models are usually pre-trained on domain-general corpora
like Wikipedia and thus tend to struggle on out-of-domain documents without
fine-tuning. We demonstrate that synthetic domain-specific datasets can be
generated easily using domain-general models, while still providing significant
improvements to QA performance. We present two new tools for this task: a
flexible pipeline for validating the synthetic QA data and training downstream
models on it, and an online interface to facilitate human annotation of this
generated data. Using this interface, crowdworkers labelled 1117 synthetic QA
pairs, which we then used to fine-tune downstream models and improve
domain-specific QA performance by 8.75 F1.
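The reported 8.75 F1 gain refers to the token-overlap F1 score standard in extractive QA evaluation (as in the SQuAD evaluation protocol). A minimal sketch of how that metric is computed; this is an illustration of the metric, not the authors' pipeline code:

```python
from collections import Counter

def qa_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span,
    as used in SQuAD-style extractive QA evaluation."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection of tokens shared by prediction and gold.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(qa_f1("30 November 2022", "November 2022"))  # -> 0.8
```

Full benchmark scores average this per-question F1 over the dataset (usually taking the maximum over multiple gold answers).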
Related papers
- Graph Guided Question Answer Generation for Procedural Question-Answering [29.169773816553153]
We introduce a method for generating exhaustive and high-quality training data for task-specific question answering (QA) models.
The key technological enabler is a novel mechanism for automatic question-answer generation from procedural text.
We show that small models trained with our data achieve excellent performance on the target QA task, even exceeding that of GPT-3 and ChatGPT.
arXiv Detail & Related papers (2024-01-24T17:01:42Z)
- Building Interpretable and Reliable Open Information Retriever for New Domains Overnight [67.03842581848299]
Information retrieval is a critical component for many downstream tasks such as open-domain question answering (QA).
We propose an information retrieval pipeline that uses an entity/event linking model and a query decomposition model to focus more accurately on different information units of the query.
We show that, while being more interpretable and reliable, our proposed pipeline significantly improves passage coverages and denotation accuracies across five IR and QA benchmarks.
arXiv Detail & Related papers (2023-08-09T07:47:17Z)
- Long-Tailed Question Answering in an Open World [46.67715607552547]
We define Open Long-Tailed QA (OLTQA) as learning from data with a long-tailed distribution.
We propose an OLTQA model that encourages knowledge sharing between head, tail and unseen tasks.
On a large-scale OLTQA dataset, our model consistently outperforms the state-of-the-art.
arXiv Detail & Related papers (2023-05-11T04:28:58Z)
- Chain-of-Skills: A Configurable Model for Open-domain Question Answering [79.8644260578301]
The retrieval model is an indispensable component for real-world knowledge-intensive tasks.
Recent work focuses on customized methods, limiting the model transferability and scalability.
We propose a modular retriever where individual modules correspond to key skills that can be reused across datasets.
arXiv Detail & Related papers (2023-05-04T20:19:39Z)
- Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks [54.306234256074255]
We identify the issue of tokenization inconsistency that is commonly neglected in training generative models.
This issue damages the extractive nature of these tasks when the input and output are tokenized inconsistently.
We show that, with consistent tokenization, the model performs better in both in-domain and out-of-domain datasets.
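The inconsistency described in this entry can be illustrated with a toy whitespace tokenizer; the tokenizer and example sentence below are illustrative assumptions, not the paper's actual setup:

```python
def tokenize(text):
    """Toy whitespace tokenizer: punctuation stays attached to words."""
    return text.split()

def contains_subsequence(seq, sub):
    """True if `sub` occurs as a contiguous run inside `seq`."""
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))

context = "The dataset was released in 2022."
answer = "2022"

ctx_tokens = tokenize(context)  # [..., 'in', '2022.']
ans_tokens = tokenize(answer)   # ['2022']

# Inconsistent: tokenized on its own, the answer ('2022') never appears
# as a token of the context ('2022.'), so a generative model trained to
# emit ans_tokens can no longer simply copy them from its input.
assert not contains_subsequence(ctx_tokens, ans_tokens)

# Consistent: take the answer's tokens from the context tokenization
# itself, so input and output agree.
consistent_ans_tokens = [t for t in ctx_tokens if t.startswith(answer)]
assert contains_subsequence(ctx_tokens, consistent_ans_tokens)
```

Real subword tokenizers exhibit the same mismatch whenever an answer span is tokenized independently of its surrounding context.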
arXiv Detail & Related papers (2022-12-19T23:33:21Z)
- One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers [96.51828911883456]
Unsupervised sim-to-real domain adaptation (UDA) for semantic segmentation aims to improve the real-world test performance of a model trained on simulated data.
Traditional UDA often assumes that there are abundant unlabeled real-world data samples available during training for the adaptation.
We explore the one-shot unsupervised sim-to-real domain adaptation (OSUDA) and generalization problem, where only one real-world data sample is available.
arXiv Detail & Related papers (2022-12-14T15:54:15Z)
- Contrastive Domain Adaptation for Question Answering using Limited Text Corpora [20.116147632481983]
We propose a novel framework for domain adaptation called contrastive domain adaptation for QA.
Specifically, CAQA combines techniques from question generation and domain-invariant learning to answer out-of-domain questions in settings with limited text corpora.
arXiv Detail & Related papers (2021-08-31T14:05:55Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
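The template-based question generation described in the last entry can be sketched in a few lines; the patterns and question wordings below are invented for illustration and are not the paper's actual templates:

```python
import re

# Toy templates: each maps a sentence pattern to a question wording.
# Patterns and phrasings are illustrative, not taken from the paper.
TEMPLATES = [
    # "X was born in Y." -> "Where was X born?", answer Y
    (re.compile(r"^(?P<subj>.+?) was born in (?P<ans>.+?)\.$"),
     "Where was {subj} born?"),
    # "X was founded in Y." -> "When was X founded?", answer Y
    (re.compile(r"^(?P<subj>.+?) was founded in (?P<ans>.+?)\.$"),
     "When was {subj} founded?"),
]

def generate_qa(sentence):
    """Return a (question, answer) pair from the first matching template,
    or None if no template applies to the sentence."""
    for pattern, question in TEMPLATES:
        m = pattern.match(sentence)
        if m:
            return question.format(subj=m.group("subj")), m.group("ans")
    return None

print(generate_qa("Marie Curie was born in Warsaw."))
# -> ('Where was Marie Curie born?', 'Warsaw')
```

Pseudo-training pairs produced this way can then be filtered and used to fine-tune a downstream QA model, as in the pipeline described in this paper's abstract.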
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.