MK-SQuIT: Synthesizing Questions using Iterative Template-filling
- URL: http://arxiv.org/abs/2011.02566v1
- Date: Wed, 4 Nov 2020 22:33:05 GMT
- Authors: Benjamin A. Spiegel, Vincent Cheong, James E. Kaplan, Anthony Sanchez
- Abstract summary: We create a framework for synthetically generating question/query pairs with as little human input as possible.
These datasets can be used to train machine translation systems to convert natural language questions into queries.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The aim of this work is to create a framework for synthetically generating
question/query pairs with as little human input as possible. These datasets can
be used to train machine translation systems to convert natural language
questions into queries, a useful tool that could allow for more natural access
to database information. Existing methods of dataset generation require human
input that scales linearly with the size of the dataset, resulting in small
datasets. Aside from a short initial configuration task, no human input is
required during the query generation process of our system. We leverage
WikiData, a knowledge base of RDF triples, as a source for generating the main
content of questions and queries. Using multiple layers of question templating
we are able to sidestep some of the most challenging parts of query generation
that have been handled by humans in previous methods; humans never have to
modify, aggregate, inspect, annotate, or generate any questions or queries at
any step in the process. Our system is easily configurable to multiple domains
and can be modified to generate queries in natural languages other than
English. We also present an example dataset of 110,000 question/query pairs
across four WikiData domains. We then present a baseline model that we train
using the dataset which shows promise in a commercial QA setting.
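The pipeline the abstract describes can be sketched in miniature: sample an RDF triple, fill a layered question template, and emit the paired query. Everything below is a hypothetical illustration under assumed names; the triples, the skeleton, and the SPARQL shape are hand-written stand-ins, not the paper's actual templates or WikiData identifiers.

```python
import random

# Stand-in (subject, predicate, object) triples; the real system draws
# these from WikiData rather than a hand-written list.
TRIPLES = [
    ("Ada Lovelace", "place of birth", "London"),
    ("Marie Curie", "place of birth", "Warsaw"),
]

# One templating layer: a question skeleton with slots. The paper uses
# multiple layers of templates; a single layer is shown for brevity.
QUESTION_SKELETONS = ["What is the [PREDICATE] of [SUBJECT]?"]

def generate_pair(triple):
    """Fill a skeleton from a triple, yielding a question/query pair."""
    subject, predicate, _ = triple
    skeleton = random.choice(QUESTION_SKELETONS)
    question = (
        skeleton.replace("[PREDICATE]", predicate)
                .replace("[SUBJECT]", subject)
    )
    # Illustrative SPARQL shape only; real WikiData queries use P/Q ids.
    query = (
        "SELECT ?obj WHERE { "
        f'?s rdfs:label "{subject}" . '
        f'?s :{predicate.replace(" ", "_")} ?obj . '
        "}"
    )
    return question, query

question, query = generate_pair(TRIPLES[0])
```

Because slot-filling is mechanical, no human needs to inspect or annotate the output, which is what lets dataset size decouple from human effort.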
Related papers
- Text2SQL is Not Enough: Unifying AI and Databases with TAG [47.45480855418987]
Table-Augmented Generation (TAG) is a paradigm for answering natural language questions over databases.
We develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly.
arXiv Detail & Related papers (2024-08-27T00:50:14Z)
- A Lightweight Method to Generate Unanswerable Questions in English [18.323248259867356]
We examine a simpler data augmentation method for unanswerable question generation in English.
We perform antonym and entity swaps on answerable questions.
Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models.
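The entity-swap idea from the summary above can be illustrated with a toy function: replace an entity in an answerable question with a different entity so the supporting passage no longer contains the answer. The swap table here is hand-written for illustration; it is an assumption, not the paper's pipeline, which derives swap candidates automatically.

```python
# Hypothetical entity-swap table; a real pipeline would mine candidate
# replacements from the corpus rather than hard-code them.
ENTITY_SWAPS = {"Paris": "Berlin", "Einstein": "Bohr"}

def make_unanswerable(question):
    """Swap the first matching entity to break the question/passage link."""
    for entity, replacement in ENTITY_SWAPS.items():
        if entity in question:
            return question.replace(entity, replacement)
    return question  # no known entity found; question left unchanged

swapped = make_unanswerable("When did Einstein move to Paris?")
```

Antonym swaps work the same way, substituting a word with its antonym instead of substituting one entity for another.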
arXiv Detail & Related papers (2023-10-30T10:14:52Z)
- A Practical Toolkit for Multilingual Question and Answer Generation [79.31199020420827]
We introduce AutoQG, an online service for multilingual QAG, along with lmqg, an all-in-one Python package for model fine-tuning, generation, and evaluation.
We also release QAG models in eight languages fine-tuned on a few variants of pre-trained encoder-decoder language models, which can be used online via AutoQG or locally via lmqg.
arXiv Detail & Related papers (2023-05-27T08:42:37Z)
- QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Dual Reader-Parser on Hybrid Textual and Tabular Evidence for Open Domain Question Answering [78.9863753810787]
A large amount of world's knowledge is stored in structured databases.
Query languages can answer questions that require complex reasoning while offering full explainability.
arXiv Detail & Related papers (2021-08-05T22:04:13Z)
- VANiLLa: Verbalized Answers in Natural Language at Large Scale [2.9098477555578333]
This dataset consists of over 100k simple questions adapted from the CSQA and SimpleQuestionsWikidata datasets.
The answer sentences in this dataset are syntactically and semantically closer to the question than to the triple fact.
arXiv Detail & Related papers (2021-05-24T16:57:54Z)
- Answering Open-Domain Questions of Varying Reasoning Steps from Text [39.48011017748654]
We develop a unified system to answer open-domain questions directly from text.
We employ a single multi-task transformer model to perform all the necessary subtasks.
We show that our model demonstrates competitive performance on both existing benchmarks and this new benchmark.
arXiv Detail & Related papers (2020-10-23T16:51:09Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Efficient Deployment of Conversational Natural Language Interfaces over Databases [45.52672694140881]
We propose a novel method for accelerating training-dataset collection when developing natural language-to-query-language machine learning models.
Our system allows one to generate conversational multi-turn data, where multiple turns define a dialogue session.
arXiv Detail & Related papers (2020-05-31T19:16:27Z)