A Methodology for Creating Question Answering Corpora Using Inverse Data
Annotation
- URL: http://arxiv.org/abs/2004.07633v2
- Date: Thu, 25 Jun 2020 08:13:32 GMT
- Title: A Methodology for Creating Question Answering Corpora Using Inverse Data
Annotation
- Authors: Jan Deriu, Katsiaryna Mlynchyk, Philippe Schl\"apfer, Alvaro Rodrigo,
Dirk von Gr\"unigen, Nicolas Kaiser, Kurt Stockinger, Eneko Agirre, and Mark
Cieliebak
- Abstract summary: We introduce a novel methodology to efficiently construct a corpus for question answering over structured data.
In our method, we randomly generate Operation Trees (OTs) from a context-free grammar.
We apply the method to create a new corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus.
- Score: 16.914116942666976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce a novel methodology to efficiently construct a
corpus for question answering over structured data. For this, we introduce an
intermediate representation that is based on the logical query plan in a
database called Operation Trees (OT). This representation allows us to invert
the annotation process without losing flexibility in the types of queries that
we generate. Furthermore, it allows for fine-grained alignment of query tokens
to OT operations. In our method, we randomly generate OTs from a context-free
grammar. Afterwards, annotators have to write the appropriate natural language
question that is represented by the OT. Finally, the annotators assign the
tokens to the OT operations. We apply the method to create a new corpus OTTA
(Operation Trees and Token Assignment), a large semantic parsing corpus for
evaluating natural language interfaces to databases. We compare OTTA to Spider
and LC-QuaD 2.0 and show that our methodology more than triples the annotation
speed while maintaining the complexity of the queries. Finally, we train a
state-of-the-art semantic parsing model on our data and show that our corpus is
a challenging dataset and that the token alignment can be leveraged to increase
the performance significantly.
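The three-step pipeline described in the abstract (sample an operation tree, have an annotator write the question, then align tokens to operations) can be sketched as follows. The grammar, operation names, leaf arguments, and example alignment below are toy assumptions for illustration, not the grammar or schema used to build OTTA.

```python
import random

# Step 1 of the methodology: randomly sample an operation tree (OT)
# from a context-free grammar. Non-terminals expand via productions;
# terminals receive a concrete (toy) argument.
GRAMMAR = {
    "query":  [["projection", "source"], ["count", "source"]],
    "source": [["filter", "table"], ["join", "table", "table"], ["table"]],
}
LEAVES = {
    "projection": ["name", "population"],
    "count":      ["*"],
    "filter":     ["population > 1000000"],
    "join":       ["city.country_id = country.id"],
    "table":      ["city", "country"],
}

def sample_ot(symbol, rng):
    """Recursively expand `symbol` into an operation tree (nested dicts)."""
    if symbol in GRAMMAR:  # non-terminal: pick a random production
        production = rng.choice(GRAMMAR[symbol])
        return {"op": symbol,
                "children": [sample_ot(child, rng) for child in production]}
    # terminal operation: attach a concrete argument
    return {"op": symbol, "arg": rng.choice(LEAVES[symbol])}

tree = sample_ot("query", random.Random(0))
print(tree)

# Steps 2 and 3 are done by human annotators: they write the natural
# language question the sampled tree represents, then assign question
# tokens to the tree's operations (an invented example):
question = "how many cities have more than a million people"
alignment = {"how many": "count",
             "cities": "table",
             "more than a million people": "filter"}
```

This is what makes the annotation "inverse": the formal query (the OT) comes first and the natural language question is written second, rather than the other way around.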
Related papers
- Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning [10.731045939849125]
We focus on Text-to-SQL semantic parsing from the perspective of retrieval-augmented generation.
Motivated by challenges related to the size of commercial database schemata and the deployability of business intelligence solutions, we propose ASTReS, which dynamically retrieves input database information.
arXiv Detail & Related papers (2024-07-03T15:55:14Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset [6.633914491587503]
We propose to generate a synthetic context retrieval training dataset using Alpaca.
Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER.
We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
arXiv Detail & Related papers (2023-10-16T06:53:12Z)
- Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading [63.93888816206071]
We introduce MemWalker, a method that processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information.
We show that, beyond effective reading, MemWalker enhances explainability by highlighting its reasoning steps as it interactively reads the text, pinpointing the text segments relevant to the query.
arXiv Detail & Related papers (2023-10-08T06:18:14Z)
- STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing [64.80483736666123]
We propose STAR, a novel pre-training framework for context-dependent text-to-SQL parsing.
In addition, we construct a large-scale context-dependent text-to-SQL conversation corpus to pre-train STAR.
Extensive experiments show that STAR achieves new state-of-the-art performance on two downstream benchmarks.
arXiv Detail & Related papers (2022-10-21T11:30:07Z)
- Generating Synthetic Data for Task-Oriented Semantic Parsing with Hierarchical Representations [0.8203855808943658]
In this work, we explore the possibility of generating synthetic data for neural semantic parsing.
Specifically, we first extract masked templates from the existing labeled utterances, and then fine-tune BART to generate synthetic utterances conditioned on these templates.
We show the potential of our approach when evaluating on the navigation domain of the Facebook TOP dataset.
arXiv Detail & Related papers (2020-11-03T22:55:40Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content of the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
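To make the input/output format of that last task concrete, here is a hypothetical toy instance; all names, numbers, and texts are invented for illustration and do not come from the paper's dataset.

```python
# Toy instance of document-scale text content manipulation: the input
# pairs structured source records with a reference text written about
# *different* records; the target describes the source records in the
# reference's writing style. All data below is invented.
source_records = [
    {"player": "Lee", "points": 32, "rebounds": 7},
    {"player": "Kim", "points": 18, "rebounds": 11},
]
reference_text = ("Jordan led all scorers with 28 points, while Park "
                  "grabbed a game-high 12 rebounds.")
target_output = ("Lee led all scorers with 32 points, while Kim "
                 "grabbed a game-high 11 rebounds.")
print(target_output)
```

Note how the target reuses the reference's phrasing ("led all scorers", "game-high") but swaps in the content of the source records, which is what distinguishes this task from text style transfer.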
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.