Weakly Supervised Mapping of Natural Language to SQL through Question
Decomposition
- URL: http://arxiv.org/abs/2112.06311v1
- Date: Sun, 12 Dec 2021 20:02:42 GMT
- Title: Weakly Supervised Mapping of Natural Language to SQL through Question
Decomposition
- Authors: Tomer Wolfson, Jonathan Berant and Daniel Deutch
- Abstract summary: We propose an alternative approach for training machine learning-based NLIDBs, using weak supervision.
We use the recently proposed question decomposition representation called QDMR, an intermediate between NL and formal query languages.
Our solution, requiring zero expert annotations, performs competitively with models trained on expert annotated data.
- Score: 39.32886310973576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural Language Interfaces to Databases (NLIDBs), where users pose queries
in Natural Language (NL), are crucial for enabling non-experts to gain insights
from data. Developing such interfaces, by contrast, is dependent on experts who
often code heuristics for mapping NL to SQL. Alternatively, NLIDBs based on
machine learning models rely on supervised examples of NL to SQL mappings
(NL-SQL pairs) used as training data. Such examples are again procured using
experts, which typically involves more than a one-off interaction. Namely, each
data domain in which the NLIDB is deployed may have different characteristics
and therefore require either dedicated heuristics or domain-specific training
examples. To this end, we propose an alternative approach for training machine
learning-based NLIDBs, using weak supervision. We use the recently proposed
question decomposition representation called QDMR, an intermediate between NL
and formal query languages. Recent work has shown that non-experts are
generally successful in translating NL to QDMR. We consequently use NL-QDMR
pairs, along with the question answers, as supervision for automatically
synthesizing SQL queries. The NL questions and synthesized SQL are then used to
train NL-to-SQL models, which we test on five benchmark datasets. Extensive
experiments show that our solution, requiring zero expert annotations, performs
competitively with models trained on expert annotated data.
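The weak-supervision loop the abstract describes, i.e. using NL-QDMR pairs plus known question answers to validate automatically synthesized SQL, can be illustrated with a toy sketch. Everything below is hypothetical: the schema, the candidate queries, and the `filter_by_answer` helper are invented for illustration and are not the authors' code; the real system synthesizes candidates from QDMR steps rather than taking them as given.

```python
import sqlite3

def execute(conn, sql):
    """Run a query, returning first-column values, or None on SQL errors."""
    try:
        return [row[0] for row in conn.execute(sql)]
    except sqlite3.Error:
        return None

def filter_by_answer(conn, candidates, gold_answer):
    """Keep only candidate SQL whose execution result matches the known answer."""
    return [q for q in candidates if execute(conn, q) == gold_answer]

# Toy database standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id INTEGER, year INTEGER)")
conn.executemany("INSERT INTO papers VALUES (?, ?)", [(1, 2020), (2, 2021)])

# Two candidates, as if synthesized from a QDMR decomposition such as:
#   (1) return papers; (2) return #1 from 2021; (3) return count of #2
candidates = [
    "SELECT COUNT(*) FROM papers WHERE year = 2021",  # consistent with the answer
    "SELECT COUNT(*) FROM papers",                    # spurious candidate
]
kept = filter_by_answer(conn, candidates, gold_answer=[1])
print(kept)
```

The surviving NL-SQL pairs would then serve as training data for a standard NL-to-SQL model, replacing expert-annotated SQL.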
Related papers
- Fine-Tuning Language Models for Context-Specific SQL Query Generation [0.0]
This paper presents a novel approach to fine-tuning open-source large language models (LLMs) for the task of transforming natural language into SQL queries.
We introduce models specialized in generating SQL queries, trained on synthetic datasets tailored to the Snowflake SQL and GoogleSQL dialects.
Our methodology involves generating a context-specific dataset using GPT-4, then fine-tuning three open-source LLMs (Starcoder Plus, Code-Llama, and Mistral) employing the LoRA technique to optimize for resource constraints.
The fine-tuned models demonstrate superior performance in zero-shot settings compared to the baseline GPT-4.
arXiv Detail & Related papers (2023-12-04T18:04:27Z) - Natural language to SQL in low-code platforms [0.0]
We propose a pipeline allowing developers to write natural language (NL) queries.
We collect, label, and validate data covering the queries most often performed by OutSystems users.
We describe the entire pipeline, which comprises a feedback loop that allows us to quickly collect production data.
arXiv Detail & Related papers (2023-08-29T11:59:02Z) - ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural
Language to SQL Systems [16.33799752421288]
We introduce ScienceBenchmark, a new complex NL-to-SQL benchmark for three real-world, highly domain-specific databases.
We show that our benchmark is highly challenging, as the top performing systems on Spider achieve a very low performance on our benchmark.
arXiv Detail & Related papers (2023-06-07T19:37:55Z) - STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing [64.80483736666123]
We propose a novel pre-training framework STAR for context-dependent text-to-SQL parsing.
In addition, we construct a large-scale context-dependent text-to-SQL conversation corpus to pre-train STAR.
Extensive experiments show that STAR achieves new state-of-the-art performance on two downstream benchmarks.
arXiv Detail & Related papers (2022-10-21T11:30:07Z) - A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future
Directions [102.8606542189429]
The goal of text-to-SQL parsing is to convert a natural language (NL) question to its corresponding structured query language (SQL) query, based on the evidence provided by databases.
Deep neural networks have significantly advanced this task via neural generation models, which automatically learn a mapping function from an input NL question to an output SQL query.
arXiv Detail & Related papers (2022-08-29T14:24:13Z) - "What Do You Mean by That?" A Parser-Independent Interactive Approach
for Enhancing Text-to-SQL [49.85635994436742]
We include humans in the loop and present a novel parser-independent interactive approach (PIIA) that interacts with users through multi-choice questions.
PIIA is capable of enhancing text-to-SQL performance within limited interaction turns, as shown by both simulation and human evaluation.
arXiv Detail & Related papers (2020-11-09T02:14:33Z) - Data Agnostic RoBERTa-based Natural Language to SQL Query Generation [0.0]
The NL2SQL task aims at finding deep learning approaches to solve the problem of converting natural language questions into valid SQL queries.
We have presented an approach with data privacy at its core.
Although we have not achieved state-of-the-art results, we have eliminated the need for the table right from the training of the model.
arXiv Detail & Related papers (2020-10-11T13:18:46Z) - Photon: A Robust Cross-Domain Text-to-SQL System [189.1405317853752]
We present Photon, a robust, modular, cross-domain NLIDB that can flag natural language input to which a mapping cannot be immediately determined.
The proposed method effectively improves the robustness of the text-to-SQL system against untranslatable user input.
arXiv Detail & Related papers (2020-07-30T07:44:48Z) - ValueNet: A Natural Language-to-SQL System that Learns from Database
Information [4.788755317132195]
Building natural language interfaces for databases has been a long-standing challenge.
Recent focus of research has been on neural networks to tackle this challenge on complex datasets like Spider.
We propose two end-to-end NL-to-SQL systems that incorporate values, using the challenging Spider dataset.
arXiv Detail & Related papers (2020-05-29T15:43:39Z) - TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.