Importance of Synthesizing High-quality Data for Text-to-SQL Parsing
- URL: http://arxiv.org/abs/2212.08785v1
- Date: Sat, 17 Dec 2022 02:53:21 GMT
- Title: Importance of Synthesizing High-quality Data for Text-to-SQL Parsing
- Authors: Yiyun Zhao, Jiarong Jiang, Yiqun Hu, Wuwei Lan, Henry Zhu, Anuj
Chauhan, Alexander Li, Lin Pan, Jun Wang, Chung-Wei Hang, Sheng Zhang, Marvin
Dong, Joe Lilien, Patrick Ng, Zhiguo Wang, Vittorio Castelli, Bing Xiang
- Abstract summary: State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from schema, imposes strong typing, and conducts schema-distance-weighted column sampling.
- Score: 71.02856634369174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been increasing interest in synthesizing data to improve
downstream text-to-SQL tasks. In this paper, we first examined the existing
synthesized datasets and discovered that state-of-the-art text-to-SQL
algorithms did not further improve on popular benchmarks when trained with
augmented synthetic data. We observed two shortcomings: illogical synthetic SQL
queries from independent column sampling and arbitrary table joins. To address
these issues, we propose a novel synthesis framework that incorporates key
relationships from schema, imposes strong typing, and conducts
schema-distance-weighted column sampling. We also adopt an intermediate
representation (IR) for the SQL-to-text task to further improve the quality of
the generated natural language questions. When existing powerful semantic
parsers are pre-finetuned on our high-quality synthesized data, our experiments
show that these models have significant accuracy boosts on popular benchmarks,
including new state-of-the-art performance on Spider.
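
The abstract names schema-distance-weighted column sampling and strong typing but does not spell out the procedure. The following Python sketch is one possible reading, not the authors' implementation. The toy schema, the foreign-key list, the `decay` parameter, and the function names are all hypothetical. It assumes that schema distance is the number of foreign-key hops between tables, and that a column's sampling weight shrinks geometrically with that distance, so columns from closely related tables are chosen more often than columns from distant ones.

```python
import random
from collections import deque

# Hypothetical toy schema: table -> {column: coarse type}.
SCHEMA = {
    "singer": {"singer_id": "number", "name": "text", "age": "number"},
    "singer_in_concert": {"singer_id": "number", "concert_id": "number"},
    "concert": {"concert_id": "number", "venue_id": "number", "year": "number"},
    "venue": {"venue_id": "number", "city": "text"},
}

# Hypothetical foreign-key links, treated as undirected edges between tables.
FOREIGN_KEYS = [
    ("singer_in_concert", "singer"),
    ("singer_in_concert", "concert"),
    ("concert", "venue"),
]


def table_distances(start):
    """Breadth-first search: foreign-key hop count from `start` to each reachable table."""
    neighbors = {table: set() for table in SCHEMA}
    for a, b in FOREIGN_KEYS:
        neighbors[a].add(b)
        neighbors[b].add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for nxt in neighbors[table]:
            if nxt not in dist:
                dist[nxt] = dist[table] + 1
                queue.append(nxt)
    return dist


def column_weights(anchor_table, decay=0.5, required_type=None):
    """Weight each column by decay ** distance (an assumed weighting scheme).

    `required_type` is a simple stand-in for strong typing: columns whose
    type does not match are excluded from sampling altogether.
    """
    weights = {}
    for table, d in table_distances(anchor_table).items():
        for column, col_type in SCHEMA[table].items():
            if required_type is not None and col_type != required_type:
                continue
            weights[(table, column)] = decay ** d
    return weights


def sample_column(anchor_table, rng, decay=0.5, required_type=None):
    """Draw one (table, column) pair with probability proportional to its weight."""
    weights = column_weights(anchor_table, decay, required_type)
    candidates = list(weights)
    return rng.choices(candidates, weights=[weights[c] for c in candidates], k=1)[0]
```

For example, `sample_column("singer", random.Random(0), required_type="text")` can only return `("singer", "name")` or `("venue", "city")`, and under the assumed decay of 0.5 the first is eight times as likely as the second because `venue` is three foreign-key hops away from `singer`.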
Related papers
- Synthesizing Text-to-SQL Data from Weak and Strong LLMs [68.69270834311259]
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks.
We introduce a synthetic data approach that combines data produced by larger, more powerful models with error information data generated by smaller, not well-aligned models.
arXiv Detail & Related papers (2024-08-06T15:40:32Z)
- Diverse Parallel Data Synthesis for Cross-Database Adaptation of Text-to-SQL Parsers [21.272952382662215]
Adapting to new databases is a challenging problem due to the lack of natural language queries in the new schemas.
We present ReFill, a framework for adapting a Text-to-SQL parser to a target schema.
arXiv Detail & Related papers (2022-10-29T14:30:53Z)
- Augmenting Multi-Turn Text-to-SQL Datasets with Self-Play [46.07002748587857]
We explore augmenting the training datasets using self-play, which leverages contextual information to synthesize new interactions.
We find that self-play improves the accuracy of a strong baseline on SParC and CoSQL, two widely used text-to-SQL datasets.
arXiv Detail & Related papers (2022-10-21T16:40:07Z)
- SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural-network-based approaches (called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z)
- S$^2$SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers [66.78665327694625]
We propose S$^2$SQL, which injects syntax into the question-schema interaction graph encoder for Text-to-SQL parsers.
We also employ the decoupling constraint to induce diverse edge embedding, which further improves the network's performance.
Experiments on the Spider and robustness setting Spider-Syn demonstrate that the proposed approach outperforms all existing methods when pre-training models are used.
arXiv Detail & Related papers (2022-03-14T09:49:15Z)
- Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.