Structure-Grounded Pretraining for Text-to-SQL
- URL: http://arxiv.org/abs/2010.12773v3
- Date: Wed, 31 Aug 2022 00:19:41 GMT
- Title: Structure-Grounded Pretraining for Text-to-SQL
- Authors: Xiang Deng, Ahmed Hassan Awadallah, Christopher Meek, Oleksandr
Polozov, Huan Sun, Matthew Richardson
- Abstract summary: We present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL.
We identify a set of novel prediction tasks: column grounding, value grounding and column-value mapping, and leverage them to pretrain a text-table encoder.
- Score: 75.19554243393814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to capture text-table alignment is essential for tasks like
text-to-SQL. A model needs to correctly recognize natural language references
to columns and values and to ground them in the given database schema. In this
paper, we present a novel weakly supervised Structure-Grounded pretraining
framework (StruG) for text-to-SQL that can effectively learn to capture
text-table alignment based on a parallel text-table corpus. We identify a set
of novel prediction tasks: column grounding, value grounding and column-value
mapping, and leverage them to pretrain a text-table encoder. Additionally, to
evaluate different methods under more realistic text-table alignment settings,
we create a new evaluation set Spider-Realistic based on Spider dev set with
explicit mentions of column names removed, and adopt eight existing text-to-SQL
datasets for cross-database evaluation. STRUG brings significant improvement
over BERT-LARGE in all settings. Compared with existing pretraining methods
such as GRAPPA, STRUG achieves similar performance on Spider, and outperforms
all baselines on more realistic sets. The Spider-Realistic dataset is available
at https://doi.org/10.5281/zenodo.5205322.
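The three grounding objectives named in the abstract can be pictured with a small sketch. Below is a minimal, hypothetical PyTorch rendering, not the authors' released code: column grounding scores each schema column for whether the question mentions it, value grounding scores each question token for whether it is part of a cell value, and column-value mapping aligns value tokens with columns. The linear heads, the dot-product alignment, and the summed BCE loss are all assumptions for illustration.

```python
# A minimal sketch (assumptions, not the authors' released code) of the
# three grounding objectives, on top of a text-table encoder that yields
# one vector per question token and one per schema column.
import torch
import torch.nn as nn

class GroundingHeads(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.column_scorer = nn.Linear(hidden_size, 1)  # column grounding
        self.value_scorer = nn.Linear(hidden_size, 1)   # value grounding
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, token_vecs, column_vecs,
                column_labels, value_labels, map_labels):
        # Column grounding: is each column mentioned in the question?
        col_logits = self.column_scorer(column_vecs).squeeze(-1)
        # Value grounding: is each question token part of a cell value?
        val_logits = self.value_scorer(token_vecs).squeeze(-1)
        # Column-value mapping: align value tokens to columns via
        # dot-product similarity between token and column vectors.
        map_logits = token_vecs @ column_vecs.transpose(-1, -2)
        return (self.bce(col_logits, column_labels)
                + self.bce(val_logits, value_labels)
                + self.bce(map_logits, map_labels))
```

Per the abstract, the binary labels for such heads would come weakly from the parallel text-table corpus rather than from manual annotation.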
Related papers
- UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
- Augmenting Multi-Turn Text-to-SQL Datasets with Self-Play [46.07002748587857]
We explore augmenting the training datasets using self-play, which leverages contextual information to synthesize new interactions.
We find that self-play improves the accuracy of a strong baseline on SParC and CoSQL, two widely used text-to-SQL datasets.
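As a rough illustration of that loop, here is a minimal, hypothetical sketch; question_gen, parser, and the validity filter are placeholder components, not the paper's actual method:

```python
# A rough sketch (hypothetical components, not the paper's code) of
# self-play augmentation for multi-turn text-to-SQL: a generator
# proposes a follow-up question from the dialogue context, a parser
# labels it with SQL, and accepted pairs become new interactions.
def self_play_augment(seed_dialogues, question_gen, parser, n_rounds=1):
    augmented = []
    for dialogue in seed_dialogues:
        context = list(dialogue)
        for _ in range(n_rounds):
            question = question_gen(context)  # synthesize a follow-up turn
            sql = parser(context, question)   # label it with a parser
            if sql is None:                   # simple validity filter
                break
            context.append((question, sql))
            augmented.append(list(context))
    return augmented
```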
arXiv Detail & Related papers (2022-10-21T16:40:07Z)
- STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing [64.80483736666123]
We propose a novel pre-training framework STAR for context-dependent text-to-SQL parsing.
In addition, we construct a large-scale context-dependent text-to-SQL conversation corpus to pre-train STAR.
Extensive experiments show that STAR achieves new state-of-the-art performance on two downstream benchmarks.
arXiv Detail & Related papers (2022-10-21T11:30:07Z) - A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future
Directions [102.8606542189429]
The goal of text-to-SQL parsing is to convert a natural language (NL) question into its corresponding structured query language (SQL) based on the evidence provided by databases.
Deep neural networks have significantly advanced this task via neural generation models, which automatically learn a mapping function from an input NL question to an output SQL query.
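That mapping function is typically realized as a sequence-to-sequence model. A minimal sketch with an off-the-shelf T5 checkpoint follows; it is illustrative only, the input template is an assumption, and an untuned t5-small will not emit valid SQL:

```python
# Illustrative only: shows the seq2seq interface for NL-question ->
# SQL-query mapping; the "translate to SQL" template is an assumption,
# and a checkpoint must be fine-tuned before it produces real SQL.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "How many singers are older than 30?"
schema = "singer(id, name, age)"
inputs = tokenizer(f"translate to SQL: {question} | {schema}",
                   return_tensors="pt")
output = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```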
arXiv Detail & Related papers (2022-08-29T14:24:13Z) - Self-supervised Text-to-SQL Learning with Header Alignment Training [4.518012967046983]
Self-supervised learning is a de facto component in the recent success of deep learning in various fields.
We propose a novel self-supervised learning framework to tackle the discrepancy between a self-supervised learning objective and a task-specific objective.
Our method is effective for training the model with scarce labeled data.
arXiv Detail & Related papers (2021-03-11T01:09:59Z)
- GP: Context-free Grammar Pre-training for Text-to-SQL Parsers [7.652782364282768]
Grammar Pre-training (GP) is proposed to decode deep relations between the question and the database.
Experiments show that our method converges more easily during training and exhibits excellent robustness.
arXiv Detail & Related papers (2021-01-25T05:41:31Z)
- Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [110.97778888305506]
BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question.
BRIDGE attained state-of-the-art performance on popular cross-DB text-to-SQL benchmarks.
Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks.
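A minimal sketch of that tagged-sequence idea follows; the tag tokens and the matching dictionary below are hypothetical, not BRIDGE's exact serialization format:

```python
# A minimal sketch (hypothetical tags, not BRIDGE's exact format) of
# serializing a question and DB schema into one tagged sequence, where
# fields are augmented with cell values mentioned in the question.
def serialize(question, schema, cell_matches):
    # schema: {table: [column, ...]}
    # cell_matches: {(table, column): cell value found in the question}
    parts = [question]
    for table, columns in schema.items():
        parts.append(f"<table> {table}")
        for column in columns:
            parts.append(f"<column> {column}")
            value = cell_matches.get((table, column))
            if value is not None:  # augment the field with the matched value
                parts.append(f"<value> {value}")
    return " ".join(parts)

print(serialize(
    "Which city hosted the 2020 games?",
    {"games": ["year", "city"]},
    {("games", "year"): "2020"},
))
```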
arXiv Detail & Related papers (2020-12-23T12:33:52Z)
- Hybrid Ranking Network for Text-to-SQL [9.731436359069493]
We propose a neat approach called Hybrid Ranking Network (HydraNet) which breaks down the problem into column-wise ranking and decoding.
Experiments on the WikiSQL dataset show that the proposed approach is very effective, achieving the top place on the leaderboard.
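A minimal sketch of the column-wise ranking step follows; the encoder, tokenizer, and scoring head are placeholders, not HydraNet's exact architecture:

```python
# A minimal sketch (placeholder components, not HydraNet's code):
# encode each (question, column) pair independently, score it, and
# rank columns by that score before decoding.
import torch.nn as nn

class ColumnRanker(nn.Module):
    def __init__(self, encoder, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder                  # e.g. a BERT-style encoder
        self.scorer = nn.Linear(hidden_size, 1)

    def rank(self, question, columns, tokenizer, top_k=1):
        scores = []
        for column in columns:
            inputs = tokenizer(question, column, return_tensors="pt")
            cls = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS]
            scores.append(self.scorer(cls).item())
        order = sorted(range(len(columns)), key=lambda i: -scores[i])
        return [columns[i] for i in order[:top_k]]
```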
arXiv Detail & Related papers (2020-08-11T15:01:52Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)