Structure-Grounded Pretraining for Text-to-SQL
- URL: http://arxiv.org/abs/2010.12773v3
- Date: Wed, 31 Aug 2022 00:19:41 GMT
- Title: Structure-Grounded Pretraining for Text-to-SQL
- Authors: Xiang Deng, Ahmed Hassan Awadallah, Christopher Meek, Oleksandr
Polozov, Huan Sun, Matthew Richardson
- Abstract summary: We present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL.
We identify a set of novel prediction tasks: column grounding, value grounding and column-value mapping, and leverage them to pretrain a text-table encoder.
- Score: 75.19554243393814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to capture text-table alignment is essential for tasks like
text-to-SQL. A model needs to correctly recognize natural language references
to columns and values and to ground them in the given database schema. In this
paper, we present a novel weakly supervised Structure-Grounded pretraining
framework (StruG) for text-to-SQL that can effectively learn to capture
text-table alignment based on a parallel text-table corpus. We identify a set
of novel prediction tasks: column grounding, value grounding and column-value
mapping, and leverage them to pretrain a text-table encoder. Additionally, to
evaluate different methods under more realistic text-table alignment settings,
we create a new evaluation set Spider-Realistic based on Spider dev set with
explicit mentions of column names removed, and adopt eight existing text-to-SQL
datasets for cross-database evaluation. STRUG brings significant improvement
over BERT-LARGE in all settings. Compared with existing pretraining methods
such as GRAPPA, STRUG achieves similar performance on Spider, and outperforms
all baselines on more realistic sets. The Spider-Realistic dataset is available
at https://doi.org/10.5281/zenodo.5205322.
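The three grounding objectives named in the abstract can be pictured with a small sketch. Below is a minimal, hypothetical PyTorch rendering, not the authors' released code: column grounding scores each schema column for whether the question mentions it, value grounding scores each question token for whether it is part of a cell value, and column-value mapping aligns value tokens with columns. The linear heads, the dot-product alignment, and the summed BCE loss are all assumptions for illustration.

```python
# A minimal sketch (assumptions, not the authors' released code) of the
# three grounding objectives, on top of a text-table encoder that yields
# one vector per question token and one per schema column.
import torch
import torch.nn as nn

class GroundingHeads(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.column_scorer = nn.Linear(hidden_size, 1)  # column grounding
        self.value_scorer = nn.Linear(hidden_size, 1)   # value grounding
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, token_vecs, column_vecs,
                column_labels, value_labels, map_labels):
        # Column grounding: is each column mentioned in the question?
        col_logits = self.column_scorer(column_vecs).squeeze(-1)
        # Value grounding: is each question token part of a cell value?
        val_logits = self.value_scorer(token_vecs).squeeze(-1)
        # Column-value mapping: align value tokens to columns via
        # dot-product similarity between token and column vectors.
        map_logits = token_vecs @ column_vecs.transpose(-1, -2)
        return (self.bce(col_logits, column_labels)
                + self.bce(val_logits, value_labels)
                + self.bce(map_logits, map_labels))
```

Per the abstract, the binary labels for such heads would come weakly from the parallel text-table corpus rather than from manual annotation.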
Related papers
- UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
- Augmenting Multi-Turn Text-to-SQL Datasets with Self-Play [46.07002748587857]
We explore augmenting the training datasets using self-play, which leverages contextual information to synthesize new interactions.
We find that self-play improves the accuracy of a strong baseline on SParC and CoSQL, two widely used text-to-SQL datasets.
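As a rough illustration of that loop, here is a minimal, hypothetical sketch; question_gen, parser, and the validity filter are placeholder components, not the paper's actual method:

```python
# A rough sketch (hypothetical components, not the paper's code) of
# self-play augmentation for multi-turn text-to-SQL: a generator
# proposes a follow-up question from the dialogue context, a parser
# labels it with SQL, and accepted pairs become new interactions.
def self_play_augment(seed_dialogues, question_gen, parser, n_rounds=1):
    augmented = []
    for dialogue in seed_dialogues:
        context = list(dialogue)
        for _ in range(n_rounds):
            question = question_gen(context)  # synthesize a follow-up turn
            sql = parser(context, question)   # label it with a parser
            if sql is None:                   # simple validity filter
                break
            context.append((question, sql))
            augmented.append(list(context))
    return augmented
```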
arXiv Detail & Related papers (2022-10-21T16:40:07Z)
- STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing [64.80483736666123]
We propose a novel pre-training framework STAR for context-dependent text-to-SQL parsing.
In addition, we construct a large-scale context-dependent text-to-SQL conversation corpus to pre-train STAR.
Extensive experiments show that STAR achieves new state-of-the-art performance on two downstream benchmarks.
arXiv Detail & Related papers (2022-10-21T11:30:07Z) - A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future
Directions [102.8606542189429]
The goal of text-to-SQL parsing is to convert a natural language (NL) question into its corresponding structured query language (SQL) based on the evidence provided by databases.
Deep neural networks have significantly advanced this task via neural generation models, which automatically learn a mapping function from an input NL question to an output SQL query.
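That mapping function is typically realized as a sequence-to-sequence model. A minimal sketch with an off-the-shelf T5 checkpoint follows; it is illustrative only, the input template is an assumption, and an untuned t5-small will not emit valid SQL:

```python
# Illustrative only: shows the seq2seq interface for NL-question ->
# SQL-query mapping; the "translate to SQL" template is an assumption,
# and a checkpoint must be fine-tuned before it produces real SQL.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "How many singers are older than 30?"
schema = "singer(id, name, age)"
inputs = tokenizer(f"translate to SQL: {question} | {schema}",
                   return_tensors="pt")
output = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```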
arXiv Detail & Related papers (2022-08-29T14:24:13Z) - Self-supervised Text-to-SQL Learning with Header Alignment Training [4.518012967046983]
Self-supervised learning is a de facto component in the recent success of deep learning in various fields.
We propose a novel self-supervised learning framework to tackle the discrepancy between a self-supervised learning objective and a task-specific objective.
Our method is effective for training the model with scarce labeled data.
arXiv Detail & Related papers (2021-03-11T01:09:59Z)
- GP: Context-free Grammar Pre-training for Text-to-SQL Parsers [7.652782364282768]
Grammar Pre-training (GP) is proposed to decode deep relations between the question and the database.
Experiments show that our method converges more easily during training and exhibits excellent robustness.
arXiv Detail & Related papers (2021-01-25T05:41:31Z)
- Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [110.97778888305506]
BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question.
BRIDGE attained state-of-the-art performance on popular cross-DB text-to-SQL benchmarks.
Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks.
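A minimal sketch of that tagged-sequence idea follows; the tag tokens and the matching dictionary below are hypothetical, not BRIDGE's exact serialization format:

```python
# A minimal sketch (hypothetical tags, not BRIDGE's exact format) of
# serializing a question and DB schema into one tagged sequence, where
# fields are augmented with cell values mentioned in the question.
def serialize(question, schema, cell_matches):
    # schema: {table: [column, ...]}
    # cell_matches: {(table, column): cell value found in the question}
    parts = [question]
    for table, columns in schema.items():
        parts.append(f"<table> {table}")
        for column in columns:
            parts.append(f"<column> {column}")
            value = cell_matches.get((table, column))
            if value is not None:  # augment the field with the matched value
                parts.append(f"<value> {value}")
    return " ".join(parts)

print(serialize(
    "Which city hosted the 2020 games?",
    {"games": ["year", "city"]},
    {("games", "year"): "2020"},
))
```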
arXiv Detail & Related papers (2020-12-23T12:33:52Z)
- Hybrid Ranking Network for Text-to-SQL [9.731436359069493]
We propose a neat approach called Hybrid Ranking Network (HydraNet) which breaks down the problem into column-wise ranking and decoding.
Experiments on the WikiSQL dataset show that the proposed approach is very effective, achieving the top place on the leaderboard.
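A minimal sketch of the column-wise ranking step follows; the encoder, tokenizer, and scoring head are placeholders, not HydraNet's exact architecture:

```python
# A minimal sketch (placeholder components, not HydraNet's code):
# encode each (question, column) pair independently, score it, and
# rank columns by that score before decoding.
import torch.nn as nn

class ColumnRanker(nn.Module):
    def __init__(self, encoder, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder                  # e.g. a BERT-style encoder
        self.scorer = nn.Linear(hidden_size, 1)

    def rank(self, question, columns, tokenizer, top_k=1):
        scores = []
        for column in columns:
            inputs = tokenizer(question, column, return_tensors="pt")
            cls = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS]
            scores.append(self.scorer(cls).item())
        order = sorted(range(len(columns)), key=lambda i: -scores[i])
        return [columns[i] for i in order[:top_k]]
```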
arXiv Detail & Related papers (2020-08-11T15:01:52Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)