DocuT5: Seq2seq SQL Generation with Table Documentation
- URL: http://arxiv.org/abs/2211.06193v1
- Date: Fri, 11 Nov 2022 13:31:55 GMT
- Title: DocuT5: Seq2seq SQL Generation with Table Documentation
- Authors: Elena Soare, Iain Mackie, Jeffrey Dalton
- Abstract summary: We develop a new text-to-SQL failure taxonomy and find that 19.6% of errors are due to foreign key mistakes.
We propose DocuT5, a method that captures knowledge from (1) table structure context of foreign keys and (2) domain knowledge through contextualizing tables and columns.
Both types of knowledge improve over state-of-the-art T5 with constrained decoding on Spider, and domain knowledge produces state-of-the-art comparable effectiveness on Spider-DK and Spider-SYN datasets.
- Score: 5.586191108738563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current SQL generators based on pre-trained language models struggle to
answer complex questions requiring domain context or understanding fine-grained
table structure. Humans would deal with these unknowns by reasoning over the
documentation of the tables. Based on this hypothesis, we propose DocuT5, which
uses off-the-shelf language model architecture and injects knowledge from
external 'documentation' to improve domain generalization. We perform
experiments on the Spider family of datasets that contain complex questions
that are cross-domain and multi-table. Specifically, we develop a new
text-to-SQL failure taxonomy and find that 19.6% of errors are due to foreign
key mistakes, and 49.2% are due to a lack of domain knowledge. We propose
DocuT5, a method that captures knowledge from (1) table structure context of
foreign keys and (2) domain knowledge through contextualizing tables and
columns. Both types of knowledge improve over state-of-the-art T5 with
constrained decoding on Spider, and domain knowledge produces state-of-the-art
comparable effectiveness on Spider-DK and Spider-SYN datasets.
Related papers
- Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup [6.249316460506702]
We identify two important gaps: the structural mapping gap and the lexical mapping gap.
PAS-SQL sets a new state-of-the-art on the Spider benchmark with an execution accuracy of 87.9%, and achieves leading results on the BIRD dataset with an execution accuracy of 64.67%.
arXiv Detail & Related papers (2025-02-20T16:11:27Z)
- Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows [64.94146689665628]
Spider 2.0 is an evaluation framework for real-world text-to-SQL problems derived from enterprise-level database use cases.
The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake.
We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases.
arXiv Detail & Related papers (2024-11-12T12:52:17Z)
- RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction.
Experiments on the BIRD and Spider benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o.
Our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting DeepSeek (much cheaper) with the same intact prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z)
- Domain Adaptation of a State of the Art Text-to-SQL Model: Lessons Learned and Challenges Found [1.9963385352536616]
We analyze how well the base T5 Language Model and Picard perform on query structures different from the Spider dataset.
We present an alternative way to disambiguate the values in an input question using a rule-based approach.
arXiv Detail & Related papers (2023-12-09T03:30:21Z)
- UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
- QURG: Question Rewriting Guided Context-Dependent Text-to-SQL Semantic Parsing [46.05006486399823]
This paper presents QURG, a novel Question Rewriting Guided approach to help the models achieve adequate contextual understanding.
We first train a question rewriting model to complete the current question based on question context, and convert them into a rewriting edit matrix.
We further design a two-stream matrix encoder to jointly model rewriting relations between question and context, and the schema linking relations between natural language and structured schema.
arXiv Detail & Related papers (2023-05-11T08:45:55Z)
- Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge [54.85168428642474]
We build a new Chinese benchmark KnowSQL consisting of domain-specific questions covering various domains.
We then address this problem by presenting formulaic knowledge, rather than by annotating additional data examples.
More concretely, we construct a formulaic knowledge bank as a domain knowledge base and propose a framework (ReGrouP) to leverage this formulaic knowledge during parsing.
arXiv Detail & Related papers (2023-01-03T12:37:47Z)
- Towards Generalizable and Robust Text-to-SQL Parsing [77.18724939989647]
We propose a novel TKK framework consisting of Task decomposition, Knowledge acquisition, and Knowledge composition to learn text-to-SQL parsing in stages.
We show that our framework is effective in all scenarios and achieves state-of-the-art performance on the Spider, SParC, and CoSQL datasets.
arXiv Detail & Related papers (2022-10-23T09:21:27Z)
- Prefix-to-SQL: Text-to-SQL Generation from Incomplete User Questions [33.48258057604425]
We propose a new task, prefix-to-SQL, which takes a question prefix from users as the input and predicts the intended SQL.
We construct a new benchmark called PAGSAS that contains 124K user question prefixes and the intended SQL for 5 sub-tasks: Advising, GeoQuery, Scholar, ATIS, and Spider.
As we observe that the difficulty of prefix-to-SQL is related to the number of omitted tokens, we incorporate curriculum learning by feeding examples with an increasing number of omitted tokens.
arXiv Detail & Related papers (2021-09-15T14:28:18Z)
- Dual Reader-Parser on Hybrid Textual and Tabular Evidence for Open Domain Question Answering [78.9863753810787]
A large amount of the world's knowledge is stored in structured databases.
Query languages can answer questions that require complex reasoning, as well as offering full explainability.
arXiv Detail & Related papers (2021-08-05T22:04:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.