Related papers: DocuT5: Seq2seq SQL Generation with Table Documentation

URL: http://arxiv.org/abs/2211.06193v1
Date: Fri, 11 Nov 2022 13:31:55 GMT
Title: DocuT5: Seq2seq SQL Generation with Table Documentation
Authors: Elena Soare, Iain Mackie, Jeffrey Dalton
Abstract summary: We develop a new text-to-SQL failure taxonomy and find that 19.6% of errors are due to foreign key mistakes. We propose DocuT5, a method that captures knowledge from (1) table structure context of foreign keys and (2) domain knowledge through contextualizing tables and columns. Both types of knowledge improve over state-of-the-art T5 with constrained decoding on Spider, and domain knowledge produces effectiveness comparable to the state of the art on the Spider-DK and Spider-SYN datasets.
Score: 5.586191108738563
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current SQL generators based on pre-trained language models struggle to answer complex questions requiring domain context or understanding fine-grained table structure. Humans would deal with these unknowns by reasoning over the documentation of the tables. Based on this hypothesis, we propose DocuT5, which uses an off-the-shelf language model architecture and injects knowledge from external 'documentation' to improve domain generalization. We perform experiments on the Spider family of datasets, which contain complex questions that are cross-domain and multi-table. Specifically, we develop a new text-to-SQL failure taxonomy and find that 19.6% of errors are due to foreign key mistakes, and 49.2% are due to a lack of domain knowledge. We propose DocuT5, a method that captures knowledge from (1) table structure context of foreign keys and (2) domain knowledge through contextualizing tables and columns. Both types of knowledge improve over state-of-the-art T5 with constrained decoding on Spider, and domain knowledge produces effectiveness comparable to the state of the art on the Spider-DK and Spider-SYN datasets.
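
Illustration (not from the paper): the abstract describes adding foreign key context and table/column documentation to the input of a seq2seq model. A minimal sketch of what such an input serialization might look like is given below. The function name serialize_input, the separators, the layout, and the example schema are all hypothetical assumptions chosen for illustration; the paper's actual serialization format may differ.

```python
from typing import Dict, List, Tuple


def serialize_input(
    question: str,
    db_id: str,
    tables: Dict[str, List[str]],
    foreign_keys: List[Tuple[str, str]],
    column_docs: Dict[str, str],
) -> str:
    """Flatten a question, schema, foreign keys, and column docs into one string.

    The separators and layout are illustrative assumptions, not the
    serialization used in the paper.
    """
    table_parts = []
    for table, columns in tables.items():
        rendered = []
        for column in columns:
            doc = column_docs.get(f"{table}.{column}")
            # Append the documentation in parentheses when it exists.
            rendered.append(f"{column} ({doc})" if doc else column)
        table_parts.append(f"{table} : {' , '.join(rendered)}")

    # Foreign keys are written as explicit "child = parent" links.
    fk_part = " , ".join(f"{src} = {dst}" for src, dst in foreign_keys)

    pieces = [question, db_id] + table_parts
    if fk_part:
        pieces.append(f"foreign keys : {fk_part}")
    return " | ".join(pieces)


# Hypothetical schema and documentation, used only to exercise the function.
example = serialize_input(
    question="How many singers are from France?",
    db_id="concert_singer",
    tables={"singer": ["singer_id", "country"], "concert": ["concert_id", "singer_id"]},
    foreign_keys=[("concert.singer_id", "singer.singer_id")],
    column_docs={"singer.country": "country of origin"},
)
print(example)
```

The resulting string would be passed to the encoder in place of the plain question-plus-schema input, so the model sees which columns link tables together and what each documented column means.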

Related papers

Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup [6.249316460506702]
We identify two important gaps: the structural mapping gap and the lexical mapping gap. PAS-SQL sets a new state-of-the-art on the Spider benchmark with an execution accuracy of 87.9%, and achieves leading results on the BIRD dataset with an execution accuracy of 64.67%.
arXiv Detail & Related papers (2025-02-20T16:11:27Z)
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows [64.94146689665628]
Spider 2.0 is an evaluation framework for real-world text-to-SQL problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake. We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases.
arXiv Detail & Related papers (2024-11-12T12:52:17Z)
RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction. Benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o. Our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting DeepSeek (much cheaper) with the same intact prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z)
TANQ: An open domain dataset of table answered questions [15.323690523538572]
TANQ is the first open domain question answering dataset where the answers require building tables from information across multiple sources. We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups. Our best-performing baseline, GPT4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points.
arXiv Detail & Related papers (2024-05-13T14:07:20Z)
Domain Adaptation of a State of the Art Text-to-SQL Model: Lessons Learned and Challenges Found [1.9963385352536616]
We analyze how well the base T5 Language Model and Picard perform on query structures different from the Spider dataset. We present an alternative way to disambiguate the values in an input question using a rule-based approach.
arXiv Detail & Related papers (2023-12-09T03:30:21Z)
UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems. It is composed of publicly available text-to-SQL datasets and 29K databases. Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
QURG: Question Rewriting Guided Context-Dependent Text-to-SQL Semantic Parsing [46.05006486399823]
This paper presents QURG, a novel Question Rewriting Guided approach to help the models achieve adequate contextual understanding. We first train a question rewriting model to complete the current question based on question context, and convert them into a rewriting edit matrix. We further design a two-stream matrix encoder to jointly model rewriting relations between question and context, and the schema linking relations between natural language and structured schema.
arXiv Detail & Related papers (2023-05-11T08:45:55Z)
Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge [54.85168428642474]
We build a new Chinese benchmark, KnowSQL, consisting of domain-specific questions covering various domains. We then address this problem by presenting formulaic knowledge, rather than by annotating additional data examples. More concretely, we construct a formulaic knowledge bank as a domain knowledge base and propose a framework (ReGrouP) to leverage this formulaic knowledge during parsing.
arXiv Detail & Related papers (2023-01-03T12:37:47Z)
Towards Generalizable and Robust Text-to-SQL Parsing [77.18724939989647]
We propose a novel TKK framework consisting of Task decomposition, Knowledge acquisition, and Knowledge composition to learn text-to-SQL parsing in stages. We show that our framework is effective in all scenarios and achieves state-of-the-art performance on the Spider, SParC, and CoSQL datasets.
arXiv Detail & Related papers (2022-10-23T09:21:27Z)
Prefix-to-SQL: Text-to-SQL Generation from Incomplete User Questions [33.48258057604425]
We propose a new task, prefix-to-SQL, which takes a question prefix from users as the input and predicts the intended SQL. We construct a new benchmark called PAGSAS that contains 124K user question prefixes and the intended SQL for 5 sub-tasks: Advising, GeoQuery, Scholar, ATIS, and Spider. As we observe that the difficulty of prefix-to-SQL is related to the number of omitted tokens, we incorporate curriculum learning by feeding examples with an increasing number of omitted tokens.
arXiv Detail & Related papers (2021-09-15T14:28:18Z)
Dual Reader-Parser on Hybrid Textual and Tabular Evidence for Open Domain Question Answering [78.9863753810787]
A large amount of the world's knowledge is stored in structured databases. Query languages can answer questions that require complex reasoning, as well as offering full explainability.
arXiv Detail & Related papers (2021-08-05T22:04:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.