CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL
- URL: http://arxiv.org/abs/2311.01173v1
- Date: Thu, 2 Nov 2023 12:13:52 GMT
- Title: CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL
- Authors: Mayank Kothyari, Dhruva Dhingra, Sunita Sarawagi, Soumen Chakrabarti
- Abstract summary: Existing Text-to-SQL generators require the entire schema to be encoded with the user text.
Standard dense retrieval techniques are inadequate for schema subsetting of a large structured database.
We introduce three benchmarks for schema subsetting on large databases.
- Score: 47.14954737590405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Text-to-SQL generators require the entire schema to be encoded with
the user text. This is expensive or impractical for large databases with tens
of thousands of columns. Standard dense retrieval techniques are inadequate for
schema subsetting of a large structured database, where the correct semantics
of retrieval demands that we rank sets of schema elements rather than
individual elements. In response, we propose a two-stage process for effective
coverage during retrieval. First, we instruct an LLM to hallucinate a minimal
DB schema deemed adequate to answer the query. We use the hallucinated schema
to retrieve a subset of the actual schema, by composing the results from
multiple dense retrievals. Remarkably, hallucination, generally considered a
nuisance, turns out to be actually useful as a bridging mechanism. Since no
benchmarks exist for schema
subsetting on large databases, we introduce three benchmarks. Two
semi-synthetic datasets are derived from the union of schemas in two well-known
datasets, SPIDER and BIRD, resulting in 4502 and 798 schema elements
respectively. A real-life benchmark called SocialDB is sourced from an actual
large data warehouse comprising 17844 schema elements. We show that our method
leads to significantly higher recall than SOTA retrieval-based augmentation
methods.
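The two-stage process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the hallucinated schema is hard-coded where the paper would call an LLM, a bag-of-words cosine similarity stands in for a dense encoder, and a simple round-robin merge stands in for the paper's collective retrieval objective. All function names, the example schema, and the `budget` and `k` parameters are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: a bag of lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, schema, k):
    """Return the k real schema elements most similar to one query string."""
    q = embed(query)
    ranked = sorted(schema, key=lambda el: cosine(q, embed(el)), reverse=True)
    return ranked[:k]


def subset_schema(hallucinated, schema, budget, k=3):
    """Merge one retrieval per hallucinated element into a single subset.

    Round-robin by rank, so every hallucinated element gets its best match
    before any element contributes a second candidate (an assumption made
    for this sketch, not the paper's objective).
    """
    ranked_lists = [retrieve(h, schema, k) for h in hallucinated]
    chosen = []
    for rank in range(k):
        for candidates in ranked_lists:
            if rank < len(candidates) and candidates[rank] not in chosen:
                if len(chosen) == budget:
                    return chosen
                chosen.append(candidates[rank])
    return chosen


def recall(retrieved, gold):
    """Fraction of gold schema elements present in the retrieved subset."""
    return len(set(retrieved) & set(gold)) / len(gold)


# Hypothetical real schema, written as table.column strings.
schema = [
    "singer.name", "singer.country", "singer.age",
    "concert.concert_name", "concert.year", "concert.stadium_id",
    "stadium.name", "stadium.capacity",
    "employee.salary", "employee.department",
]

# Stage 1 (assumed LLM output for "Which singers performed in 2014?").
hallucinated = ["singer.full_name", "concert.year", "concert.venue_name"]

# Stage 2: compose the per-element retrievals under a size budget.
subset = subset_schema(hallucinated, schema, budget=3)
print(subset)
print(recall(subset, ["singer.name", "concert.year"]))
```

The hallucinated names need not exist in the real schema; they only have to land near the right elements in the similarity space, which is the bridging role the abstract attributes to hallucination.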
Related papers
- SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark [4.049028351548513]
Different database models have a big impact on query complexity and performance.
We present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark.
arXiv Detail & Related papers (2024-11-08T12:27:13Z)
- RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction.
Experimental results on the BIRD and Spider benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o.
Our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting DeepSeek (much cheaper) with the same intact prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z)
- The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models [0.9149661171430259]
We revisit schema linking when using the latest generation of large language models (LLMs).
We find empirically that newer models are adept at utilizing relevant schema elements during generation even in the presence of large numbers of irrelevant ones.
Instead of filtering contextual information, we highlight techniques such as augmentation, selection, and correction, and adopt them to improve the accuracy of our Text-to-SQL pipeline.
arXiv Detail & Related papers (2024-08-14T17:59:04Z)
- Benchmarking and Improving Text-to-SQL Generation under Ambiguity [25.283118418288293]
We develop a novel benchmark called AmbiQT where each text is interpretable as two plausible SQLs due to lexical and/or structural ambiguity.
We propose LogicalBeam, a new decoding algorithm that navigates the SQL logic space using a blend of plan-based template generation and constrained infilling.
arXiv Detail & Related papers (2023-10-20T17:00:53Z)
- UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
- Improving Text-to-SQL Semantic Parsing with Fine-grained Query Understanding [84.04706075621013]
We present a general-purpose, modular neural semantic parsing framework based on token-level fine-grained query understanding.
Our framework consists of three modules: named entity recognizer (NER), neural entity linker (NEL) and neural semantic parser (NSP).
arXiv Detail & Related papers (2022-09-28T21:00:30Z)
- Semantic Enhanced Text-to-SQL Parsing via Iteratively Learning Schema Linking Graph [6.13728903057727]
The generalizability to new databases is of vital importance to Text-to-SQL systems, which aim to parse human utterances into SQL statements.
In this paper, we propose a framework named ISESL to iteratively build an enhanced semantic schema-linking graph between question tokens and database schemas.
Extensive experiments on three benchmarks demonstrate that ISESL could consistently outperform the baselines, and further investigations show its generalizability and robustness.
arXiv Detail & Related papers (2022-08-08T03:59:33Z)
- Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing [66.55478402233399]
We propose a framework to elicit relational structures via a probing procedure based on the Poincaré distance metric.
Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences.
Our framework sets new state-of-the-art performance on three benchmarks.
arXiv Detail & Related papers (2022-06-28T14:05:25Z)
- ShadowGNN: Graph Projection Neural Network for Text-to-SQL Parser [36.12921337235763]
We propose a new architecture, ShadowGNN, which processes schemas at abstract and semantic levels.
On the challenging Text-to-SQL benchmark Spider, empirical results show that ShadowGNN outperforms state-of-the-art models.
arXiv Detail & Related papers (2021-04-10T05:48:28Z)
- Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [110.97778888305506]
BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question.
BRIDGE attained state-of-the-art performance on popular cross-DB text-to-SQL benchmarks.
Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks.
arXiv Detail & Related papers (2020-12-23T12:33:52Z)