Related papers: CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL

URL: http://arxiv.org/abs/2311.01173v1
Date: Thu, 2 Nov 2023 12:13:52 GMT
Title: CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL
Authors: Mayank Kothyari, Dhruva Dhingra, Sunita Sarawagi, Soumen Chakrabarti
Abstract summary: Existing Text-to-SQL generators require the entire schema to be encoded with the user text. Standard dense retrieval techniques are inadequate for schema subsetting of a large structured database. We introduce three benchmarks for schema subsetting on large databases.
Score: 47.14954737590405
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing Text-to-SQL generators require the entire schema to be encoded with the user text. This is expensive or impractical for large databases with tens of thousands of columns. Standard dense retrieval techniques are inadequate for schema subsetting of a large structured database, where the correct semantics of retrieval demands that we rank sets of schema elements rather than individual elements. In response, we propose a two-stage process for effective coverage during retrieval. First, we instruct an LLM to hallucinate a minimal DB schema deemed adequate to answer the query. We use the hallucinated schema to retrieve a subset of the actual schema, by composing the results from multiple dense retrievals. Remarkably, hallucination, generally considered a nuisance, turns out to be actually useful as a bridging mechanism. Since no benchmarks exist for schema subsetting on large databases, we introduce three benchmarks. Two semi-synthetic datasets are derived from the union of schemas in two well-known datasets, SPIDER and BIRD, resulting in 4502 and 798 schema elements respectively. A real-life benchmark called SocialDB is sourced from an actual large data warehouse comprising 17844 schema elements. We show that our method leads to significantly higher recall than SOTA retrieval-based augmentation methods.
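
The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the LLM call is replaced by a hard-coded hallucinated schema, the dense encoder is replaced by a toy bag-of-words cosine similarity, and the greedy coverage objective is an assumed stand-in for whatever set-level objective the paper uses. All names (`embed`, `cosine`, `collective_retrieve`, the example schema) are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a dense encoder."""
    tokens = text.lower().replace(".", " ").replace("_", " ").split()
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def collective_retrieve(hallucinated, schema, budget):
    """Greedily select up to `budget` real schema elements that together
    cover the hallucinated elements (one retrieval per hallucinated element,
    composed through a shared coverage score)."""
    sims = {h: {s: cosine(embed(h), embed(s)) for s in schema} for h in hallucinated}
    covered = {h: 0.0 for h in hallucinated}  # best similarity achieved so far
    selected = []
    for _ in range(min(budget, len(schema))):
        candidates = [s for s in schema if s not in selected]
        # Marginal gain: how much a candidate improves coverage across ALL
        # hallucinated elements, not its score against any single one.
        best = max(
            candidates,
            key=lambda s: sum(max(sims[h][s] - covered[h], 0.0) for h in hallucinated),
        )
        selected.append(best)
        for h in hallucinated:
            covered[h] = max(covered[h], sims[h][best])
    return selected


# Stage 1 (assumed): schema an LLM might hallucinate for
# "What are the names and ages of students taking each course?"
hallucinated = ["student.name", "student.age", "course.title"]

# Stage 2: retrieve a covering subset of the (hypothetical) real schema.
schema = [
    "students.full_name",
    "students.age",
    "courses.title",
    "orders.total",
    "employees.age",
]
selected = collective_retrieve(hallucinated, schema, budget=3)
print(selected)
```

In this toy run, `employees.age` matches `student.age` as well as `students.age` does, so ranking elements individually would place it above `students.full_name`. The set-level objective skips it because `student.age` is already covered, which is the distinction the abstract draws between ranking sets and ranking individual elements.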

Related papers

LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL [14.677024710675838]
LinkAlign is a novel framework that can effectively adapt existing baselines to real-world environments. We evaluate our method's performance on the SPIDER and BIRD benchmarks. LinkAlign ranks highest among models excluding those using long chain-of-thought reasoning LLMs.
arXiv Detail & Related papers (2025-03-24T11:53:06Z)
Extractive Schema Linking for Text-to-SQL [17.757832644216446]
Text-to-SQL is emerging as a practical interface for real-world databases. We introduce a new approach to adapt decoder-only LLMs to schema linking.
arXiv Detail & Related papers (2025-01-23T19:57:08Z)
SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark [4.049028351548513]
Different database models have a big impact on query complexity and performance. We present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark.
arXiv Detail & Related papers (2024-11-08T12:27:13Z)
RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction. Benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o. Our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting DeepSeek (much cheaper) with the same intact prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z)
The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models [0.9149661171430259]
We revisit schema linking when using the latest generation of large language models (LLMs). We find empirically that newer models are adept at utilizing relevant schema elements during generation even in the presence of large numbers of irrelevant ones. Instead of filtering contextual information, we highlight techniques such as augmentation, selection, and correction, and adopt them to improve the accuracy of our Text-to-SQL pipeline.
arXiv Detail & Related papers (2024-08-14T17:59:04Z)
Benchmarking and Improving Text-to-SQL Generation under Ambiguity [25.283118418288293]
We develop a novel benchmark called AmbiQT where each text is interpretable as two plausible SQLs due to lexical and/or structural ambiguity. We propose LogicalBeam, a new decoding algorithm that navigates the SQL logic space using a blend of plan-based template generation and constrained infilling.
arXiv Detail & Related papers (2023-10-20T17:00:53Z)
UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems. It is composed of publicly available text-to-SQL datasets and 29K databases. Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
Improving Text-to-SQL Semantic Parsing with Fine-grained Query Understanding [84.04706075621013]
We present a general-purpose, modular neural semantic parsing framework based on token-level fine-grained query understanding. Our framework consists of three modules: named entity recognizer (NER), neural entity linker (NEL) and neural semantic parser (NSP).
arXiv Detail & Related papers (2022-09-28T21:00:30Z)
Semantic Enhanced Text-to-SQL Parsing via Iteratively Learning Schema Linking Graph [6.13728903057727]
Generalizability to new databases is of vital importance to Text-to-SQL systems, which aim to parse human utterances into SQL statements. In this paper, we propose a framework named ISESL-SQL to iteratively build a semantically enhanced schema-linking graph between question tokens and database schemas. Extensive experiments on three benchmarks demonstrate that ISESL-SQL consistently outperforms the baselines, and further investigations show its generalizability and robustness.
arXiv Detail & Related papers (2022-08-08T03:59:33Z)
Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing [66.55478402233399]
We propose a framework to elicit relational structures via a probing procedure based on the Poincaré distance metric. Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences. Our framework sets new state-of-the-art performance on three benchmarks.
arXiv Detail & Related papers (2022-06-28T14:05:25Z)
ShadowGNN: Graph Projection Neural Network for Text-to-SQL Parser [36.12921337235763]
We propose a new architecture, ShadowGNN, which processes schemas at abstract and semantic levels. On the challenging Text-to-SQL benchmark Spider, empirical results show that ShadowGNN outperforms state-of-the-art models.
arXiv Detail & Related papers (2021-04-10T05:48:28Z)
Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [110.97778888305506]
BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. BRIDGE attained state-of-the-art performance on popular cross-DB text-to-SQL benchmarks. Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks.
arXiv Detail & Related papers (2020-12-23T12:33:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.