CHESS: Contextual Harnessing for Efficient SQL Synthesis
- URL: http://arxiv.org/abs/2405.16755v2
- Date: Thu, 27 Jun 2024 17:13:32 GMT
- Title: CHESS: Contextual Harnessing for Efficient SQL Synthesis
- Authors: Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, Amin Saberi,
- Abstract summary: We propose a new pipeline that retrieves relevant data and context, selects an efficient schema, and synthesizes correct and efficient queries.
Our method achieves new state-of-the-art performance on the cross-domain challenging BIRD dataset.
- Score: 1.9506402593665235
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Utilizing large language models (LLMs) for transforming natural language questions into SQL queries (text-to-SQL) is a promising yet challenging approach, particularly when applied to real-world databases with complex and extensive schemas. In particular, effectively incorporating data catalogs and database values for SQL generation remains an obstacle, leading to suboptimal solutions. We address this problem by proposing a new pipeline that effectively retrieves relevant data and context, selects an efficient schema, and synthesizes correct and efficient SQL queries. To increase retrieval precision, our pipeline introduces a hierarchical retrieval method leveraging model-generated keywords, locality-sensitive hashing indexing, and vector databases. Additionally, we have developed an adaptive schema pruning technique that adjusts based on the complexity of the problem and the model's context size. Our approach generalizes to both frontier proprietary models like GPT-4 and open-source models such as Llama-3-70B. Through a series of ablation studies, we demonstrate the effectiveness of each component of our pipeline and its impact on the end-to-end performance. Our method achieves new state-of-the-art performance on the cross-domain challenging BIRD dataset.
Related papers
- E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL [1.187832944550453]
We introduce E- repository, a novel pipeline designed to address challenges through direct schema linking and candidate predicate augmentation.
E- enhances the natural language query by incorporating relevant database items (i.e. tables, columns, and values) and conditions directly into the question, bridging the gap between the query and the database structure.
We investigate the impact of schema filtering, a technique widely explored in previous work, and demonstrate its diminishing returns when applied alongside advanced large language models.
arXiv Detail & Related papers (2024-09-25T09:02:48Z) - Synthesizing Text-to-SQL Data from Weak and Strong LLMs [68.69270834311259]
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to- tasks.
We introduce a synthetic data approach that combines data produced by larger, more powerful models with error information data generated by smaller, not well-aligned models.
arXiv Detail & Related papers (2024-08-06T15:40:32Z) - UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics.
We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z) - RH-SQL: Refined Schema and Hardness Prompt for Text-to-SQL [1.734218686180302]
This paper introduces a method for Text-to- Execute based on Refined Execution Model and Hardness Prompt.
It reduces storage and training costs while maintaining performance.
Our experiments on the Spider dataset, specifically with large-scale LMs, achieved an exceptional accuracy (EX) of 82.6%.
arXiv Detail & Related papers (2024-06-13T14:04:34Z) - DFIN-SQL: Integrating Focused Schema with DIN-SQL for Superior Accuracy
in Large-Scale Databases [0.0]
This paper introduces DFIN, an innovative extension of DIN-composed (Decomposed-In-Context)
DFIN enhances Text-to-composed conversion by addressing schema linking errors, which are a major source of inaccuracies.
Our evaluation on the BIRD dataset, a challenging real-world benchmark, demonstrates that DFIN not only efficiently but also improves accuracy, achieving a score of 51.69.
arXiv Detail & Related papers (2024-03-01T07:14:45Z) - SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the framework for enhancing Text-to- filtering using large language models (LLMs)
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we delve deep in understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z) - SPSQL: Step-by-step Parsing Based Framework for Text-to-SQL Generation [13.196264569882777]
The current mainstream end-to-end Text2 model is not only difficult to build due to its complex structure and high requirements for training data, but also difficult to adjust due to massive parameters.
This paper proposes a pipeline method: SP Experiments to achieve the desired result.
We construct the dataset based on the marketing business data of the State Grid Corporation of China.
arXiv Detail & Related papers (2023-05-10T10:01:36Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-weighted algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from schema, imposes strong typing, and schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z) - Proton: Probing Schema Linking Information from Pre-trained Language
Models for Text-to-SQL Parsing [66.55478402233399]
We propose a framework to elicit relational structures via a probing procedure based on Poincar'e distance metric.
Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences.
Our framework sets new state-of-the-art performance on three benchmarks.
arXiv Detail & Related papers (2022-06-28T14:05:25Z) - IGSQL: Database Schema Interaction Graph Based Neural Model for
Context-Dependent Text-to-SQL Generation [61.09660709356527]
We propose a database schema interaction graph encoder to utilize historicalal information of database schema items.
We evaluate our model on the benchmark SParC and Co datasets.
arXiv Detail & Related papers (2020-11-11T12:56:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.