DFIN-SQL: Integrating Focused Schema with DIN-SQL for Superior Accuracy
in Large-Scale Databases
- URL: http://arxiv.org/abs/2403.00872v1
- Date: Fri, 1 Mar 2024 07:14:45 GMT
- Title: DFIN-SQL: Integrating Focused Schema with DIN-SQL for Superior Accuracy
in Large-Scale Databases
- Authors: Shai Volvovsky, Marco Marcassa, Mustafa Panbiharwala
- Abstract summary: This paper introduces DFIN (Decomposed Focused-In-Context), an innovative extension of DIN-SQL (Decomposed-In-Context SQL).
DFIN enhances Text-to-SQL conversion by addressing schema linking errors, which are a major source of inaccuracies.
Our evaluation on the BIRD dataset, a challenging real-world benchmark, demonstrates that DFIN not only scales efficiently but also improves accuracy, achieving a score of 51.69.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of converting natural language queries into SQL queries is
intricate, necessitating a blend of precise techniques for an accurate
translation. The DIN-SQL (Decomposed-In-Context SQL) methodology represents a
significant development in this domain. This paper introduces DFIN (Decomposed
Focused-In-Context), an innovative extension of DIN-SQL that enhances
Text-to-SQL conversion by addressing schema linking errors, which are a major
source of inaccuracies. DFIN uniquely alternates between prompting techniques
and Retrieval-Augmented Generation (RAG), adapting to the size and complexity
of the database schema. A preprocessing phase embeds database definitions and
leverages annotated files, akin to those in the BIRD dataset, facilitating the
runtime retrieval of pertinent schema information. This strategy significantly
reduces the token count for schema linking prompts, enabling the use of a
standard GPT-4 model over its larger context variant, thus handling large-scale
databases more effectively and economically. Our evaluation on the BIRD
dataset, a challenging real-world benchmark, demonstrates that DFIN not only
scales efficiently but also improves accuracy, achieving a score of 51.69. This
result surpasses the DIN-SQL method (currently in third place), the
highest-ranked model employing in-context learning rather than fine-tuning,
which previously scored 50.72. The advancement of DFIN underscores the evolving
capabilities of in-context learning methodologies combined with advanced
language models, offering a promising avenue for future research in complex
Text-to-SQL conversion tasks.
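The switch the abstract describes, between placing the whole schema in the prompt and retrieving only the pertinent parts, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, `TOKEN_BUDGET` is an arbitrary threshold, and the function names and the example schema are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical threshold: schemas that fit in this many tokens are put in the
# prompt whole; larger schemas go through retrieval instead.
TOKEN_BUDGET = 40


def tokenize(text):
    """Lowercase alphanumeric tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy bag-of-words vector (stand-in for a learned embedding model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(schema):
    """Preprocessing phase: embed each table name with its description."""
    return {table: embed(f"{table} {desc}") for table, desc in schema.items()}


def select_schema(question, schema, index, top_k=2, budget=TOKEN_BUDGET):
    """Return (mode, tables): the full schema if small, else the top-k matches."""
    total = sum(len(tokenize(f"{t} {d}")) for t, d in schema.items())
    if total <= budget:
        return "prompt", sorted(schema)
    q = embed(question)
    ranked = sorted(index, key=lambda t: cosine(q, index[t]), reverse=True)
    return "rag", ranked[:top_k]


# Hypothetical annotated schema: table name -> column description.
schema = {
    "customers": "customer id, name, city, and signup date",
    "orders": "order id, customer id, order date, and total amount",
    "products": "product id, product name, category, and unit price",
    "shipments": "shipment id, order id, carrier, and delivery status",
}
index = build_index(schema)
question = "What is the total amount of orders placed by each customer?"

print(select_schema(question, schema, index))             # small schema path
print(select_schema(question, schema, index, budget=20))  # forced retrieval path
```

With the default budget, the example schema is small enough to be passed whole; lowering `budget` forces the retrieval path, which keeps only the tables most similar to the question and so shortens the schema-linking prompt.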
Related papers
- RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction.
Experiments on the BIRD and Spider benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o.
Our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting DeepSeek (much cheaper) with the same intact prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z)
- TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding.
TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs.
Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
arXiv Detail & Related papers (2024-10-07T04:15:02Z)
- Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement [1.392448435105643]
Text-to-SQL enables non-expert users to effortlessly retrieve desired information from databases using natural language queries.
Current state-of-the-art (SOTA) models like GPT4 and T5 have shown impressive performance on large-scale benchmarks like BIRD.
This paper proposes a novel approach that needs only SQL quality measurement to enhance Text-to-SQL performance.
arXiv Detail & Related papers (2024-10-02T17:21:51Z)
- E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL [1.187832944550453]
We introduce E-SQL, a novel pipeline designed to address challenges through direct schema linking and candidate predicate augmentation.
E-SQL enhances the natural language query by incorporating relevant database items (i.e. tables, columns, and values) and conditions directly into the question, bridging the gap between the query and the database structure.
We investigate the impact of schema filtering, a technique widely explored in previous work, and demonstrate its diminishing returns when applied alongside advanced large language models.
arXiv Detail & Related papers (2024-09-25T09:02:48Z)
- Synthesizing Text-to-SQL Data from Weak and Strong LLMs [68.69270834311259]
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks.
We introduce a synthetic data approach that combines data produced by larger, more powerful models with error information data generated by smaller, not well-aligned models.
arXiv Detail & Related papers (2024-08-06T15:40:32Z)
- RH-SQL: Refined Schema and Hardness Prompt for Text-to-SQL [1.734218686180302]
This paper introduces a method for Text-to-SQL based on Refined Schema and Hardness Prompt.
It reduces storage and training costs while maintaining performance.
Our experiments on the Spider dataset, specifically with large-scale LMs, achieved an exceptional execution accuracy (EX) of 82.6%.
arXiv Detail & Related papers (2024-06-13T14:04:34Z)
- CodeS: Towards Building Open-source Language Models for Text-to-SQL [42.11113113574589]
We introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B.
CodeS is a fully open language model, which achieves superior accuracy with much smaller parameter sizes.
We conduct comprehensive evaluations on multiple datasets, including the widely used Spider benchmark.
arXiv Detail & Related papers (2024-02-26T07:00:58Z)
- SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the SQL-PaLM framework for enhancing Text-to-SQL using large language models (LLMs).
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we delve deep in understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z)
- Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing [56.232873134174056]
One of the major challenges in text-to-SQL parsing is domain generalization, i.e., how to generalize well to unseen databases.
In this work, we explore ways to further augment the pre-trained text-to-text transformer model with specialized components for text-to-SQL parsing.
To this end, we propose a new architecture, GRAPHIX-T5, augmented by specially designed graph-aware layers.
arXiv Detail & Related papers (2023-01-18T13:29:05Z)
- N-Best Hypotheses Reranking for Text-To-SQL Systems [6.966624873109535]
The Text-to-SQL task maps natural language utterances to structured queries.
State-of-the-art (SOTA) systems rely on finetuning large, pre-trained language models.
Findings show significant potential improvements with reranking.
arXiv Detail & Related papers (2022-10-19T15:35:06Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)