Related papers: Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload

URL: http://arxiv.org/abs/2407.19517v1
Date: Sun, 28 Jul 2024 15:53:05 GMT
Title: Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload
Authors: Limin Ma, Ken Pu, Ying Zhu
Abstract summary: TPC-DS queries exhibit a significantly higher level of structural complexity compared to the other two benchmarks. Findings indicate that the current state-of-the-art generative AI models fall short in generating accurate decision-making queries. Results demonstrated that the accuracy of the generated queries is insufficient for practical real-world application.
Score: 1.2738020945091273
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study presents a comparative analysis of a complex SQL benchmark, TPC-DS, with two existing text-to-SQL benchmarks, BIRD and Spider. Our findings reveal that TPC-DS queries exhibit a significantly higher level of structural complexity compared to the other two benchmarks. This underscores the need for more intricate benchmarks to simulate realistic scenarios effectively. To facilitate this comparison, we devised several measures of structural complexity and applied them across all three benchmarks. The results of this study can guide future research in the development of more sophisticated text-to-SQL benchmarks. We utilized 11 distinct large language models (LLMs) to generate SQL queries based on the query descriptions provided by the TPC-DS benchmark. The prompt engineering process incorporated both the query description as outlined in the TPC-DS specification and the database schema of TPC-DS. Our findings indicate that the current state-of-the-art generative AI models fall short in generating accurate decision-making queries. We conducted a comparison of the generated queries with the TPC-DS gold standard queries using a series of fuzzy structure matching techniques based on query features. The results demonstrated that the accuracy of the generated queries is insufficient for practical real-world application.
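
The abstract mentions comparing generated queries with gold standard queries through fuzzy structure matching based on query features, but does not list the features or the matching formula. The following is only a minimal sketch of the general idea, not the authors' method: the feature set in FEATURE_PATTERNS, the keyword-counting approach, and the overlap ratio in feature_similarity are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical structural features; the paper's actual feature list
# is not given in the abstract.
FEATURE_PATTERNS = {
    "select": r"\bselect\b",
    "join": r"\bjoin\b",
    "where": r"\bwhere\b",
    "group_by": r"\bgroup\s+by\b",
    "order_by": r"\border\s+by\b",
    "with_cte": r"\bwith\b",
    "aggregate": r"\b(?:sum|avg|count|min|max)\s*\(",
}


def query_features(sql: str) -> Counter:
    """Count how often each structural feature occurs in a SQL string."""
    text = sql.lower()
    return Counter(
        {name: len(re.findall(pattern, text))
         for name, pattern in FEATURE_PATTERNS.items()}
    )


def feature_similarity(generated: str, gold: str) -> float:
    """Return a score in [0, 1]: shared feature counts over the larger counts.

    This is an assumed similarity measure (a weighted Jaccard ratio),
    chosen for simplicity.
    """
    a, b = query_features(generated), query_features(gold)
    overlap = sum(min(a[k], b[k]) for k in FEATURE_PATTERNS)
    total = sum(max(a[k], b[k]) for k in FEATURE_PATTERNS)
    return 1.0 if total == 0 else overlap / total


gold_sql = (
    "SELECT c.name, SUM(s.price) FROM sales s "
    "JOIN customer c ON s.cid = c.id "
    "WHERE s.year = 2000 GROUP BY c.name"
)
generated_sql = "SELECT name FROM customer WHERE id = 1"
score = feature_similarity(generated_sql, gold_sql)
```

Keyword counting ignores string literals, comments, and nesting depth, so a real comparison would more likely work on a parsed query tree. The sketch only shows how a generated query that omits joins and aggregation can score low against a structurally richer gold query.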

Related papers

LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations. This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z)
Rationalization Models for Text-to-SQL [13.792561265515003]
We introduce a framework for generating Chain-of-Thought (CoT) rationales to enhance text-to-SQL model fine-tuning. The process begins with manually annotating a small set of examples, which are then used to prompt a large language model. A rationalization model is subsequently trained on the validated queries, enabling extensive synthetic CoT annotations.
arXiv Detail & Related papers (2025-02-10T18:38:57Z)
Text-to-SQL based on Large Language Models and Database Keyword Search [0.0]
This paper proposes a strategy to compile Natural Language (NL) questions into SQL queries. The strategy incorporates a dynamic few-shot examples strategy and leverages the services provided by a database keyword search (KwS) platform. Experiments show that the strategy achieves an accuracy on the real-world relational database that surpasses state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-23T12:03:29Z)
An Actor-Critic Approach to Boosting Text-to-SQL Large Language Model [7.01795534825797]
We propose a simple, general, and performance-guaranteed T2S enhancement approach called Actor-Critic (AC). We design two roles using the same large language model (LLM): an Actor to produce SQL queries and a Critic to evaluate the produced SQL. If the Critic believes the produced SQL is wrong, it notifies the Actor to reproduce the SQL and perform evaluation again. We conducted extensive experiments on the Spider and related datasets with eleven LLMs, and demonstrated that the Actor-Critic method consistently improves the performance of T2S.
arXiv Detail & Related papers (2024-10-28T15:22:35Z)
CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL [9.47170756607886]
CHASE-SQL is a new framework that employs innovative strategies, using test-time compute in multi-agent modeling to improve candidate generation and selection. To identify the best candidate, a selection agent is employed to rank the candidates through pairwise comparisons with a fine-tuned binary-candidates selection LLM. Overall, our proposed CHASE-SQL achieves the state-of-the-art execution accuracy of 73.0% and 73.01% on the test set and development set of the notable BIRD Text-to-SQL dataset benchmark.
arXiv Detail & Related papers (2024-10-02T18:41:35Z)
Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement [1.392448435105643]
Text-to-SQL enables non-expert users to effortlessly retrieve desired information from databases using natural language queries. Current state-of-the-art (SOTA) models like GPT4 and T5 have shown impressive performance on large-scale benchmarks like BIRD. This paper proposes a novel approach that only needs SQL Quality to enhance Text-to-SQL performance.
arXiv Detail & Related papers (2024-10-02T17:21:51Z)
UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics. We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation [10.726734105960924]
Large language models (LLMs) have enabled in-context learning (ICL)-based methods that significantly outperform fine-tuning approaches for text-to-SQL tasks. This study considers the sensitivity of LLMs to the prompts and introduces a novel approach that leverages multiple prompts to explore a broader search space for possible answers. We establish a new SOTA performance on the BIRD in terms of both the accuracy and efficiency of the generated queries.
arXiv Detail & Related papers (2024-05-13T04:59:32Z)
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STARK, a large-scale semi-structured retrieval benchmark on Textual and Relational Knowledge Bases. Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)
Structure Guided Large Language Model for SQL Generation [14.079764882536077]
We propose a novel structure-aware text-to-SQL framework (SGU). SGU consistently outperforms state-of-the-art text-to-SQL models.
arXiv Detail & Related papers (2024-02-19T09:07:59Z)
Semantic Decomposition of Question and SQL for Text-to-SQL Parsing [2.684900573255764]
We propose a new modular Query Plan Language (QPL) that systematically decomposes SQL queries into simple and regular sub-queries. Experimental results demonstrate that text-to-QPL is more effective than text-to-SQL for semantically equivalent queries.
arXiv Detail & Related papers (2023-10-20T15:13:34Z)
SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces a framework for enhancing Text-to-SQL using large language models (LLMs). With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses. With instruction fine-tuning, we delve deep into understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z)
UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems. It is composed of publicly available text-to-SQL datasets and 29K databases. Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and uses schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z)
Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing [66.55478402233399]
We propose a framework to elicit relational structures via a probing procedure based on the Poincaré distance metric. Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences. Our framework sets new state-of-the-art performance on three benchmarks.
arXiv Detail & Related papers (2022-06-28T14:05:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.