FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark
- URL: http://arxiv.org/abs/2409.19014v4
- Date: Mon, 28 Oct 2024 11:11:04 GMT
- Title: FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark
- Authors: Heegyu Kim, Taeyang Jeon, Seunghwan Choi, Seungtaek Choi, Hyunsouk Cho
- Abstract summary: This paper introduces FLEX (False-Less EXecution), a novel approach to evaluating text-to-SQL systems.
Our metric improves agreement with human experts by using comprehensive context and sophisticated criteria.
This work contributes to a more accurate and nuanced evaluation of text-to-SQL systems, potentially reshaping our understanding of state-of-the-art performance in this field.
- Score: 8.445403382578167
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text-to-SQL systems have become crucial for translating natural language into SQL queries in various industries, enabling non-technical users to perform complex data operations. The need for accurate evaluation methods has increased as these systems have grown more sophisticated. However, Execution Accuracy (EX), the most prevalent evaluation metric, still yields many false positives and false negatives. Thus, this paper introduces FLEX (False-Less EXecution), a novel approach to evaluating text-to-SQL systems using large language models (LLMs) to emulate human expert-level evaluation of SQL queries. Our metric improves agreement with human experts (from 62 to 87.04 in Cohen's kappa) with comprehensive context and sophisticated criteria. Our extensive experiments yield several key insights: (1) Models' performance increases by over 2.6 points on average, substantially affecting rankings on Spider and BIRD benchmarks; (2) The underestimation of models in EX primarily stems from annotation quality issues; and (3) Model performance on particularly challenging questions tends to be overestimated. This work contributes to a more accurate and nuanced evaluation of text-to-SQL systems, potentially reshaping our understanding of state-of-the-art performance in this field.
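To make the EX failure modes concrete, below is a minimal sketch of how execution-based matching is commonly computed. The SQLite setup, the queries, and the order-insensitive row comparison are illustrative assumptions, not the official Spider/BIRD evaluator.

```python
# Minimal sketch of Execution Accuracy (EX): a prediction counts as
# correct iff it returns the same rows as the gold query. The database
# path and queries are hypothetical; real evaluators differ in how they
# normalize row order, duplicates, and column order.
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        gold = conn.execute(gold_sql).fetchall()
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # an unexecutable prediction is scored as wrong
    finally:
        conn.close()
    # Order-insensitive comparison of the two result sets; assumes the
    # returned row tuples are mutually comparable.
    return sorted(gold) == sorted(pred)
```

Under such a check, a false positive occurs when a semantically wrong query happens to return the gold rows on the particular database instance, and a false negative occurs when an acceptable query is penalized for a harmless difference (for example, returning one extra identifying column). FLEX instead has an LLM judge the predicted query against the question and schema, emulating a human expert. The agreement figures quoted in the abstract (62 to 87.04) are Cohen's kappa scaled by 100; a toy illustration with invented verdict labels:

```python
# Cohen's kappa between expert verdicts and an automatic metric's
# verdicts (labels here are invented for demonstration only).
from sklearn.metrics import cohen_kappa_score

human  = [1, 1, 0, 1, 0, 0, 1, 1]   # expert judgments (1 = query accepted)
metric = [1, 0, 0, 1, 0, 1, 1, 1]   # automatic metric's verdicts
print(round(100 * cohen_kappa_score(human, metric), 2))  # ~46.67 here
```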
Related papers
- Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement [1.392448435105643]
Text-to-SQL enables non-expert users to effortlessly retrieve desired information from databases using natural language queries.
Current state-of-the-art (SOTA) models like GPT-4 and T5 have shown impressive performance on large-scale benchmarks like BIRD.
This paper proposes a novel approach that needs only SQL Quality Measurement to enhance Text-to-SQL performance.
arXiv Detail & Related papers (2024-10-02T17:21:51Z) - Evaluating the Data Model Robustness of Text-to-SQL Systems Based on Real User Queries [4.141402725050671]
This paper presents the first in-depth evaluation of the data model robustness of Text-to-SQL systems in practice.
It is based on FootballDB, a real-world system deployed over a nine-month period in the context of the FIFA World Cup 2022.
All of our data is based on real user questions that were asked live to the system. We manually labeled and translated a subset of these questions for three different data models.
arXiv Detail & Related papers (2024-02-13T10:28:57Z) - Enhancing Text-to-SQL Translation for Financial System Design [5.248014305403357]
We consider Large Language Models (LLMs), which have achieved state-of-the-art results on various NLP tasks.
We propose two novel metrics designed to adequately measure the similarity between relational queries.
arXiv Detail & Related papers (2023-12-22T14:34:19Z) - Reboost Large Language Model-based Text-to-SQL, Text-to-Python, and
Text-to-Function -- with Real Applications in Traffic Domain [14.194710636073808]
Previous state-of-the-art (SOTA) method achieved remarkable execution accuracy on the Spider dataset.
We develop a more adaptable and more general prompting method, involving query rewriting and SQL boosting.
In terms of execution accuracy on the business dataset, the SOTA method scored 21.05, while our approach scored 65.79.
arXiv Detail & Related papers (2023-10-28T16:32:40Z) - Evaluating Cross-Domain Text-to-SQL Models and Benchmarks [7.388002745070808]
We study text-to-SQL benchmarks and re-evaluate some of the top-performing models within these benchmarks.
We find that attaining perfect performance on these benchmarks is infeasible due to the multiple interpretations that can be derived from the provided samples.
A GPT-4-based model surpasses the gold standard reference queries of the Spider benchmark in our human evaluation.
arXiv Detail & Related papers (2023-10-27T23:36:14Z) - SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces a framework for enhancing Text-to-SQL with large language models (LLMs).
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we delve deeply into understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Can LLM Already Serve as A Database Interface? A BIg Bench for
Large-Scale Database Grounded Text-to-SQLs [89.68522473384522]
We present BIRD, a big benchmark for large-scale database grounded text-to-SQL tasks.
Our emphasis on database values highlights the new challenges of dirty database contents.
Even the most effective text-to-SQL models, e.g., ChatGPT, achieve only 40.08% execution accuracy.
arXiv Detail & Related papers (2023-05-04T19:02:29Z) - Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for
Text-to-SQL Parsing [56.232873134174056]
One of the major challenges in text-to-SQL parsing is domain generalization, i.e., how to generalize well to unseen databases.
In this work, we explore ways to further augment the pre-trained text-to-text transformer model with specialized components for text-to-SQL parsing.
To this end, we propose a new architecture, GRAPHIX-T5, augmented with specially designed graph-aware layers.
arXiv Detail & Related papers (2023-01-18T13:29:05Z) - Towards Generalizable and Robust Text-to-SQL Parsing [77.18724939989647]
We propose a novel TKK framework consisting of Task decomposition, Knowledge acquisition, and Knowledge composition to learn text-to-SQL parsing in stages.
We show that our framework is effective in all scenarios and achieves state-of-the-art performance on the Spider, SParC, and CoSQL datasets.
arXiv Detail & Related papers (2022-10-23T09:21:27Z) - "What Do You Mean by That?" A Parser-Independent Interactive Approach
for Enhancing Text-to-SQL [49.85635994436742]
We include a human in the loop and present a novel parser-independent interactive approach (PIIA) that interacts with users using multi-choice questions.
PIIA is capable of enhancing text-to-SQL performance with limited interaction turns, as shown by both simulation and human evaluation.
arXiv Detail & Related papers (2020-11-09T02:14:33Z)