ETM: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models
- URL: http://arxiv.org/abs/2407.07313v3
- Date: Wed, 12 Feb 2025 17:20:56 GMT
- Title: ETM: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models
- Authors: Benjamin G. Ascoli, Yasoda Sai Ram Kandikonda, Jinho D. Choi,
- Abstract summary: Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM) suffer from inherent limitations that can misrepresent performance.
We introduce a new metric, Enhanced Tree Matching (ETM), which mitigates these issues by comparing queries using both syntactic and semantic elements.
We show that ETM and ESM can produce false positive and negative rates as high as 23.0% and 28.9%, while ETM reduces these rates to 0.3% and 2.7%, respectively.
- Score: 8.618945530676614
- License:
- Abstract: The task of Text-to-SQL enables anyone to retrieve information from SQL databases using natural language. While this task has made substantial progress, the two primary evaluation metrics -- Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM) -- suffer from inherent limitations that can misrepresent performance. Specifically, ESM's rigid matching overlooks semantically correct but stylistically different queries, whereas EXE can overestimate correctness by ignoring structural errors that yield correct outputs. These shortcomings become especially problematic when assessing outputs from large language model (LLM)-based approaches without fine-tuning, which vary more in style and structure compared to their fine-tuned counterparts. Thus, we introduce a new metric, Enhanced Tree Matching (ETM), which mitigates these issues by comparing queries using both syntactic and semantic elements. Through evaluating nine LLM-based models, we show that EXE and ESM can produce false positive and negative rates as high as 23.0% and 28.9%, while ETM reduces these rates to 0.3% and 2.7%, respectively. We release our ETM script as open source, offering the community a more robust and reliable approach to evaluating Text-to-SQL.
Related papers
- Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement [1.392448435105643]
Text-to-s enables non-expert users to effortlessly retrieve desired information from databases using natural language queries.
Current state-of-the-art (SOTA) models like GPT4 and T5 have shown impressive performance on large-scale benchmarks like BIRD.
This paper proposed a novel approach that only needs SQL Quality to enhance Text-to-s performance.
arXiv Detail & Related papers (2024-10-02T17:21:51Z) - FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark [8.445403382578167]
This paper introduces FLEX (False-Lesscution EXecution), a novel approach to evaluating text-to-technical systems.
Our metric improves agreement with human experts with comprehensive context and sophisticated criteria.
This work contributes to a more accurate and nuanced evaluation of text-to-technical systems, potentially reshaping our understanding of state-of-the-art performance in this field.
arXiv Detail & Related papers (2024-09-24T01:40:50Z) - SelECT-SQL: Self-correcting ensemble Chain-of-Thought for Text-to-SQL [3.422309388045878]
We introduce SelECT-, a novel in-context learning solution that uses an algorithmic combination of chain-of-thought, self-correction, and ensemble methods.
Specifically, when configured using GPT as the base LLM, SelECT-Turbo achieves 84.2% execution accuracy on the Spider leaderboard's development set.
arXiv Detail & Related papers (2024-09-16T05:40:18Z) - DAC: Decomposed Automation Correction for Text-to-SQL [51.48239006107272]
We introduce De Automation Correction (DAC), which corrects text-to-composed by decomposing entity linking and skeleton parsing.
We show that our method improves performance by $3.7%$ on average of Spider, Bird, and KaggleDBQA compared with the baseline method.
arXiv Detail & Related papers (2024-08-16T14:43:15Z) - Machine Translation Meta Evaluation through Translation Accuracy
Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation [76.76046657162306]
Large language models (LLMs) have emerged as a new paradigm for Text-to- task.
Large language models (LLMs) have emerged as a new paradigm for Text-to- task.
arXiv Detail & Related papers (2023-08-29T14:59:54Z) - SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the framework for enhancing Text-to- filtering using large language models (LLMs)
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we delve deep in understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z) - Improving Text-to-SQL Semantic Parsing with Fine-grained Query
Understanding [84.04706075621013]
We present a general-purpose, modular neural semantic parsing framework based on token-level fine-grained query understanding.
Our framework consists of three modules: named entity recognizer (NER), neural entity linker (NEL) and neural entity linker (NSP)
arXiv Detail & Related papers (2022-09-28T21:00:30Z) - SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-dependence by exploring the intrinsic uncertainties in the neural network based approaches (called SUN)
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.