Related papers: SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification

SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification

URL: http://arxiv.org/abs/2510.26840v1
Date: Thu, 30 Oct 2025 02:29:54 GMT
Title: SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification
Authors: Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu,
Abstract summary: We propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth queries.<n>A performance evaluation of ten Text-to-truth methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth.
Score: 9.733987594033907
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.

Related papers

SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints [9.733987594033907]
SpotIt+ is a tool for evaluating text-to-speech systems via bounded equivalence verification.<n>We introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation.<n> Experimental results on the BIRD dataset show that the mined constraints enable SpotIt+ to generate more realistic differentiating databases.
arXiv Detail & Related papers (2026-03-04T17:51:42Z)
Bridging Global Intent with Local Details: A Hierarchical Representation Approach for Semantic Validation in Text-to-SQL [30.78817492504152]
HERO is a hierarchical representation approach that integrates global intent and local details.<n>We employ a Nested Message Passing Neural Network (NMPNN) to capture inherent information in relational schema-guided semantics.<n>Our approach outperforms existing state-of-the-art methods, achieving an average 9.40% improvement of AUPRC and 12.35% of AUROC in identifying semantic inconsistencies.<n>It excels at detecting fine-grained semantic errors, provides large language models with more granular feedback, and ultimately enhances the reliability and interpretability of data querying platforms.
arXiv Detail & Related papers (2025-12-28T02:25:33Z)
Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation [54.53145282349042]
We introduce DSR-sourced, a textbfDual-textbfS textbfReasoning framework that models Text-to-context as an interaction between an adaptive context state and a progressive generation state.<n>Without any post-training or in-context examples, DSR-sourced achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on BIRD development set.
arXiv Detail & Related papers (2025-11-26T13:52:50Z)
SQLord: A Robust Enterprise Text-to-SQL Solution via Reverse Data Generation and Workflow Decomposition [8.468281360094181]
Existing frameworks, trained on open-source datasets, struggle with complex business logic.<n> evaluation methods often require annotated data environments, which are scarce in real-world scenarios.<n>We propose SQLord, an enterprise-level NL2 framework to address these challenges.<n>It has been successfully applied across multiple scenarios on the world's largest B2B e-commerce platform.
arXiv Detail & Related papers (2025-07-14T08:16:55Z)
Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement [1.392448435105643]
Text-to-s enables non-expert users to effortlessly retrieve desired information from databases using natural language queries. Current state-of-the-art (SOTA) models like GPT4 and T5 have shown impressive performance on large-scale benchmarks like BIRD. This paper proposed a novel approach that only needs SQL Quality to enhance Text-to-s performance.
arXiv Detail & Related papers (2024-10-02T17:21:51Z)
FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark [8.445403382578167]
This paper introduces FLEX (False-Lesscution EXecution), a novel approach to evaluating text-to-technical systems. Our metric improves agreement with human experts with comprehensive context and sophisticated criteria. This work contributes to a more accurate and nuanced evaluation of text-to-technical systems, potentially reshaping our understanding of state-of-the-art performance in this field.
arXiv Detail & Related papers (2024-09-24T01:40:50Z)
DAC: Decomposed Automation Correction for Text-to-SQL [51.48239006107272]
We introduce De Automation Correction (DAC), which corrects text-to-composed by decomposing entity linking and skeleton parsing. We show that our method improves performance by $3.7%$ on average of Spider, Bird, and KaggleDBQA compared with the baseline method.
arXiv Detail & Related papers (2024-08-16T14:43:15Z)
SQLPrompt: In-Context Text-to-SQL with Minimal Labeled Data [54.69489315952524]
"Prompt" is designed to improve the few-shot prompting capabilities of Text-to-LLMs. "Prompt" outperforms previous approaches for in-context learning with few labeled data by a large margin. We show that emphPrompt outperforms previous approaches for in-context learning with few labeled data by a large margin.
arXiv Detail & Related papers (2023-11-06T05:24:06Z)
Evaluating Cross-Domain Text-to-SQL Models and Benchmarks [7.388002745070808]
We study text-to- benchmarks and re-evaluate some of the top-performing models within these benchmarks. We find that attaining a perfect performance on these benchmarks is unfeasible due to the multiple interpretations that can be derived from the provided samples. A GPT4-based model surpasses the gold standard reference queries in the Spider benchmark in our human evaluation.
arXiv Detail & Related papers (2023-10-27T23:36:14Z)
SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the framework for enhancing Text-to- filtering using large language models (LLMs) With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses. With instruction fine-tuning, we delve deep in understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z)
UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-domain systems. It is composed of publicly available text-to-domain datasets and 29K databases. Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [110.97778888305506]
BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. BRIDGE attained state-of-the-art performance on popular cross-DB text-to- relational benchmarks. Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks.
arXiv Detail & Related papers (2020-12-23T12:33:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.