SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints
- URL: http://arxiv.org/abs/2603.04334v1
- Date: Wed, 04 Mar 2026 17:51:42 GMT
- Title: SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints
- Authors: Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu,
- Abstract summary: SpotIt+ is a tool for evaluating text-to-speech systems via bounded equivalence verification.<n>We introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation.<n> Experimental results on the BIRD dataset show that the mined constraints enable SpotIt+ to generate more realistic differentiating databases.
- Score: 9.733987594033907
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present SpotIt+, an open-source tool for evaluating Text-to-SQL systems via bounded equivalence verification. Given a generated SQL query and the ground truth, SpotIt+ actively searches for database instances that differentiate the two queries. To ensure that the generated counterexamples reflect practically relevant discrepancies, we introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation. Experimental results on the BIRD dataset show that the mined constraints enable SpotIt+ to generate more realistic differentiating databases, while preserving its ability to efficiently uncover numerous discrepancies between generated and gold SQL queries that are missed by standard test-based evaluation.
Related papers
- ErrorLLM: Modeling SQL Errors for Text-to-SQL Refinement [57.98138819417949]
We propose ErrorLLM, a framework that explicitly models text-to- querying.<n>We show that ErrorLLM achieves the most significant improvements over backbone initial generation.<n>ErrorLLM addresses both sides by high detection F1 score while maintaining refinement effectiveness.
arXiv Detail & Related papers (2026-03-04T05:27:20Z) - APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL [39.76924093980244]
APEX- verbalize is a framework that shifts the paradigm from passive translation to agentic exploration.<n>Our framework employs a hypothesis-verification loop to ground model reasoning in real data.
arXiv Detail & Related papers (2026-02-11T07:50:47Z) - Companion Agents: A Table-Information Mining Paradigm for Text-to-SQL [8.159121916366727]
Large-scale Text-to-curated benchmarks such as BIRD typically assume complete and accurate database annotations as well as available external knowledge.<n>This mismatch substantially limits the real-world applicability of state-of-the-domain Text-to-art systems.<n>We propose a database-centric approach that leverages intrinsic, fine-grained information residing in relational databases to construct missing evidence.
arXiv Detail & Related papers (2025-12-17T07:11:55Z) - FloodSQL-Bench: A Retrieval-Augmented Benchmark for Geospatially-Grounded Text-to-SQL [4.973502845481286]
FLOOD-BENCH is a benchmark for the flood management domain that integrates heterogeneous datasets through key-based, spatial, and hybrid joins.<n>The benchmark captures realistic flood-related information needs by combining social, infrastructural, and hazard data layers.
arXiv Detail & Related papers (2025-12-12T23:25:00Z) - Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation [54.53145282349042]
We introduce DSR-sourced, a textbfDual-textbfS textbfReasoning framework that models Text-to-context as an interaction between an adaptive context state and a progressive generation state.<n>Without any post-training or in-context examples, DSR-sourced achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on BIRD development set.
arXiv Detail & Related papers (2025-11-26T13:52:50Z) - SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification [9.733987594033907]
We propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth queries.<n>A performance evaluation of ten Text-to-truth methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth.
arXiv Detail & Related papers (2025-10-30T02:29:54Z) - RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component.<n>Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z) - Tool-Assisted Agent on SQL Inspection and Refinement in Real-World Scenarios [28.55596803781757]
Database mismatches are more prevalent in real-world scenarios.
We introduce Spider-Mismatch, a new dataset constructed to reflect the condition mismatch problems encountered in real-world scenarios.
Our method achieves the highest performance on the averaged results of the Spider and Spider-Realistic datasets in few-shot settings.
arXiv Detail & Related papers (2024-08-30T03:38:37Z) - DAC: Decomposed Automation Correction for Text-to-SQL [51.48239006107272]
We introduce De Automation Correction (DAC), which corrects text-to-composed by decomposing entity linking and skeleton parsing.
We show that our method improves performance by $3.7%$ on average of Spider, Bird, and KaggleDBQA compared with the baseline method.
arXiv Detail & Related papers (2024-08-16T14:43:15Z) - Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation [76.76046657162306]
Large language models (LLMs) have emerged as a new paradigm for Text-to- task.
Large language models (LLMs) have emerged as a new paradigm for Text-to- task.
arXiv Detail & Related papers (2023-08-29T14:59:54Z) - SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-dependence by exploring the intrinsic uncertainties in the neural network based approaches (called SUN)
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.