Evaluating NL2SQL via SQL2NL
- URL: http://arxiv.org/abs/2509.04657v1
- Date: Thu, 04 Sep 2025 21:03:59 GMT
- Title: Evaluating NL2SQL via SQL2NL
- Authors: Mohammadtaher Safarzadeh, Afshin Oroojlooyjadid, Dan Roth
- Abstract summary: New framework generates semantically equivalent, lexically diverse queries. State-of-the-art models are far more brittle than standard benchmarks suggest.
- Score: 45.88028371034407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust evaluation in the presence of linguistic variation is key to understanding the generalization capabilities of Natural Language to SQL (NL2SQL) models, yet existing benchmarks rarely address this factor in a systematic or controlled manner. We propose a novel schema-aligned paraphrasing framework that leverages SQL-to-NL (SQL2NL) to automatically generate semantically equivalent, lexically diverse queries while maintaining alignment with the original schema and intent. This enables the first targeted evaluation of NL2SQL robustness to linguistic variation in isolation, distinct from prior work that primarily investigates ambiguity or schema perturbations. Our analysis reveals that state-of-the-art models are far more brittle than standard benchmarks suggest. For example, LLaMa3.3-70B exhibits a 10.23% drop in execution accuracy (from 77.11% to 66.9%) on paraphrased Spider queries, while LLaMa3.1-8B suffers an even larger drop of nearly 20% (from 62.9% to 42.5%). Smaller models (e.g., GPT-4o mini) are disproportionately affected. We also find that robustness degradation varies significantly with query complexity, dataset, and domain, highlighting the need for evaluation frameworks that explicitly measure linguistic generalization to ensure reliable performance in real-world settings.
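The robustness measurement described in the abstract can be illustrated with a minimal sketch: compute execution accuracy once for SQL predicted from the original questions and once for SQL predicted from paraphrased questions, then report the difference. Everything below is an assumption made for illustration and is not taken from the paper: the toy `singer` table, the hand-written "predicted" queries, the helper names `run_query` and `execution_accuracy`, and the choice to compare result rows as order-insensitive multisets.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        if gold is not None and predicted == gold:
            correct += 1
    return correct / len(pairs)


# Hypothetical toy database standing in for a benchmark schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Bo', 45), (3, 'Cy', 27);
    """
)

gold_names = "SELECT name FROM singer WHERE age > 30"
gold_count = "SELECT COUNT(*) FROM singer"

# Hypothetical model outputs for the original questions.
original_pairs = [
    ("SELECT name FROM singer WHERE age > 30", gold_names),
    ("SELECT COUNT(id) FROM singer", gold_count),
]

# Hypothetical model outputs for paraphrases of the same questions;
# the first prediction now uses the wrong threshold.
paraphrased_pairs = [
    ("SELECT name FROM singer WHERE age > 40", gold_names),
    ("SELECT COUNT(*) FROM singer", gold_count),
]

acc_original = execution_accuracy(conn, original_pairs)
acc_paraphrased = execution_accuracy(conn, paraphrased_pairs)
drop = acc_original - acc_paraphrased

print(f"original: {acc_original:.2%}")
print(f"paraphrased: {acc_paraphrased:.2%}")
print(f"drop: {drop:.2%}")
```

Because the gold SQL is held fixed and only the question wording changes, any drop in this setup is attributable to the paraphrase rather than to schema or intent changes. Comparing rows as multisets ignores ordering, which would be too lenient for queries with `ORDER BY`.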
Related papers
- DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction [46.422626657078666]
We present DeKeyNLU, a novel dataset which contains 1,500 meticulously annotated QA pairs. We propose DeKeySQL, a RAG-based NL2SQL pipeline that employs three separate modules for user question understanding, entity retrieval, and generation.
arXiv Detail & Related papers (2025-09-18T00:47:56Z)
- GBV-SQL: Guided Generation and SQL2Text Back-Translation Validation for Multi-Agent Text2SQL [12.455525963127497]
GBV-SQL is a novel multi-agent framework that introduces Guided Generation with SQL2Text Back-translation Validation. This mechanism uses a specialized agent to translate the generated SQL back into natural language, which verifies its logical alignment with the original question. We introduce a formal typology for "Gold Errors", which are pervasive flaws in the ground truth, and demonstrate how they obscure true model performance.
arXiv Detail & Related papers (2025-09-16T03:21:12Z)
- RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component. Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z)
- Grounding Natural Language to SQL Translation with Data-Based Self-Explanations [7.4643285253289475]
CycleSQL is a framework designed for end-to-end translation models to autonomously generate the best output through self-evaluation. The main idea is to introduce data-grounded NL explanations as self-provided feedback, and use the feedback to validate the correctness of translation. The results show that the feedback loop introduced in CycleSQL consistently improves the performance of existing models; in particular, applying CycleSQL to RESDSQL yields a translation accuracy of 82.0% (+2.6%) on the validation set and 81.6% (+3.2%) on the test set benchmark.
arXiv Detail & Related papers (2024-11-05T09:44:53Z)
- RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction.
Benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o.
Our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting DeepSeek (much cheaper) with the same intact prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z)
- ETM: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models [8.618945530676614]
Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM) suffer from inherent limitations that can misrepresent performance. We introduce a new metric, Enhanced Tree Matching (ETM), which mitigates these issues by comparing queries using both syntactic and semantic elements. We show that EXE and ESM can produce false positive and negative rates as high as 23.0% and 28.9%, while ETM reduces these rates to 0.3% and 2.7%, respectively.
arXiv Detail & Related papers (2024-07-10T02:20:19Z)
- Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness [115.66421993459663]
Recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations.
We propose a comprehensive robustness benchmark based on Spider to diagnose the model.
We conduct a diagnostic study of the state-of-the-art models on the robustness set.
arXiv Detail & Related papers (2023-01-21T03:57:18Z)
- Holistic Evaluation of Language Models [183.94891340168175]
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood.
We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
arXiv Detail & Related papers (2022-11-16T18:51:34Z)
- SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural network based approaches (called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z)