Related papers: Semantic Evaluation for Text-to-SQL with Distilled Test Suites

Semantic Evaluation for Text-to-SQL with Distilled Test Suites

URL: http://arxiv.org/abs/2010.02840v1
Date: Tue, 6 Oct 2020 16:04:12 GMT
Title: Semantic Evaluation for Text-to-SQL with Distilled Test Suites
Authors: Ruiqi Zhong, Tao Yu, Dan Klein
Abstract summary: We propose test suite accuracy to approximate accuracy for Text-to- semantic models. We use our proposed method to evaluate 21 models submitted to the Spider leader board and manually verify that our method is always correct on 100 examples.
Score: 46.42548219378393
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose test suite accuracy to approximate semantic accuracy for Text-to-SQL models. Our method distills a small test suite of databases that achieves high code coverage for the gold query from a large number of randomly generated databases. At evaluation time, it computes the denotation accuracy of the predicted queries on the distilled test suite, hence calculating a tight upper-bound for semantic accuracy efficiently. We use our proposed method to evaluate 21 models submitted to the Spider leader board and manually verify that our method is always correct on 100 examples. In contrast, the current Spider metric leads to a 2.5% false negative rate on average and 8.1% in the worst case, indicating that test suite accuracy is needed. Our implementation, along with distilled test suites for eleven Text-to-SQL datasets, is publicly available.

Related papers

RetrySQL: text-to-SQL training with retry data for self-correcting query generation [1.6707278580444538]
We introduce Retry, a new approach to training text-to-generation models.<n>We demonstrate that retry steps yield an improvement of up to 4 percentage points in both overall and challenging execution accuracy metrics.
arXiv Detail & Related papers (2025-07-03T11:00:49Z)
RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component.<n>Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z)
Calibrating LLMs for Text-to-SQL Parsing by Leveraging Sub-clause Frequencies [28.281517110365037]
We study the problem of providing a calibrated confidence score that conveys the likelihood of an output query being correct.<n>Our work is the first to establish a benchmark for post-hoc calibration of text-to- parsing.
arXiv Detail & Related papers (2025-05-27T01:01:55Z)
Text-to-SQL Calibration: No Need to Ask -- Just Rescale Model Probabilities [20.606333546028516]
We show that a straightforward baseline -- deriving confidence from the model's full-sequence probability -- outperforms recent methods. Our comprehensive evaluation, conducted across two widely-used Text-to-checking benchmarks and multiple architectures, provides valuable insights into the effectiveness of various calibration strategies.
arXiv Detail & Related papers (2024-11-23T19:20:24Z)
Context-Aware SQL Error Correction Using Few-Shot Learning -- A Novel Approach Based on NLQ, Error, and SQL Similarity [0.0]
This paper introduces a novel few-shot learning-based approach for error correction insql generation. It enhances the accuracy of generated queries by selecting the most suitable few-shot error correction examples for a given natural language question (NLQ) In experiments with the open-source dataset, the proposed model offers a 39.2% increase in fixing errors with no error correction and a 10% increase from a simple error correction method.
arXiv Detail & Related papers (2024-10-11T18:22:08Z)
FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark [8.445403382578167]
This paper introduces FLEX (False-Lesscution EXecution), a novel approach to evaluating text-to-technical systems. Our metric improves agreement with human experts with comprehensive context and sophisticated criteria. This work contributes to a more accurate and nuanced evaluation of text-to-technical systems, potentially reshaping our understanding of state-of-the-art performance in this field.
arXiv Detail & Related papers (2024-09-24T01:40:50Z)
SQLFixAgent: Towards Semantic-Accurate Text-to-SQL Parsing via Consistency-Enhanced Multi-Agent Collaboration [26.193588535592767]
We introduce a new consistency-enhanced multi-agent collaborative framework designed for detecting and repairing erroneous SQL. We evaluate our proposed framework on five Text-to-text benchmarks. Our method consistently enhances the performance of the baseline model. Our framework has a higher token efficiency compared to other advanced methods, making it more competitive.
arXiv Detail & Related papers (2024-06-19T09:57:19Z)
Neural Embeddings for Web Testing [49.66745368789056]
Existing crawlers rely on app-specific, threshold-based, algorithms to assess state equivalence. We propose WEBEMBED, a novel abstraction function based on neural network embeddings and threshold-free classifiers. Our evaluation on nine web apps shows that WEBEMBED outperforms state-of-the-art techniques by detecting near-duplicates more accurately.
arXiv Detail & Related papers (2023-06-12T19:59:36Z)
Learning Deep Semantics for Test Completion [46.842174440120196]
We formalize the novel task of test completion to automatically complete the next statement in a test method based on the context of prior statements and the code under test. We develop TeCo -- a deep learning model using code semantics for test completion.
arXiv Detail & Related papers (2023-02-20T18:53:56Z)
An ensemble meta-estimator to predict source code testability [1.4213973379473652]
The size of a test suite determines the test effort and cost, while the coverage measure indicates the test effectiveness. This paper offers a new equation to estimate testability regarding the size and coverage of a given test suite.
arXiv Detail & Related papers (2022-08-20T06:18:16Z)
Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset. We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
Identifying Statistical Bias in Dataset Replication [102.92137353938388]
We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy. After correcting for the identified statistical bias, only an estimated $3.6% pm 1.5%$ of the original $11.7% pm 1.0%$ accuracy drop remains unaccounted for.
arXiv Detail & Related papers (2020-05-19T17:48:32Z)
ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples. We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.