Semantic Evaluation for Text-to-SQL with Distilled Test Suites
- URL: http://arxiv.org/abs/2010.02840v1
- Date: Tue, 6 Oct 2020 16:04:12 GMT
- Title: Semantic Evaluation for Text-to-SQL with Distilled Test Suites
- Authors: Ruiqi Zhong, Tao Yu, Dan Klein
- Abstract summary: We propose test suite accuracy to approximate semantic accuracy for Text-to-SQL models.
We use our proposed method to evaluate 21 models submitted to the Spider leader board and manually verify that our method is always correct on 100 examples.
- Score: 46.42548219378393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose test suite accuracy to approximate semantic accuracy for
Text-to-SQL models. Our method distills a small test suite of databases that
achieves high code coverage for the gold query from a large number of randomly
generated databases. At evaluation time, it computes the denotation accuracy of
the predicted queries on the distilled test suite, hence calculating a tight
upper-bound for semantic accuracy efficiently. We use our proposed method to
evaluate 21 models submitted to the Spider leader board and manually verify
that our method is always correct on 100 examples. In contrast, the current
Spider metric leads to a 2.5% false negative rate on average and 8.1% in the
worst case, indicating that test suite accuracy is needed. Our implementation,
along with distilled test suites for eleven Text-to-SQL datasets, is publicly
available.
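The method has two pieces: offline distillation of a small suite of databases, and a cheap pass/fail check at evaluation time. Below is a minimal Python sketch of both, assuming SQLite database files and a precomputed list of "neighbor" queries (plausible wrong variants of the gold query) as the coverage signal; the helper names are illustrative, not the authors' released API.

```python
# A minimal sketch, not the authors' released implementation. Assumes
# SQLite database files and "neighbor" queries -- plausible wrong variants
# of the gold query -- used to decide which random databases to keep.
import sqlite3

def denotation(query: str, db_path: str):
    """Execute a query and return its result, order-normalized."""
    with sqlite3.connect(db_path) as conn:
        return sorted(tuple(row) for row in conn.execute(query).fetchall())

def distill(gold: str, neighbors: list[str], random_dbs: list[str]) -> list[str]:
    """Keep only databases on which the gold query is distinguished from at
    least one neighbor; the survivors form the small distilled test suite."""
    suite = []
    for db in random_dbs:
        gold_res = denotation(gold, db)
        if any(denotation(n, db) != gold_res for n in neighbors):
            suite.append(db)
    return suite

def test_suite_correct(pred: str, gold: str, suite: list[str]) -> bool:
    """Tight upper bound on semantic equivalence: the prediction passes iff
    its denotation matches the gold query's on every suite database."""
    try:
        return all(denotation(pred, db) == denotation(gold, db) for db in suite)
    except sqlite3.Error:
        return False  # unexecutable predictions count as wrong
```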
Related papers
- Text-to-SQL Calibration: No Need to Ask -- Just Rescale Model Probabilities [20.606333546028516]
We show that a straightforward baseline -- deriving confidence from the model's full-sequence probability -- outperforms recent methods.
Our comprehensive evaluation, conducted across two widely-used Text-to-SQL benchmarks and multiple architectures, provides valuable insights into the effectiveness of various calibration strategies.
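A minimal sketch of that baseline, assuming per-token log-probabilities are available from the model and using Platt scaling as one plausible rescaler (the specific rescaling method is an assumption here, not taken from the paper):

```python
# Sketch: confidence from full-sequence probability, rescaled on a dev set.
# `dev_logprobs` / `dev_correct` are assumed precomputed: sequence log-probs
# and 0/1 execution-correctness labels for dev-set predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sequence_logprob(token_logprobs: list[float]) -> float:
    """Full-sequence log-probability: the sum of per-token log-probs."""
    return float(np.sum(token_logprobs))

def fit_rescaler(dev_logprobs: np.ndarray, dev_correct: np.ndarray):
    """Platt scaling: logistic regression from log-prob to P(correct)."""
    clf = LogisticRegression().fit(dev_logprobs.reshape(-1, 1), dev_correct)
    return lambda lp: float(clf.predict_proba([[lp]])[0, 1])
```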
arXiv Detail & Related papers (2024-11-23T19:20:24Z)
- Context-Aware SQL Error Correction Using Few-Shot Learning -- A Novel Approach Based on NLQ, Error, and SQL Similarity [0.0]
This paper introduces a novel few-shot learning-based approach for error correction in SQL generation.
It enhances the accuracy of generated queries by selecting the most suitable few-shot error-correction examples for a given natural language question (NLQ).
In experiments on an open-source dataset, the proposed model fixes 39.2% more errors than a pipeline with no error correction and 10% more than a simple error-correction baseline.
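A minimal sketch of the example-selection step, collapsing the paper's combined NLQ/error/SQL similarity into a single embedding cosine similarity over a concatenated string (the embedding model and pool format here are assumptions):

```python
# Sketch: pick the k most similar past correction records as few-shot
# examples. `pool` holds hypothetical records like
# {"nlq": ..., "error": ..., "sql": ..., "fixed_sql": ...}.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def select_examples(nlq: str, error: str, sql: str,
                    pool: list[dict], k: int = 3) -> list[dict]:
    q = model.encode(f"{nlq} || {error} || {sql}")
    def cos(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    scored = [(cos(model.encode(f"{ex['nlq']} || {ex['error']} || {ex['sql']}")), ex)
              for ex in pool]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [ex for _, ex in scored[:k]]
```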
arXiv Detail & Related papers (2024-10-11T18:22:08Z)
- FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark [8.445403382578167]
This paper introduces FLEX (False-Less EXecution), a novel approach to evaluating text-to-SQL systems.
Our metric improves agreement with human experts by judging queries with comprehensive context and sophisticated criteria.
This work contributes to a more accurate and nuanced evaluation of text-to-SQL systems, potentially reshaping our understanding of state-of-the-art performance in this field.
arXiv Detail & Related papers (2024-09-24T01:40:50Z)
- Neural Embeddings for Web Testing [49.66745368789056]
Existing crawlers rely on app-specific, threshold-based algorithms to assess state equivalence.
We propose WEBEMBED, a novel abstraction function based on neural network embeddings and threshold-free classifiers.
Our evaluation on nine web apps shows that WEBEMBED outperforms state-of-the-art techniques by detecting near-duplicates more accurately.
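A minimal sketch of the threshold-free idea: featurize a pair of page embeddings and let a trained classifier decide equivalence, instead of tuning a distance cutoff. The toy embeddings below only stand in for a neural page encoder:

```python
# Sketch of threshold-free near-duplicate detection over page embeddings.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_features(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Symmetric features of an embedding pair (no distance threshold)."""
    return np.concatenate([np.abs(a - b), a * b])

# Toy stand-ins for neural page embeddings, just to make the sketch runnable.
rng = np.random.default_rng(0)
emb = rng.normal(size=(40, 16))
pos = [(emb[i], emb[i] + rng.normal(scale=0.05, size=16)) for i in range(20)]
neg = [(emb[i], emb[i + 20]) for i in range(20)]
X = np.stack([pair_features(a, b) for a, b in pos + neg])
y = np.array([1] * 20 + [0] * 20)  # 1 = near-duplicate pair

clf = RandomForestClassifier(n_estimators=100).fit(X, y)

def same_state(a: np.ndarray, b: np.ndarray) -> bool:
    """The classifier, not a hand-tuned threshold, decides equivalence."""
    return bool(clf.predict(pair_features(a, b).reshape(1, -1))[0])
```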
arXiv Detail & Related papers (2023-06-12T19:59:36Z)
- Learning Deep Semantics for Test Completion [46.842174440120196]
We formalize the novel task of test completion to automatically complete the next statement in a test method based on the context of prior statements and the code under test.
We develop TeCo -- a deep learning model using code semantics for test completion.
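The task itself is easy to frame as conditional generation: prompt a code model with the code under test plus the test statements written so far, and decode the next statement. A tiny sketch with a generic public checkpoint (not TeCo itself):

```python
# Sketch of test completion as left-to-right generation; the checkpoint is
# a generic code model used for illustration, not the TeCo model.
from transformers import pipeline

generator = pipeline("text-generation", model="Salesforce/codegen-350M-mono")

code_under_test = "def add(a, b):\n    return a + b\n"
test_so_far = "def test_add():\n    result = add(2, 3)\n"
prompt = code_under_test + "\n" + test_so_far + "    "
completion = generator(prompt, max_new_tokens=16)[0]["generated_text"]
print(completion[len(prompt):].splitlines()[0])  # the predicted next statement
```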
arXiv Detail & Related papers (2023-02-20T18:53:56Z)
- An ensemble meta-estimator to predict source code testability [1.4213973379473652]
The size of a test suite determines the test effort and cost, while the coverage measure indicates the test effectiveness.
This paper offers a new equation to estimate testability as a function of the size and coverage of a given test suite.
arXiv Detail & Related papers (2022-08-20T06:18:16Z)
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
- Identifying Statistical Bias in Dataset Replication [102.92137353938388]
We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy.
After correcting for the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for.
arXiv Detail & Related papers (2020-05-19T17:48:32Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)