Semantic Evaluation for Text-to-SQL with Distilled Test Suites
- URL: http://arxiv.org/abs/2010.02840v1
- Date: Tue, 6 Oct 2020 16:04:12 GMT
- Title: Semantic Evaluation for Text-to-SQL with Distilled Test Suites
- Authors: Ruiqi Zhong, Tao Yu, Dan Klein
- Abstract summary: We propose test suite accuracy to approximate accuracy for Text-to- semantic models.
We use our proposed method to evaluate 21 models submitted to the Spider leader board and manually verify that our method is always correct on 100 examples.
- Score: 46.42548219378393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose test suite accuracy to approximate semantic accuracy for
Text-to-SQL models. Our method distills a small test suite of databases that
achieves high code coverage for the gold query from a large number of randomly
generated databases. At evaluation time, it computes the denotation accuracy of
the predicted queries on the distilled test suite, hence calculating a tight
upper-bound for semantic accuracy efficiently. We use our proposed method to
evaluate 21 models submitted to the Spider leader board and manually verify
that our method is always correct on 100 examples. In contrast, the current
Spider metric leads to a 2.5% false negative rate on average and 8.1% in the
worst case, indicating that test suite accuracy is needed. Our implementation,
along with distilled test suites for eleven Text-to-SQL datasets, is publicly
available.
Related papers
- ProbGate at EHRSQL 2024: Enhancing SQL Query Generation Accuracy through Probabilistic Threshold Filtering and Error Handling [0.0]
We introduce an entropy-based method to identify and filter out unanswerable results.
We experimentally verified that our method can filter unanswerable questions, which can be widely utilized.
arXiv Detail & Related papers (2024-04-25T14:55:07Z) - TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring [11.78795632771211]
We introduce a novel benchmark designed to evaluate text-to- reliability as a model's ability to correctly handle any type of input question.
We evaluate existing methods using a novel penalty-based scoring metric with two modeling approaches.
arXiv Detail & Related papers (2024-03-23T16:12:52Z) - Neural Embeddings for Web Testing [49.66745368789056]
Existing crawlers rely on app-specific, threshold-based, algorithms to assess state equivalence.
We propose WEBEMBED, a novel abstraction function based on neural network embeddings and threshold-free classifiers.
Our evaluation on nine web apps shows that WEBEMBED outperforms state-of-the-art techniques by detecting near-duplicates more accurately.
arXiv Detail & Related papers (2023-06-12T19:59:36Z) - Error Detection for Text-to-SQL Semantic Parsing [18.068244400731366]
Modern text-to- semantics are often over-confident, casting doubt on their trustworthiness when deployed for real use.
We propose a-independent error detection model for text-to- semantic parsing.
arXiv Detail & Related papers (2023-05-23T04:44:22Z) - Learning Deep Semantics for Test Completion [46.842174440120196]
We formalize the novel task of test completion to automatically complete the next statement in a test method based on the context of prior statements and the code under test.
We develop TeCo -- a deep learning model using code semantics for test completion.
arXiv Detail & Related papers (2023-02-20T18:53:56Z) - An ensemble meta-estimator to predict source code testability [1.4213973379473652]
The size of a test suite determines the test effort and cost, while the coverage measure indicates the test effectiveness.
This paper offers a new equation to estimate testability regarding the size and coverage of a given test suite.
arXiv Detail & Related papers (2022-08-20T06:18:16Z) - Double Perturbation: On the Robustness of Robustness and Counterfactual
Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z) - Identifying Statistical Bias in Dataset Replication [102.92137353938388]
We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy.
After correcting for the identified statistical bias, only an estimated $3.6% pm 1.5%$ of the original $11.7% pm 1.0%$ accuracy drop remains unaccounted for.
arXiv Detail & Related papers (2020-05-19T17:48:32Z) - ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.