Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL
Robustness
- URL: http://arxiv.org/abs/2301.08881v1
- Date: Sat, 21 Jan 2023 03:57:18 GMT
- Title: Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL
Robustness
- Authors: Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu,
Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien,
Steve Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng,
Bing Xiang
- Abstract summary: Recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations.
We propose a comprehensive robustness benchmark based on Spider to diagnose model robustness.
We conduct a diagnostic study of state-of-the-art models on the robustness set.
- Score: 115.66421993459663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural text-to-SQL models have achieved remarkable performance in translating
natural language questions into SQL queries. However, recent studies reveal
that text-to-SQL models are vulnerable to task-specific perturbations. Previous
curated robustness test sets usually focus on individual phenomena. In this
paper, we propose a comprehensive robustness benchmark based on Spider, a
cross-domain text-to-SQL benchmark, to diagnose the model robustness. We design
17 perturbations on databases, natural language questions, and SQL queries to
measure the robustness from different angles. In order to collect more
diversified natural question perturbations, we utilize large pretrained
language models (PLMs) to simulate human behaviors in creating natural
questions. We conduct a diagnostic study of the state-of-the-art models on the
robustness set. Experimental results reveal that even the most robust model
suffers from a 14.0% performance drop overall and a 50.7% performance drop on
the most challenging perturbation. We also present a breakdown analysis
regarding text-to-SQL model designs and provide insights for improving model
robustness.
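To make the headline numbers concrete, below is a minimal sketch of how such a robustness drop can be computed: accuracy on the original Spider examples versus accuracy on their perturbed counterparts, reported as a relative drop. The function names, perturbation labels, data layout, and plain string matching are illustrative assumptions, not Dr.Spider's released evaluation code, which relies on execution- or set-match-based accuracy.

```python
from typing import Dict, List

def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predicted SQL queries that match the gold queries.
    Plain string equality is a stand-in for Spider's exact-set-match metric."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def relative_drop(pre_acc: float, post_acc: float) -> float:
    """Relative performance drop (%) from the original set to the perturbed set."""
    return 100.0 * (pre_acc - post_acc) / pre_acc

# Hypothetical accuracies for one model on two perturbation types.
results: Dict[str, Dict[str, float]] = {
    "DB-schema-synonym": {"pre": 0.81, "post": 0.65},
    "NLQ-paraphrase": {"pre": 0.81, "post": 0.40},
}
for perturbation, acc in results.items():
    drop = relative_drop(acc["pre"], acc["post"])
    print(f"{perturbation}: {drop:.1f}% relative drop")
```

Under a convention like this, the 14.0% and 50.7% figures above would correspond to the overall and worst-case per-perturbation drops, respectively.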
Related papers
- TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring [11.78795632771211]
We introduce a novel benchmark designed to evaluate text-to-SQL reliability as a model's ability to correctly handle any type of input question.
We evaluate existing methods using a novel penalty-based scoring metric with two modeling approaches (a generic sketch of penalty-based scoring appears after the list below).
arXiv Detail & Related papers (2024-03-23T16:12:52Z)
- CodeS: Towards Building Open-source Language Models for Text-to-SQL [42.11113113574589]
We introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B.
CodeS is a fully open language model, which achieves superior accuracy with much smaller parameter sizes.
We conduct comprehensive evaluations on multiple datasets, including the widely used Spider benchmark.
arXiv Detail & Related papers (2024-02-26T07:00:58Z)
- Improving Generalization in Semantic Parsing by Increasing Natural Language Variation [67.13483734810852]
In this work, we use data augmentation to enhance the robustness of text-to-SQL semantic parsing.
We leverage the capabilities of large language models to generate more realistic and diverse questions.
Using only a few prompts, we achieve a two-fold increase in the number of questions in Spider.
arXiv Detail & Related papers (2024-02-13T18:48:23Z)
- Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
- Towards Robustness of Text-to-SQL Models Against Natural and Realistic Adversarial Table Perturbation [38.00832631674398]
We introduce the Adversarial Table Perturbation (ATP) as a new attacking paradigm to measure the robustness of text-to-SQL models.
We build a systematic adversarial training example generation framework for better contextualization of data.
Experiments show that our approach not only brings the best improvement against table-side perturbations but also substantially empowers models against NL-side perturbations.
arXiv Detail & Related papers (2022-12-20T04:38:23Z)
- A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models [81.15974174627785]
We study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space.
Our analysis shows that robustness does not appear to continuously improve as a function of size, but the GPT-3 Davinci models (175B) achieve a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.
arXiv Detail & Related papers (2022-10-21T15:12:37Z)
- SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural-network-based approaches (called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z)
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
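As a companion to the TrustSQL entry above, here is a minimal sketch of a generic penalty-based scoring scheme: a correct query earns +1, an abstention earns 0, and an incorrect query costs a penalty c. The function name, abstention convention, and string-match correctness check are assumptions for illustration; TrustSQL's actual metric and its handling of feasible versus infeasible questions are defined in that paper.

```python
from typing import List, Optional

def penalty_score(predictions: List[Optional[str]],
                  gold: List[str],
                  c: float = 1.0) -> float:
    """Generic penalty-based score, averaged over the dataset:
    +1 for a correct query, 0 for an abstention (None), -c for a wrong query.
    Illustrative only; not TrustSQL's exact metric."""
    total = 0.0
    for pred, ans in zip(predictions, gold):
        if pred is None:       # model abstained from answering
            continue
        total += 1.0 if pred == ans else -c
    return total / len(gold)

# Example: two correct, one abstention, one error, with penalty c = 10.
print(penalty_score(["q1", None, "q3", "oops"], ["q1", "q2", "q3", "q4"], c=10.0))
# (1 + 0 + 1 - 10) / 4 = -2.0
```

Raising c rewards models that abstain rather than guess, which is the behavior a penalty-based metric of this kind is designed to probe.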