Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL
Robustness
- URL: http://arxiv.org/abs/2301.08881v1
- Date: Sat, 21 Jan 2023 03:57:18 GMT
- Title: Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL
Robustness
- Authors: Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu,
Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien,
Steve Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng,
Bing Xiang
- Abstract summary: Recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations.
We propose a comprehensive robustness benchmark based on Spider to diagnose model robustness.
We conduct a diagnostic study of state-of-the-art models on the robustness set.
- Score: 115.66421993459663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural text-to-SQL models have achieved remarkable performance in translating
natural language questions into SQL queries. However, recent studies reveal
that text-to-SQL models are vulnerable to task-specific perturbations. Previous
curated robustness test sets usually focus on individual phenomena. In this
paper, we propose a comprehensive robustness benchmark based on Spider, a
cross-domain text-to-SQL benchmark, to diagnose the model robustness. We design
17 perturbations on databases, natural language questions, and SQL queries to
measure the robustness from different angles. In order to collect more
diversified natural question perturbations, we utilize large pretrained
language models (PLMs) to simulate human behaviors in creating natural
questions. We conduct a diagnostic study of the state-of-the-art models on the
robustness set. Experimental results reveal that even the most robust model
suffers from a 14.0% performance drop overall and a 50.7% performance drop on
the most challenging perturbation. We also present a breakdown analysis
regarding text-to-SQL model designs and provide insights for improving model
robustness.
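To make the headline numbers concrete, below is a minimal sketch of how such a robustness drop can be computed: accuracy on the original Spider examples versus accuracy on their perturbed counterparts, reported as a relative drop. The function names, perturbation labels, data layout, and plain string matching are illustrative assumptions, not Dr.Spider's released evaluation code, which relies on execution- or set-match-based accuracy.

```python
from typing import Dict, List

def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predicted SQL queries that match the gold queries.
    Plain string equality is a stand-in for Spider's exact-set-match metric."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def relative_drop(pre_acc: float, post_acc: float) -> float:
    """Relative performance drop (%) from the original set to the perturbed set."""
    return 100.0 * (pre_acc - post_acc) / pre_acc

# Hypothetical accuracies for one model on two perturbation types.
results: Dict[str, Dict[str, float]] = {
    "DB-schema-synonym": {"pre": 0.81, "post": 0.65},
    "NLQ-paraphrase": {"pre": 0.81, "post": 0.40},
}
for perturbation, acc in results.items():
    drop = relative_drop(acc["pre"], acc["post"])
    print(f"{perturbation}: {drop:.1f}% relative drop")
```

Under a convention like this, the 14.0% and 50.7% figures above would correspond to the overall and worst-case per-perturbation drops, respectively.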
Related papers
- TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring [11.78795632771211]
We introduce a novel benchmark designed to evaluate text-to-SQL reliability as a model's ability to correctly handle any type of input question.
We evaluate existing methods using a novel penalty-based scoring metric with two modeling approaches (a generic sketch of penalty-based scoring appears after the list below).
arXiv Detail & Related papers (2024-03-23T16:12:52Z)
- CodeS: Towards Building Open-source Language Models for Text-to-SQL [42.11113113574589]
We introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B.
CodeS is a fully open language model, which achieves superior accuracy with much smaller parameter sizes.
We conduct comprehensive evaluations on multiple datasets, including the widely used Spider benchmark.
arXiv Detail & Related papers (2024-02-26T07:00:58Z)
- Improving Generalization in Semantic Parsing by Increasing Natural Language Variation [67.13483734810852]
In this work, we use data augmentation to enhance the robustness of text-to-SQL semantic parsing.
We leverage the capabilities of large language models to generate more realistic and diverse questions.
Using only a few prompts, we achieve a two-fold increase in the number of questions in Spider.
arXiv Detail & Related papers (2024-02-13T18:48:23Z)
- Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
- Towards Robustness of Text-to-SQL Models Against Natural and Realistic Adversarial Table Perturbation [38.00832631674398]
We introduce the Adversarial Table Perturbation (ATP) as a new attacking paradigm to measure the robustness of text-to-SQL models.
We build a systematic adversarial training example generation framework for better contextualization of data.
Experiments show that our approach not only brings the best improvement against table-side perturbations but also substantially empowers models against NL-side perturbations.
arXiv Detail & Related papers (2022-12-20T04:38:23Z)
- A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models [81.15974174627785]
We study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space.
Our analysis shows that robustness does not appear to continuously improve as a function of size, but the GPT-3 Davinci models (175B) achieve a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.
arXiv Detail & Related papers (2022-10-21T15:12:37Z)
- SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural-network-based approaches (called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z)
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
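As a companion to the TrustSQL entry above, here is a minimal sketch of a generic penalty-based scoring scheme: a correct query earns +1, an abstention earns 0, and an incorrect query costs a penalty c. The function name, abstention convention, and string-match correctness check are assumptions for illustration; TrustSQL's actual metric and its handling of feasible versus infeasible questions are defined in that paper.

```python
from typing import List, Optional

def penalty_score(predictions: List[Optional[str]],
                  gold: List[str],
                  c: float = 1.0) -> float:
    """Generic penalty-based score, averaged over the dataset:
    +1 for a correct query, 0 for an abstention (None), -c for a wrong query.
    Illustrative only; not TrustSQL's exact metric."""
    total = 0.0
    for pred, ans in zip(predictions, gold):
        if pred is None:       # model abstained from answering
            continue
        total += 1.0 if pred == ans else -c
    return total / len(gold)

# Example: two correct, one abstention, one error, with penalty c = 10.
print(penalty_score(["q1", None, "q3", "oops"], ["q1", "q2", "q3", "q4"], c=10.0))
# (1 + 0 + 1 - 10) / 4 = -2.0
```

Raising c rewards models that abstain rather than guess, which is the behavior a penalty-based metric of this kind is designed to probe.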