Controllable Data Augmentation for Context-Dependent Text-to-SQL
- URL: http://arxiv.org/abs/2304.13902v2
- Date: Fri, 28 Apr 2023 02:45:31 GMT
- Title: Controllable Data Augmentation for Context-Dependent Text-to-SQL
- Authors: Dingzirui Wang, Longxu Dou, Wanxiang Che
- Abstract summary: We introduce ConDA, which generates interactive questions and corresponding SQL results.
We also present a filter method that uses a grounding model to ensure data quality.
Our analysis of the augmented data shows that the data generated by ConDA are of high quality.
- Score: 46.11511797999039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The limited scale of annotated data constrains existing context-dependent
text-to-SQL models because of the complexity of labeling. Data augmentation
is a commonly used method to address this problem. However, the data
generated by current augmentation methods often lack diversity. In this paper,
we introduce ConDA, which generates interactive questions and corresponding SQL
results. We design a SQL dialogue state to enhance data diversity
through state transitions. We also present a filter method that uses a
grounding model to identify and remove low-quality questions that mismatch the
state information. Experimental results on the SParC and CoSQL datasets show
that ConDA boosts the baseline model to achieve an average improvement of
$3.3\%$ on complex questions. Moreover, our analysis of the augmented data
shows that the data generated by ConDA are of high quality in terms of SQL
template hardness and types, turns, and question consistency.
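The abstract describes two ideas: deriving new dialogue turns by transitioning a SQL dialogue state, and filtering generated questions that do not match the state. The sketch below is a toy illustration of that general pattern only, not the authors' implementation. The `SQLState` class, the two transition functions, and the keyword check in `keep` are all hypothetical; in particular, the keyword check merely stands in for the learned grounding model mentioned in the abstract.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class SQLState:
    """Hypothetical, simplified dialogue state: one table, columns, filters."""
    table: str
    select: Tuple[str, ...]
    where: Tuple[str, ...] = ()

    def to_sql(self) -> str:
        sql = f"SELECT {', '.join(self.select)} FROM {self.table}"
        if self.where:
            sql += " WHERE " + " AND ".join(self.where)
        return sql


def add_filter(state: SQLState, column: str, op: str, value: object):
    """Transition: append a WHERE condition. Returns (new state, keywords)."""
    condition = f"{column} {op} {value}"
    return replace(state, where=state.where + (condition,)), (column,)


def change_select(state: SQLState, column: str):
    """Transition: select a different column. Returns (new state, keywords)."""
    return replace(state, select=(column,)), (column,)


def keep(question: str, keywords: Tuple[str, ...]) -> bool:
    """Toy filter (assumption): keep a question only if it mentions every
    schema item the transition introduced. A real system would use a
    trained grounding model instead of substring matching."""
    q = question.lower()
    return all(k.lower() in q for k in keywords)


# Build a two-turn interaction from an initial state.
state0 = SQLState(table="singer", select=("name",))
state1, kw1 = add_filter(state0, "age", ">", 30)

# Candidate follow-up questions, as a question generator might produce them.
candidates = [
    "Which of them have an age above 30?",
    "Show their countries instead.",
]
kept = [q for q in candidates if keep(q, kw1)]

print(state1.to_sql())
print(kept)
```

Because each transition returns a new immutable state, several different follow-up turns can be branched from the same earlier turn, which is one simple way to vary the generated interactions.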
Related papers
- Domain Specific Question to SQL Conversion with Embedded Data Balancing Technique [0.0]
This paper proposes two intermediations to improve the accuracy of structured query language models.
The proposed solution achieved a 10.98 percentage improvement in model accuracy compared to the state-of-the-art model tested on the WikiSQL dataset.
arXiv Detail & Related papers (2025-03-28T08:58:14Z) - Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation [26.834687657847454]
Text-to-SQL models are increasingly adopted in real-world applications.
Deploying such models in the real world often requires adapting them to the highly specialized database schemas used in specific applications.
We find that existing text-to-SQL models experience significant performance drops when applied to new schemas.
Continuously obtaining high-quality text-to-SQL data for evolving schemas is prohibitively expensive in real-world scenarios.
arXiv Detail & Related papers (2025-02-21T22:32:35Z) - Domain Adaptation of a State of the Art Text-to-SQL Model: Lessons
Learned and Challenges Found [1.9963385352536616]
We analyze how well the base T5 Language Model and Picard perform on query structures different from the Spider dataset.
We present an alternative way to disambiguate the values in an input question using a rule-based approach.
arXiv Detail & Related papers (2023-12-09T03:30:21Z) - Wav2SQL: Direct Generalizable Speech-To-SQL Parsing [55.10009651476589]
Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries given databases.
We propose the first direct speech-to-SQL parsing model, Wav2SQL, which avoids error compounding across cascaded systems.
Experimental results demonstrate that Wav2SQL avoids error compounding and achieves state-of-the-art results, with up to 2.5% accuracy improvement over the baseline.
arXiv Detail & Related papers (2023-05-21T19:26:46Z) - SPSQL: Step-by-step Parsing Based Framework for Text-to-SQL Generation [13.196264569882777]
The current mainstream end-to-end Text2SQL model is not only difficult to build due to its complex structure and high requirements for training data, but also difficult to adjust due to massive parameters.
This paper proposes a pipeline method, SPSQL, to achieve the desired result.
We construct the dataset based on the marketing business data of the State Grid Corporation of China.
arXiv Detail & Related papers (2023-05-10T10:01:36Z) - Conversational Text-to-SQL: An Odyssey into State-of-the-Art and
Challenges Ahead [6.966624873109535]
State-of-the-art (SOTA) systems use large, pre-trained and finetuned language models, such as the T5-family.
With multi-tasking (MT) over coherent tasks with discrete prompts during training, we improve over specialized text-to-SQL models.
We conduct studies to tease apart errors attributable to domain and compositional generalization.
arXiv Detail & Related papers (2023-02-21T23:15:33Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from schema, imposes strong typing, and applies schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z) - Augmenting Multi-Turn Text-to-SQL Datasets with Self-Play [46.07002748587857]
We explore augmenting the training datasets using self-play, which leverages contextual information to synthesize new interactions.
We find that self-play improves the accuracy of a strong baseline on SParC and CoSQL, two widely used text-to-SQL datasets.
arXiv Detail & Related papers (2022-10-21T16:40:07Z) - SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural network based approaches (called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z) - IGSQL: Database Schema Interaction Graph Based Neural Model for
Context-Dependent Text-to-SQL Generation [61.09660709356527]
We propose a database schema interaction graph encoder to utilize historical information of database schema items.
We evaluate our model on the benchmark SParC and CoSQL datasets.
arXiv Detail & Related papers (2020-11-11T12:56:21Z) - Data Agnostic RoBERTa-based Natural Language to SQL Query Generation [0.0]
The NL2SQL task aims at finding deep learning approaches to solve the problem of converting natural language questions into valid SQL queries.
We have presented an approach with data privacy at its core.
Although we have not achieved state-of-the-art results, we have eliminated the need for the table right from the training of the model.
arXiv Detail & Related papers (2020-10-11T13:18:46Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)