Exploring Underexplored Limitations of Cross-Domain Text-to-SQL
Generalization
- URL: http://arxiv.org/abs/2109.05157v1
- Date: Sat, 11 Sep 2021 02:01:04 GMT
- Title: Exploring Underexplored Limitations of Cross-Domain Text-to-SQL
Generalization
- Authors: Yujian Gan, Xinyun Chen, Matthew Purver
- Abstract summary: Existing text-to-SQL models do not generalize when facing domain knowledge that does not frequently appear in the training data.
In this work, we investigate the robustness of text-to-SQL models when the questions require rarely observed domain knowledge.
We demonstrate that the prediction accuracy dramatically drops on samples that require such domain knowledge, even if the domain knowledge appears in the training set.
- Score: 20.550737675032448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been significant progress in studying neural networks for
translating text descriptions into SQL queries under the zero-shot cross-domain
setting. Despite achieving good performance on some public benchmarks, we
observe that existing text-to-SQL models do not generalize when facing domain
knowledge that does not frequently appear in the training data, which may
lead to worse prediction performance for unseen domains. In this work, we
investigate the robustness of text-to-SQL models when the questions require
rarely observed domain knowledge. In particular, we define five types of domain
knowledge and introduce Spider-DK (DK is the abbreviation of domain knowledge),
a human-curated dataset based on the Spider benchmark for text-to-SQL
translation. NL questions in Spider-DK are selected from Spider, and we modify
some samples by adding domain knowledge that reflects real-world question
paraphrases. We demonstrate that the prediction accuracy dramatically drops on
samples that require such domain knowledge, even if the domain knowledge
appears in the training set, and the model provides the correct predictions for
related training samples.
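The evaluation the abstract describes, comparing prediction accuracy on samples that do and do not require domain knowledge, can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation code; the per-sample `needs_dk` flag and the string-level exact-match metric are assumptions for the sketch.

```python
# Hypothetical sketch: measuring the accuracy gap between samples that require
# domain knowledge (DK) and those that do not. The sample format is assumed.

def exact_match_accuracy(samples, predict):
    """Fraction of samples whose predicted SQL string equals the gold SQL."""
    if not samples:
        return 0.0
    correct = sum(1 for s in samples if predict(s["question"]) == s["gold_sql"])
    return correct / len(samples)

def dk_accuracy_gap(samples, predict):
    """Accuracy on DK-free samples minus accuracy on DK-requiring samples."""
    dk = [s for s in samples if s["needs_dk"]]
    no_dk = [s for s in samples if not s["needs_dk"]]
    return exact_match_accuracy(no_dk, predict) - exact_match_accuracy(dk, predict)

# Toy model that only recognizes the surface pattern seen in "training":
toy_predict = lambda q: ("SELECT name FROM singer ORDER BY age DESC"
                         if "oldest" in q else "SELECT name FROM singer")

samples = [
    {"question": "List singer names",
     "gold_sql": "SELECT name FROM singer", "needs_dk": False},
    {"question": "Who is the oldest singer?",
     "gold_sql": "SELECT name FROM singer ORDER BY age DESC", "needs_dk": False},
    # Paraphrase requiring domain knowledge ("most senior" means oldest):
    {"question": "Who is the most senior singer?",
     "gold_sql": "SELECT name FROM singer ORDER BY age DESC", "needs_dk": True},
]
print(dk_accuracy_gap(samples, toy_predict))  # 1.0: the toy model fails every DK sample
```

The toy model mirrors the paper's finding in miniature: it answers the familiar phrasing correctly but fails the paraphrase, even though the required knowledge ("oldest" means ORDER BY age DESC) appears in its "training" pattern.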
Related papers
- Improving Generalization in Semantic Parsing by Increasing Natural
Language Variation [67.13483734810852]
In this work, we use data augmentation to enhance the robustness of text-to-SQL semantic parsing.
We leverage the capabilities of large language models to generate more realistic and diverse questions.
Using only a few prompts, we achieve a two-fold increase in the number of questions in Spider.
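The augmentation loop this entry describes, prompting a large language model to paraphrase each question while keeping its gold SQL, can be sketched as below. The `ask_llm` function is a placeholder stub, not a real API; a real implementation would call any chat-completion endpoint with the prompt.

```python
# Minimal sketch of LLM-based question augmentation: each (question, SQL) pair
# is expanded with paraphrases that share the same gold SQL. `ask_llm` is a
# canned stub standing in for a real LLM call, so the sketch is runnable.

def ask_llm(prompt):
    # Stub: a real implementation would send `prompt` to an LLM API.
    canned = {"Show all singers": ["List every singer", "Give me the singers"]}
    question = prompt.split("Question: ", 1)[1]
    return canned.get(question, [])

def augment(dataset, n_variants=2):
    """Return the dataset plus paraphrased copies sharing each gold SQL."""
    augmented = list(dataset)
    for question, sql in dataset:
        prompt = f"Rewrite this question {n_variants} ways. Question: {question}"
        for paraphrase in ask_llm(prompt)[:n_variants]:
            augmented.append((paraphrase, sql))
    return augmented

data = [("Show all singers", "SELECT * FROM singer")]
print(len(augment(data)))  # 3: the original pair plus two paraphrases
```

With two paraphrases per question this yields roughly the "two-fold increase" in question count the entry mentions, since each original contributes itself plus its variants.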
arXiv Detail & Related papers (2024-02-13T18:48:23Z) - Domain Adaptation of a State of the Art Text-to-SQL Model: Lessons
Learned and Challenges Found [1.9963385352536616]
We analyze how well the base T5 Language Model and Picard perform on query structures different from the Spider dataset.
We present an alternative way to disambiguate the values in an input question using a rule-based approach.
arXiv Detail & Related papers (2023-12-09T03:30:21Z) - Adapting Knowledge for Few-shot Table-to-Text Generation [35.59842534346997]
We propose a novel framework: Adapt-Knowledge-to-Generate (AKG)
AKG adapts unlabeled domain-specific knowledge into the model, which brings at least three benefits.
Our model achieves superior performance in terms of both fluency and accuracy as judged by human and automatic evaluations.
arXiv Detail & Related papers (2023-02-24T05:48:53Z) - Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic
Knowledge [54.85168428642474]
We build a new Chinese benchmark, KnowSQL, consisting of domain-specific questions covering various domains.
We then address this problem by presenting formulaic knowledge, rather than by annotating additional data examples.
More concretely, we construct a formulaic knowledge bank as a domain knowledge base and propose a framework (ReGrouP) to leverage this formulaic knowledge during parsing.
arXiv Detail & Related papers (2023-01-03T12:37:47Z) - DocuT5: Seq2seq SQL Generation with Table Documentation [5.586191108738563]
We develop a new text-to-SQL failure taxonomy and find that 19.6% of errors are due to foreign key mistakes.
We propose DocuT5, a method that captures knowledge from (1) table structure context of foreign keys and (2) domain knowledge through contextualizing tables and columns.
Both types of knowledge improve over state-of-the-art T5 with constrained decoding on Spider, and domain knowledge produces state-of-the-art comparable effectiveness on Spider-DK and Spider-SYN datasets.
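The DocuT5 entry above describes contextualizing tables and columns with documentation before seq2seq generation. A minimal sketch of that serialization step follows; the field names and the `table ... -- doc` layout are assumptions for illustration, not the paper's exact input format.

```python
# Sketch of documentation-contextualized schema serialization: flatten each
# table, its columns, and any per-table documentation into one seq2seq input
# string. The layout here is an assumed example, not DocuT5's actual format.

def serialize_schema(tables, docs):
    """Flatten tables and per-table documentation into one model input."""
    parts = []
    for table, columns in tables.items():
        doc = docs.get(table, "")
        cols = ", ".join(columns)
        # rstrip removes the trailing " -- " when a table has no documentation
        parts.append(f"table {table} ({cols}) -- {doc}".rstrip(" -"))
    return " | ".join(parts)

tables = {"singer": ["singer_id", "name", "age"]}
docs = {"singer": "one row per performer; age in years"}
print(serialize_schema(tables, docs))
# table singer (singer_id, name, age) -- one row per performer; age in years
```

The idea is that a note like "age in years" gives the model the domain knowledge it would otherwise have to infer, which is exactly the kind of knowledge Spider-DK shows models fail to recover on their own.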
arXiv Detail & Related papers (2022-11-11T13:31:55Z) - Using Language to Extend to Unseen Domains [81.37175826824625]
It is expensive to collect training data for every possible domain that a vision model may encounter when deployed.
We consider how simply verbalizing the training domain as well as domains we want to extend to but do not have data for can improve robustness.
Using a multimodal model with a joint image and language embedding space, our method LADS learns a transformation of the image embeddings from the training domain to each unseen test domain.
arXiv Detail & Related papers (2022-10-18T01:14:02Z) - Open Domain Question Answering over Virtual Documents: A Unified
Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means for encoding structured knowledge for knowledge-intensive applications, i.e. open-domain question answering (QA)
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z) - KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers [26.15889661083109]
We present KaggleDBQA, a new cross-domain evaluation dataset of real Web databases.
We show that KaggleDBQA presents a challenge to state-of-the-art zero-shot parsers, but that a more realistic evaluation setting and creative use of associated database documentation boost their accuracy by over 13.2%.
arXiv Detail & Related papers (2021-06-22T00:08:03Z) - FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine
Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT)
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks and smart phones.
We make quantitative experiments and deep analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z) - Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
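The domain data selection described in the last entry, keeping only training sentences whose representations fall near a target domain's cluster, can be sketched as below. The paper uses pretrained-LM sentence embeddings with a Gaussian mixture; here toy 2-D vectors and a nearest-centroid rule stand in to keep the example dependency-free.

```python
# Illustrative sketch of embedding-based domain data selection: keep pool
# sentences whose (toy) embedding lies close to the centroid of a handful of
# in-domain examples. The embeddings and threshold are assumptions.

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def select_in_domain(pool, domain_examples, threshold=1.0):
    """Keep pool sentences whose embedding is near the domain centroid."""
    c = centroid([v for _, v in domain_examples])

    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, c)) ** 0.5

    return [sentence for sentence, v in pool if dist(v) <= threshold]

# Toy embeddings: medical sentences near (1, 0), legal sentences near (0, 1).
domain = [("the patient was discharged", [1.0, 0.1]),
          ("dosage was adjusted", [0.9, 0.0])]
pool = [("the court ruled", [0.1, 1.0]),
        ("symptoms improved after treatment", [0.95, 0.05]),
        ("the contract was void", [0.0, 0.9])]
print(select_in_domain(pool, domain))  # ['symptoms improved after treatment']
```

Only the medical pool sentence survives the distance threshold, mirroring how cluster-based selection filters a mixed corpus down to the domain of interest before training.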
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.