Improving Generalization in Language Model-Based Text-to-SQL Semantic
Parsing: Two Simple Semantic Boundary-Based Techniques
- URL: http://arxiv.org/abs/2305.17378v1
- Date: Sat, 27 May 2023 06:09:03 GMT
- Title: Improving Generalization in Language Model-Based Text-to-SQL Semantic
Parsing: Two Simple Semantic Boundary-Based Techniques
- Authors: Daking Rai, Bailin Wang, Yilun Zhou and Ziyu Yao
- Abstract summary: We introduce a token preprocessing method to preserve the semantic boundaries of tokens produced by LM tokenizers.
At the sequence level, we propose to use special tokens to mark the boundaries of components aligned between input and output.
Our experimental results on two text-to-SQL semantic parsing datasets show that our token preprocessing, although simple, can substantially improve the LM performance on both types of generalization.
- Score: 14.634536051274468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional and domain generalization present significant challenges in
semantic parsing, even for state-of-the-art semantic parsers based on
pre-trained language models (LMs). In this study, we empirically investigate
improving an LM's generalization in semantic parsing with two simple
techniques: at the token level, we introduce a token preprocessing method to
preserve the semantic boundaries of tokens produced by LM tokenizers; at the
sequence level, we propose to use special tokens to mark the boundaries of
components aligned between input and output. Our experimental results on two
text-to-SQL semantic parsing datasets show that our token preprocessing,
although simple, can substantially improve the LM performance on both types of
generalization, and our component boundary marking method is particularly
helpful for compositional generalization.
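The two techniques admit a compact illustration. Below is a minimal Python sketch of what they might look like; the specific preprocessing rules (spacing out underscores and camelCase) and the <c>/</c> marker names are illustrative assumptions, not the authors' exact implementation.
```python
import re

def preprocess_tokens(text: str) -> str:
    # Token-level technique (sketch): expose semantic boundaries as
    # whitespace so a subword tokenizer cannot merge pieces across them,
    # e.g. "song_name" -> "song _ name".
    text = text.replace("_", " _ ")
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)  # camelCase -> camel Case
    return text

def mark_component_boundaries(text: str, components: list) -> str:
    # Sequence-level technique (sketch): wrap components that align
    # between input and output in special boundary tokens. The marker
    # names <c> and </c> are hypothetical.
    for c in components:
        text = text.replace(c, f"<c> {c} </c>")
    return text

question = "show the song_name of every track"
print(preprocess_tokens(question))                 # show the song _ name of every track
print(mark_component_boundaries(question, ["song_name"]))
```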
Related papers
- Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning [20.100484034021285]
Token Internal Position Awareness (TIPA) is a novel approach that enhances LLMs' understanding of internal token structures.
TIPA enables models to effectively learn and generalize character positions and internal structures.
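As a rough illustration of the idea, character-position supervision can be derived directly from the tokens themselves; the reverse-order target format below is an assumption for illustration, not necessarily TIPA's exact training signal.
```python
def char_position_targets(token: str):
    # Hypothetical TIPA-style supervision: pair each character of a token
    # with its position counted from the token's end, so the model must
    # internalize the token's character structure.
    n = len(token)
    return [(ch, n - i) for i, ch in enumerate(token)]

print(char_position_targets("parsing"))
# [('p', 7), ('a', 6), ('r', 5), ('s', 4), ('i', 3), ('n', 2), ('g', 1)]
```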
arXiv Detail & Related papers (2024-11-26T18:44:39Z)
- NormXLogit: The Head-on-Top Never Lies [15.215985417763472]
The Transformer architecture has emerged as the dominant choice for building large language models.
We propose a novel technique, called NormXLogit, for assessing the significance of individual input tokens.
We show that our approach consistently outperforms existing gradient-based methods in terms of faithfulness.
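A hedged sketch of the scoring idea: combine the norm of each token's input embedding with the logit that the head on top assigns to the prediction of interest. The exact combination below is an assumption; the paper's formula may differ.
```python
import torch

def normxlogit_scores(input_embeds, final_hidden, lm_head_weight, target_id):
    # input_embeds, final_hidden: (seq_len, d_model); lm_head_weight: (vocab, d_model)
    norms = input_embeds.norm(dim=-1)                  # per-token embedding norm
    logits = final_hidden @ lm_head_weight[target_id]  # per-position target logit
    return norms * logits                              # (seq_len,) importance scores

scores = normxlogit_scores(torch.randn(5, 8), torch.randn(5, 8),
                           torch.randn(100, 8), target_id=42)
print(scores.shape)  # torch.Size([5])
```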
arXiv Detail & Related papers (2024-11-25T10:12:27Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to decompose the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Cross-domain Chinese Sentence Pattern Parsing [67.1381983012038]
Sentence Pattern Structure (SPS) parsing is a syntactic analysis method primarily employed in language teaching.
Existing SPS parsers rely heavily on textbook corpora for training and lack cross-domain capability.
This paper proposes an innovative approach leveraging large language models (LLMs) within a self-training framework.
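The summary does not spell the framework out, but a generic self-training loop of the kind it suggests might look like the following; `llm_parse`, `train_parser`, and `confidence` are hypothetical placeholders, not the paper's API.
```python
def self_training(seed_data, unlabeled, llm_parse, train_parser, confidence,
                  threshold=0.9, rounds=3):
    # Generic self-training skeleton (an assumed shape, not the paper's
    # exact algorithm): pseudo-label out-of-domain sentences with an LLM,
    # keep only confident parses, retrain, and repeat.
    data = list(seed_data)
    parser = train_parser(data)
    for _ in range(rounds):
        pseudo = [(s, llm_parse(s)) for s in unlabeled]
        data += [(s, p) for s, p in pseudo if confidence(parser, s, p) >= threshold]
        parser = train_parser(data)
    return parser
```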
arXiv Detail & Related papers (2024-02-26T05:30:48Z)
- Translate First Reorder Later: Leveraging Monotonicity in Semantic Parsing [4.396860522241306]
TPol is a two-step approach that translates input sentences monotonically and then reorders them to obtain the correct output.
We test our approach on two popular semantic parsing datasets.
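The two-step shape of the approach can be sketched as a pipeline; `translate` and `reorder` stand in for the paper's two learned modules and are assumptions about the interface.
```python
def tpol_parse(sentence, translate, reorder):
    # Step 1 (sketch): map input words to target tokens monotonically,
    # preserving the source order.
    monotonic_draft = translate(sentence)
    # Step 2 (sketch): reorder the draft into a well-formed target sequence.
    return reorder(monotonic_draft)

# toy usage with placeholder modules
print(tpol_parse("the capital of georgia",
                 translate=lambda s: s.split(),
                 reorder=lambda toks: list(reversed(toks))))
```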
arXiv Detail & Related papers (2022-10-10T17:50:42Z)
- Guiding the PLMs with Semantic Anchors as Intermediate Supervision: Towards Interpretable Semantic Parsing [57.11806632758607]
We propose to couple current pretrained language models with a hierarchical decoder network.
By taking the first-principle structures as the semantic anchors, we propose two novel intermediate supervision tasks.
We conduct intensive experiments on several semantic parsing benchmarks and demonstrate that our approach can consistently outperform the baselines.
arXiv Detail & Related papers (2022-10-04T07:27:29Z)
- Neural Token Segmentation for High Token-Internal Complexity [7.569526565230962]
Tokenizing raw texts into word units is an essential pre-processing step for NLP pipelines.
We propose a novel neural segmentation model which combines contextualised token representation and char-level decoding.
Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art.
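As a simplified stand-in for the model (the paper combines contextualized token representations with char-level decoding; the sketch below reduces it to a contextual boundary tagger over characters):
```python
import torch
import torch.nn as nn

class CharSegmenter(nn.Module):
    # Minimal sketch, not the paper's architecture: contextual character
    # representations from a BiLSTM, with a per-character classifier that
    # decides whether a segment boundary follows that character.
    def __init__(self, n_chars, d=64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, d)
        self.encoder = nn.LSTM(d, d, bidirectional=True, batch_first=True)
        self.boundary = nn.Linear(2 * d, 2)   # boundary / no boundary

    def forward(self, char_ids):              # (batch, n_chars)
        h, _ = self.encoder(self.embed(char_ids))
        return self.boundary(h)               # (batch, n_chars, 2)

model = CharSegmenter(n_chars=128)
logits = model(torch.randint(0, 128, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 2])
```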
arXiv Detail & Related papers (2022-03-21T10:07:17Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) enforces consistency between the predictions made on inputs tokenized by the standard segmentation and by probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
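The core consistency term can be sketched as a symmetric KL penalty between the two views of the same input; the symmetric form and weighting below are assumptions rather than the paper's exact objective.
```python
import torch
import torch.nn.functional as F

def mvr_consistency_loss(logits_standard, logits_sampled):
    # Penalize divergence between the model's predictive distributions for
    # the same input under deterministic vs. sampled subword segmentations.
    p = F.log_softmax(logits_standard, dim=-1)
    q = F.log_softmax(logits_sampled, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))

loss = mvr_consistency_loss(torch.randn(4, 10), torch.randn(4, 10))
print(loss.item() >= 0)  # KL divergence is non-negative
```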
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
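Training data for such an objective can be constructed by permuting a document's sentences and asking the model to recover the original order; the target encoding below (the inverse permutation) is an illustrative assumption.
```python
import random

def make_unshuffling_example(document_sentences):
    # Shuffle the sentences of a document; the target records, for each
    # original position, where that sentence now sits in the shuffled list.
    order = list(range(len(document_sentences)))
    random.shuffle(order)
    shuffled = [document_sentences[i] for i in order]
    target = sorted(range(len(order)), key=order.__getitem__)  # inverse permutation
    return shuffled, target

doc = ["Ann woke up.", "She made coffee.", "Then she left for work."]
print(make_unshuffling_example(doc))
```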
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
- Compositional Generalization via Semantic Tagging [81.24269148865555]
We propose a new decoding framework that preserves the expressivity and generality of sequence-to-sequence models.
We show that the proposed approach consistently improves compositional generalization across model architectures, domains, and semantic formalisms.
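A minimal sketch of the tag-then-decode shape the title suggests; the token|tag interface and both callables are hypothetical placeholders.
```python
def tag_then_parse(tokens, tagger, seq2seq):
    # Sketch: label each input token with a semantic tag, then let the
    # sequence-to-sequence parser consume the token/tag pairs.
    tags = tagger(tokens)
    return seq2seq([f"{tok}|{tag}" for tok, tag in zip(tokens, tags)])

# toy usage with placeholder components
print(tag_then_parse(["flights", "to", "boston"],
                     tagger=lambda ts: ["entity" if t == "boston" else "O" for t in ts],
                     seq2seq=lambda xs: " ".join(xs)))
```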
arXiv Detail & Related papers (2020-10-22T15:55:15Z)
- Filtered Inner Product Projection for Crosslingual Embedding Alignment [28.72288652451881]
Filtered Inner Product Projection (FIPP) is a method for mapping embeddings to a common representation space.
FIPP is applicable even when the source and target embeddings are of differing dimensionalities.
We show that our approach outperforms existing methods on the MUSE dataset for various language pairs.
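FIPP's exact inner-product objective is not reproduced here; the sketch below uses a plain least-squares linear map as a stand-in, which likewise handles source and target embeddings of differing dimensionalities.
```python
import numpy as np

def map_to_common_space(X_src, Y_tgt):
    # Stand-in for embedding alignment (not FIPP's objective): learn a
    # linear map W sending seed-dictionary source vectors onto their
    # target translations, with d_src != d_tgt allowed.
    W, *_ = np.linalg.lstsq(X_src, Y_tgt, rcond=None)   # (d_src, d_tgt)
    return W

X = np.random.randn(200, 300)   # source seed embeddings, d_src = 300
Y = np.random.randn(200, 100)   # target seed embeddings, d_tgt = 100
print(map_to_common_space(X, Y).shape)  # (300, 100)
```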
arXiv Detail & Related papers (2020-06-05T19:53:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.