Improving Generalization in Language Model-Based Text-to-SQL Semantic
Parsing: Two Simple Semantic Boundary-Based Techniques
- URL: http://arxiv.org/abs/2305.17378v1
- Date: Sat, 27 May 2023 06:09:03 GMT
- Title: Improving Generalization in Language Model-Based Text-to-SQL Semantic
Parsing: Two Simple Semantic Boundary-Based Techniques
- Authors: Daking Rai, Bailin Wang, Yilun Zhou and Ziyu Yao
- Abstract summary: We introduce a token preprocessing method to preserve the semantic boundaries of tokens produced by LM tokenizers.
At the sequence level, we propose to use special tokens to mark the boundaries of components aligned between input and output.
Our experimental results on two text-to-SQL semantic parsing datasets show that our token preprocessing, although simple, can substantially improve the LM performance on both types of generalization.
- Score: 14.634536051274468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional and domain generalization present significant challenges in
semantic parsing, even for state-of-the-art semantic parsers based on
pre-trained language models (LMs). In this study, we empirically investigate
improving an LM's generalization in semantic parsing with two simple
techniques: at the token level, we introduce a token preprocessing method to
preserve the semantic boundaries of tokens produced by LM tokenizers; at the
sequence level, we propose to use special tokens to mark the boundaries of
components aligned between input and output. Our experimental results on two
text-to-SQL semantic parsing datasets show that our token preprocessing,
although simple, can substantially improve the LM performance on both types of
generalization, and our component boundary marking method is particularly
helpful for compositional generalization.
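To make the two techniques concrete, below is a minimal Python sketch of both ideas. The function names, the "[C]"/"[/C]" marker strings, and the camel-case rule are illustrative assumptions rather than the paper's exact implementation: the token-level idea is to rewrite identifiers so that a subword tokenizer (e.g., T5's) can only split at semantic boundaries such as the underscores in SQL column names, and the sequence-level idea is to wrap each aligned input/output component in special boundary tokens.

```python
import re

def preprocess_token(token: str) -> str:
    """Expose semantic boundaries so a subword tokenizer splits along them.

    "pet_age" -> "pet _ age"; "petAge" -> "pet Age". Surrounding the
    underscore with spaces (and breaking at case changes) prevents the
    tokenizer from producing subwords that straddle a semantic boundary.
    The exact rewrite rules here are an assumption, not the paper's code.
    """
    token = token.replace("_", " _ ")                   # snake_case boundary
    token = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)  # camelCase boundary
    return re.sub(r"\s+", " ", token).strip()

def mark_components(components: list[str]) -> str:
    """Join aligned components, wrapping each in special boundary tokens.

    "[C]" and "[/C]" are hypothetical markers; in practice they would be
    registered as special tokens in the LM vocabulary so the tokenizer
    never splits them.
    """
    return " ".join(f"[C] {c} [/C]" for c in components)

# Token-level preprocessing of a schema column name:
print(preprocess_token("pet_age"))  # -> "pet _ age"

# Sequence-level boundary marking of aligned question/SQL components:
question = mark_components(["show the name", "of pets", "older than 2"])
sql = mark_components(["SELECT name", "FROM pets", "WHERE pet _ age > 2"])
print(question)
print(sql)
```

In this reading, a postprocessing step that strips the inserted spaces would recover the original identifiers in the generated SQL; per the abstract, the token-level fix alone helps both compositional and domain generalization, while the boundary markers help chiefly with compositional generalization.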
Related papers
- Enhancing LLM Character-Level Manipulation via Divide and Conquer [108.6908427615402]
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks.
They exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution.
We propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation.
arXiv Detail & Related papers (2025-02-12T07:37:39Z)
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [44.84219266082269]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data.
We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
- Cross-domain Chinese Sentence Pattern Parsing [67.1381983012038]
Sentence Pattern Structure (SPS) parsing is a syntactic analysis method primarily employed in language teaching.
Existing SPS parsers rely heavily on textbook corpora for training and lack cross-domain capability.
This paper proposes an innovative approach leveraging large language models (LLMs) within a self-training framework.
arXiv Detail & Related papers (2024-02-26T05:30:48Z)
- Translate First Reorder Later: Leveraging Monotonicity in Semantic Parsing [4.396860522241306]
TPol is a two-step approach that translates input sentences monotonically and then reorders them to obtain the correct output.
We test our approach on two popular semantic parsing datasets.
arXiv Detail & Related papers (2022-10-10T17:50:42Z)
- Guiding the PLMs with Semantic Anchors as Intermediate Supervision: Towards Interpretable Semantic Parsing [57.11806632758607]
We propose to augment current pretrained language models with a hierarchical decoder network.
By taking the first-principle structures as the semantic anchors, we propose two novel intermediate supervision tasks.
We conduct intensive experiments on several semantic parsing benchmarks and demonstrate that our approach can consistently outperform the baselines.
arXiv Detail & Related papers (2022-10-04T07:27:29Z)
- Neural Token Segmentation for High Token-Internal Complexity [7.569526565230962]
Tokenizing raw texts into word units is an essential pre-processing step for NLP pipelines.
We propose a novel neural segmentation model which combines contextualised token representation and char-level decoding.
Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art.
arXiv Detail & Related papers (2022-03-21T10:07:17Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions made from inputs tokenized by the standard and by probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
- Compositional Generalization via Semantic Tagging [81.24269148865555]
We propose a new decoding framework that preserves the expressivity and generality of sequence-to-sequence models.
We show that the proposed approach consistently improves compositional generalization across model architectures, domains, and semantic formalisms.
arXiv Detail & Related papers (2020-10-22T15:55:15Z)
- Filtered Inner Product Projection for Crosslingual Embedding Alignment [28.72288652451881]
Filtered Inner Product Projection (FIPP) is a method for mapping embeddings to a common representation space.
FIPP is applicable even when the source and target embeddings are of differing dimensionalities.
We show that our approach outperforms existing methods on the MUSE dataset for various language pairs.
arXiv Detail & Related papers (2020-06-05T19:53:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.