Improving Generalization in Language Model-Based Text-to-SQL Semantic
Parsing: Two Simple Semantic Boundary-Based Techniques
- URL: http://arxiv.org/abs/2305.17378v1
- Date: Sat, 27 May 2023 06:09:03 GMT
- Title: Improving Generalization in Language Model-Based Text-to-SQL Semantic
Parsing: Two Simple Semantic Boundary-Based Techniques
- Authors: Daking Rai, Bailin Wang, Yilun Zhou and Ziyu Yao
- Abstract summary: We introduce a token preprocessing method to preserve the semantic boundaries of tokens produced by LM tokenizers.
At the sequence level, we propose to use special tokens to mark the boundaries of components aligned between input and output.
Our experimental results on two text-to-SQL semantic parsing datasets show that our token preprocessing, although simple, can substantially improve the LM performance on both types of generalization.
- Score: 14.634536051274468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional and domain generalization present significant challenges in
semantic parsing, even for state-of-the-art semantic parsers based on
pre-trained language models (LMs). In this study, we empirically investigate
improving an LM's generalization in semantic parsing with two simple
techniques: at the token level, we introduce a token preprocessing method to
preserve the semantic boundaries of tokens produced by LM tokenizers; at the
sequence level, we propose to use special tokens to mark the boundaries of
components aligned between input and output. Our experimental results on two
text-to-SQL semantic parsing datasets show that our token preprocessing,
although simple, can substantially improve the LM performance on both types of
generalization, and our component boundary marking method is particularly
helpful for compositional generalization.
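To make the two techniques concrete, below is a minimal Python sketch of both ideas. The function names, the "[C]"/"[/C]" marker strings, and the camel-case rule are illustrative assumptions rather than the paper's exact implementation: the token-level idea is to rewrite identifiers so that a subword tokenizer (e.g., T5's) can only split at semantic boundaries such as the underscores in SQL column names, and the sequence-level idea is to wrap each aligned input/output component in special boundary tokens.

```python
import re

def preprocess_token(token: str) -> str:
    """Expose semantic boundaries so a subword tokenizer splits along them.

    "pet_age" -> "pet _ age"; "petAge" -> "pet Age". Surrounding the
    underscore with spaces (and breaking at case changes) prevents the
    tokenizer from producing subwords that straddle a semantic boundary.
    The exact rewrite rules here are an assumption, not the paper's code.
    """
    token = token.replace("_", " _ ")                   # snake_case boundary
    token = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)  # camelCase boundary
    return re.sub(r"\s+", " ", token).strip()

def mark_components(components: list[str]) -> str:
    """Join aligned components, wrapping each in special boundary tokens.

    "[C]" and "[/C]" are hypothetical markers; in practice they would be
    registered as special tokens in the LM vocabulary so the tokenizer
    never splits them.
    """
    return " ".join(f"[C] {c} [/C]" for c in components)

# Token-level preprocessing of a schema column name:
print(preprocess_token("pet_age"))  # -> "pet _ age"

# Sequence-level boundary marking of aligned question/SQL components:
question = mark_components(["show the name", "of pets", "older than 2"])
sql = mark_components(["SELECT name", "FROM pets", "WHERE pet _ age > 2"])
print(question)
print(sql)
```

In this reading, a postprocessing step that strips the inserted spaces would recover the original identifiers in the generated SQL; per the abstract, the token-level fix alone helps both compositional and domain generalization, while the boundary markers help chiefly with compositional generalization.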
Related papers
- Enhancing LLM Character-Level Manipulation via Divide and Conquer [108.6908427615402]
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks.
They exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution.
We propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation.
arXiv Detail & Related papers (2025-02-12T07:37:39Z)
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [44.84219266082269]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data.
We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
- Cross-domain Chinese Sentence Pattern Parsing [67.1381983012038]
Sentence Pattern Structure (SPS) parsing is a syntactic analysis method primarily employed in language teaching.
Existing SPS parsers rely heavily on textbook corpora for training and lack cross-domain capability.
This paper proposes an innovative approach leveraging large language models (LLMs) within a self-training framework.
arXiv Detail & Related papers (2024-02-26T05:30:48Z)
- Translate First Reorder Later: Leveraging Monotonicity in Semantic Parsing [4.396860522241306]
TPol is a two-step approach that translates input sentences monotonically and then reorders them to obtain the correct output.
We test our approach on two popular semantic parsing datasets.
arXiv Detail & Related papers (2022-10-10T17:50:42Z)
- Guiding the PLMs with Semantic Anchors as Intermediate Supervision: Towards Interpretable Semantic Parsing [57.11806632758607]
We propose to augment current pretrained language models with a hierarchical decoder network.
By taking the first-principle structures as the semantic anchors, we propose two novel intermediate supervision tasks.
We conduct intensive experiments on several semantic parsing benchmarks and demonstrate that our approach can consistently outperform the baselines.
arXiv Detail & Related papers (2022-10-04T07:27:29Z)
- Neural Token Segmentation for High Token-Internal Complexity [7.569526565230962]
Tokenizing raw texts into word units is an essential pre-processing step for NLP pipelines.
We propose a novel neural segmentation model which combines contextualised token representation and char-level decoding.
Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art.
arXiv Detail & Related papers (2022-03-21T10:07:17Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions made from inputs tokenized by the standard and by probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
- Compositional Generalization via Semantic Tagging [81.24269148865555]
We propose a new decoding framework that preserves the expressivity and generality of sequence-to-sequence models.
We show that the proposed approach consistently improves compositional generalization across model architectures, domains, and semantic formalisms.
arXiv Detail & Related papers (2020-10-22T15:55:15Z)
- Filtered Inner Product Projection for Crosslingual Embedding Alignment [28.72288652451881]
Filtered Inner Product Projection (FIPP) is a method for mapping embeddings to a common representation space.
FIPP is applicable even when the source and target embeddings are of differing dimensionalities.
We show that our approach outperforms existing methods on the MUSE dataset for various language pairs.
arXiv Detail & Related papers (2020-06-05T19:53:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.