Superbizarre Is Not Superb: Improving BERT's Interpretations of Complex
Words with Derivational Morphology
- URL: http://arxiv.org/abs/2101.00403v1
- Date: Sat, 2 Jan 2021 08:36:48 GMT
- Title: Superbizarre Is Not Superb: Improving BERT's Interpretations of Complex
Words with Derivational Morphology
- Authors: Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze
- Abstract summary: We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords.
Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.
- Score: 13.535770763481905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How does the input segmentation of pretrained language models (PLMs) affect
their generalization capabilities? We present the first study investigating
this question, taking BERT as the example PLM and focusing on the semantic
representations of derivationally complex words. We show that PLMs can be
interpreted as serial dual-route models, i.e., the meanings of complex words
are either stored or else need to be computed from the subwords, which implies
that maximally meaningful input tokens should allow for the best generalization
on new words. This hypothesis is confirmed by a series of semantic probing
tasks on which derivational segmentation consistently outperforms BERT's
WordPiece segmentation by a large margin. Our results suggest that the
generalization capabilities of PLMs could be further improved if a
morphologically-informed vocabulary of input tokens were used.
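To make the contrast concrete, the sketch below compares BERT's WordPiece segmentation of a few derivationally complex words with a hand-written derivational (prefix/root/suffix) segmentation. It is a minimal illustration only: it assumes the Hugging Face transformers library and the bert-base-uncased vocabulary, and the derivational splits are supplied by hand for illustration, not produced by the paper's segmentation method.

```python
# Minimal sketch: WordPiece subwords vs. a hand-specified derivational split.
# Assumes the Hugging Face `transformers` package and the bert-base-uncased
# vocabulary; the derivational splits below are illustrative, hand-written
# examples, not the paper's actual segmentation algorithm.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

words = ["superbizarre", "unhappiness", "dishonestly"]

# Illustrative prefix/root/suffix segmentations (hypothetical).
derivational = {
    "superbizarre": ["super", "bizarre"],
    "unhappiness": ["un", "happy", "ness"],
    "dishonestly": ["dis", "honest", "ly"],
}

for word in words:
    wordpiece = tokenizer.tokenize(word)  # greedy longest-match subword split
    print(f"{word:>14}  WordPiece: {wordpiece}  Derivational: {derivational[word]}")
```

The exact WordPiece output depends on the vocabulary and may split a word like superbizarre in a way that suggests an unrelated stem such as superb, as the paper's title hints; the paper's probing results indicate that morphologically meaningful segments like the hand-written ones above support better generalization to new complex words.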
Related papers
- From Tokens to Words: On the Inner Lexicon of LLMs [7.148628740938674]
Natural language is composed of words, but modern LLMs process sub-words as input.
We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent word representations.
Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope.
arXiv Detail & Related papers (2024-10-08T09:53:35Z)
- Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT [59.245414547751636]
We propose a circuit discovery framework alternative to activation patching.
Our framework suffers less from out-of-distribution issues and proves to be more efficient in terms of complexity.
We dig into a small transformer trained on a synthetic task (Othello) and find a number of human-understandable fine-grained circuits inside it.
arXiv Detail & Related papers (2024-02-19T15:04:53Z)
- Improving Generalization in Language Model-Based Text-to-SQL Semantic Parsing: Two Simple Semantic Boundary-Based Techniques [14.634536051274468]
We introduce a token preprocessing method to preserve the semantic boundaries of tokens produced by LM tokenizers.
At the sequence level, we propose to use special tokens to mark the boundaries of components aligned between input and output.
Our experimental results on two text-to-SQL semantic parsing datasets show that our token preprocessing, although simple, can substantially improve the LM performance on both types of generalization.
arXiv Detail & Related papers (2023-05-27T06:09:03Z)
- BERM: Training the Balanced and Extractable Representation for Matching to Improve Generalization Ability of Dense Retrieval [54.66399120084227]
We propose BERM, a novel method that improves the generalization of dense retrieval by capturing the matching signal.
Dense retrieval has shown promise in the first-stage retrieval process when trained on in-domain labeled datasets.
arXiv Detail & Related papers (2023-05-18T15:43:09Z)
- Guiding the PLMs with Semantic Anchors as Intermediate Supervision: Towards Interpretable Semantic Parsing [57.11806632758607]
We propose to couple current pretrained language models with a hierarchical decoder network.
By taking the first-principle structures as the semantic anchors, we propose two novel intermediate supervision tasks.
We conduct intensive experiments on several semantic parsing benchmarks and demonstrate that our approach can consistently outperform the baselines.
arXiv Detail & Related papers (2022-10-04T07:27:29Z)
- Breaking Character: Are Subwords Good Enough for MRLs After All? [36.11778282905458]
We pretrain a BERT-style language model over character sequences instead of word pieces.
We compare the resulting model, dubbed TavBERT, against contemporary PLMs based on subwords for three highly complex and ambiguous MRLs.
Our results show, for all tested languages, that while TavBERT obtains mild improvements on surface-level tasks, subword-based PLMs achieve significantly higher performance on semantic tasks.
arXiv Detail & Related papers (2022-04-10T18:54:43Z)
- Distributed Word Representation in Tsetlin Machine [14.62945824459286]
Tsetlin Machine (TM) is an interpretable pattern recognition algorithm based on propositional logic.
We propose a novel way of using pre-trained word representations for TM.
The approach significantly enhances TM performance and maintains interpretability at the same time.
arXiv Detail & Related papers (2021-04-14T14:48:41Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pretraining is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT-inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
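As a rough illustration of the twin-structure idea described in the BURT entry above, the sketch below shares one BERT encoder between two inputs, mean-pools the token states into fixed-size vectors, and scores their cosine similarity. It is a minimal sketch assuming the Hugging Face transformers library; BURT's actual training objectives (NLI and paraphrase supervision) and architectural details are not reproduced here.

```python
# Minimal Siamese-style sketch: one shared BERT encoder, mean-pooled into a
# fixed-size vector per input, compared by cosine similarity. An illustrative
# assumption-based sketch of the twin-structure idea, not BURT itself.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(text: str) -> torch.Tensor:
    """Encode `text` into a fixed-size vector by mean-pooling token states."""
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (1, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)      # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

a = embed("A man is playing a guitar.")
b = embed("Someone plays an instrument.")
print(torch.cosine_similarity(a, b).item())  # higher means more similar
```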
This list is automatically generated from the titles and abstracts of the papers on this site.