Exploring BERT's Sensitivity to Lexical Cues using Tests from Semantic
Priming
- URL: http://arxiv.org/abs/2010.03010v1
- Date: Tue, 6 Oct 2020 20:30:59 GMT
- Title: Exploring BERT's Sensitivity to Lexical Cues using Tests from Semantic
Priming
- Authors: Kanishka Misra, Allyson Ettinger, Julia Taylor Rayz
- Abstract summary: We present a case study analyzing the pre-trained BERT model with tests informed by semantic priming.
We find that BERT too shows "priming," predicting a word with greater probability when the context includes a related word versus an unrelated one.
Follow-up analysis shows BERT to be increasingly distracted by related prime words as context becomes more informative.
- Score: 8.08493736237816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Models trained to estimate word probabilities in context have become
ubiquitous in natural language processing. How do these models use lexical cues
in context to inform their word probabilities? To answer this question, we
present a case study analyzing the pre-trained BERT model with tests informed
by semantic priming. Using English lexical stimuli that show priming in humans,
we find that BERT too shows "priming," predicting a word with greater
probability when the context includes a related word versus an unrelated one.
This effect decreases as the amount of information provided by the context
increases. Follow-up analysis shows BERT to be increasingly distracted by
related prime words as context becomes more informative, assigning lower
probabilities to related words. Our findings highlight the importance of
considering contextual constraint effects when studying word prediction in
these models, and highlight possible parallels with human processing.
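To make the setup concrete, here is a minimal sketch of the kind of probability comparison the abstract describes, assuming the Hugging Face `transformers` library and `bert-base-uncased`; the sentences and the doctor/nurse prime-target pair are toy examples invented for illustration, not the paper's stimuli.

```python
# Minimal sketch (not the authors' code): compare the probability BERT assigns
# to a target word when the context contains a related vs. an unrelated prime.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def target_probability(context: str, target: str) -> float:
    """Probability BERT assigns to `target` (a single WordPiece) at the [MASK] position."""
    inputs = tokenizer(context, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_pos], dim=-1)  # shape: (1, vocab_size)
    return probs[0, tokenizer.convert_tokens_to_ids(target)].item()

# Related prime ("doctor") vs. unrelated prime ("table") for the target "nurse".
related = "There was a doctor standing next to the [MASK]."
unrelated = "There was a table standing next to the [MASK]."
print("related:  ", target_probability(related, "nurse"))
print("unrelated:", target_probability(unrelated, "nurse"))
```

Repeating the same comparison over contexts that vary in how constraining they are is the kind of analysis the abstract uses to study how the priming effect changes with contextual constraint.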
Related papers
- Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities [15.073507986272027]
We argue that there is a confound posed by the most common method of aggregating subword probabilities into word probabilities.
This is due to the fact that tokens in the subword vocabulary of most language models have leading whitespaces.
We present a simple decoding technique that reallocates the probability of the trailing whitespace to that of the current word.
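To illustrate the confound described in this summary, here is a toy sketch assuming the Hugging Face `transformers` GPT-2 tokenizer; this is my illustration, not code from the paper.

```python
# Toy illustration of the leading-whitespace confound in subword vocabularies.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

# The same orthographic word maps to different tokens depending on whether a
# space precedes it; "Ġ" marks the space folded into the token (outputs are
# typical GPT-2 tokenizations, e.g. ['the'] vs. ['Ġthe']).
print(tok.tokenize("the"))
print(tok.tokenize(" the"))

# A "word probability" obtained by summing the log-probabilities of the
# subwords spanning " the" therefore also includes the probability of the
# whitespace before the word, which is the confound the paper describes.
```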
arXiv Detail & Related papers (2024-06-16T08:44:56Z) - Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are insensitive to word order.
We propose Forced Invalidation to help preserve the importance of word order.
Our experiments demonstrate that Forced Invalidation significantly improves the sensitivity of the models to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z) - BERT, can HE predict contrastive focus? Predicting and controlling
prominence in neural TTS using a language model [29.188684861193092]
We evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on utterances containing contrastive focus.
We also evaluate the controllability of pronoun prominence in a TTS model conditioned on acoustic prominence features.
arXiv Detail & Related papers (2022-07-04T20:43:41Z) - Naturalistic Causal Probing for Morpho-Syntax [76.83735391276547]
We suggest a naturalistic strategy for input-level intervention on real world data in Spanish.
Using our approach, we isolate morpho-syntactic features from confounders in sentences.
We apply this methodology to analyze causal effects of gender and number on contextualized representations extracted from pre-trained models.
arXiv Detail & Related papers (2022-05-14T11:47:58Z) - Improving Contextual Representation with Gloss Regularized Pre-training [9.589252392388758]
We propose an auxiliary gloss regularizer module for BERT pre-training (GR-BERT) to enhance word semantic similarity.
By predicting masked words and aligning contextual embeddings to corresponding glosses simultaneously, the word similarity can be explicitly modeled.
Experimental results show that the gloss regularizer benefits BERT in word-level and sentence-level semantic representation.
arXiv Detail & Related papers (2022-05-13T12:50:32Z) - Language Models Explain Word Reading Times Better Than Empirical
Predictability [20.38397241720963]
The traditional approach in cognitive reading research assumes that word predictability from sentence context is best captured by cloze completion probability (CCP).
Probabilistic language models provide deeper explanations of syntactic and semantic effects than CCP.
N-gram and RNN probabilities of the present word more consistently predicted reading performance compared with topic models or CCP.
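For context, here is a minimal sketch of computing token-level surprisal, -log P(token | preceding tokens), from an autoregressive language model. GPT-2 via Hugging Face `transformers` is my stand-in here; the paper itself compares n-gram, RNN, and topic-model probabilities against CCP, and word-level estimates would additionally require aligning subword tokens to words.

```python
# Minimal sketch (my illustration, not the paper's code): per-token surprisal
# from an autoregressive LM, the kind of quantity used to predict reading times.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sentence = "The cat sat on the mat."
ids = tok(sentence, return_tensors="pt")["input_ids"]  # shape: (1, seq_len)
with torch.no_grad():
    log_probs = torch.log_softmax(model(ids).logits, dim=-1)

# Surprisal of token t is -log P(token_t | tokens_<t); the first token is
# skipped because it has no left context in this toy setup.
for t in range(1, ids.shape[1]):
    surprisal = -log_probs[0, t - 1, ids[0, t]].item()
    print(f"{tok.decode(ids[0, t])!r}\t{surprisal:.2f}")
```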
arXiv Detail & Related papers (2022-02-02T16:38:43Z) - Automatically Identifying Semantic Bias in Crowdsourced Natural Language
Inference Datasets [78.6856732729301]
We introduce a model-driven, unsupervised technique to find "bias clusters" in a learned embedding space of hypotheses in NLI datasets.
Interventions and additional rounds of labeling can then be performed to ameliorate the semantic bias of a dataset's hypothesis distribution.
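As a rough sketch of the general recipe, one can embed the hypothesis sentences and cluster them in that embedding space, then inspect clusters whose members share surface cues; the encoder and clustering algorithm below are my assumptions for illustration, not necessarily the paper's choices.

```python
# Hypothetical sketch: cluster NLI hypotheses in an embedding space to surface
# candidate "bias clusters". Encoder and clustering choices are illustrative only.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

hypotheses = [
    "The man is not sleeping.",
    "Nobody is outside.",
    "A woman is cooking dinner.",
    "The children are playing soccer.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(hypotheses)  # shape: (n_hypotheses, dim)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)

# Manually inspect clusters that share surface cues (e.g., negation words);
# such clusters are candidates for re-labeling or targeted data interventions.
for hyp, lab in zip(hypotheses, labels):
    print(lab, hyp)
```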
arXiv Detail & Related papers (2021-12-16T22:49:01Z) - MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z) - On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance on semantic similarity tasks.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
arXiv Detail & Related papers (2020-11-02T13:14:57Z) - Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
arXiv Detail & Related papers (2020-05-27T16:44:01Z)