Leading Whitespaces of Language Models' Subword Vocabulary Poses a Confound for Calculating Word Probabilities
- URL: http://arxiv.org/abs/2406.10851v1
- Date: Sun, 16 Jun 2024 08:44:56 GMT
- Title: Leading Whitespaces of Language Models' Subword Vocabulary Poses a Confound for Calculating Word Probabilities
- Authors: Byung-Doh Oh, William Schuler
- Abstract summary: We argue that there is a confound posed by the subword tokenization scheme of language models.
We present a simple decoding technique that reallocates the probability of the trailing whitespace to that of the current word.
- Score: 15.073507986272027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word-by-word conditional probabilities from Transformer-based language models are increasingly being used to evaluate their predictions over minimal pairs or to model the incremental processing difficulty of human readers. In this paper, we argue that there is a confound posed by the subword tokenization scheme of such language models, which has gone unaddressed thus far. This is because tokens in the subword vocabulary of most language models have leading whitespaces and therefore do not naturally define stop probabilities of words. We first prove that this can result in word probabilities that sum to more than one, thereby violating the axiom that $\mathsf{P}(\Omega) = 1$. This property results in a misallocation of word-by-word surprisal, where the unacceptability of the current 'end of word' is incorrectly carried over to the next word. Additionally, such implicit prediction of word boundaries by language models is incongruous with psycholinguistic experiments where human subjects directly observe upcoming word boundaries. We present a simple decoding technique that reallocates the probability of the trailing whitespace to that of the current word, which resolves this confound. As a case study, we show that this results in significantly different estimates of garden-path effects in transitive/intransitive sentences, where a comma is strongly expected before the critical word.
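To make the proposed fix concrete, the sketch below reallocates the boundary probability under stated assumptions: a GPT-2-style BPE vocabulary whose word-initial tokens carry a leading "Ġ", and the Hugging Face transformers API. The helper name `word_logprob` and the exact boundary-token set are illustrative choices, not the authors' released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Tokens that signal "the previous word has ended": whitespace-initial
# tokens plus end-of-text. A fuller treatment would also count
# punctuation-initial tokens as boundaries.
vocab = [tok.convert_ids_to_tokens(i) for i in range(len(tok))]
boundary_ids = torch.tensor(
    [i for i, t in enumerate(vocab) if t.startswith("Ġ") or t == tok.eos_token]
)

@torch.no_grad()
def word_logprob(context: str, word: str) -> float:
    """log P(word | context), with the probability that a boundary follows
    folded into the current word rather than carried over to the next."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    word_ids = tok(" " + word, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, word_ids], dim=1)
    logprobs = model(ids).logits.log_softmax(-1)

    # Naive word log-probability: sum of the subword log-probabilities
    # (position i-1 of the logits predicts token i).
    lp = sum(
        logprobs[0, ctx_ids.size(1) + j - 1, word_ids[0, j]].item()
        for j in range(word_ids.size(1))
    )
    # Reaccounting step: add the log-mass of all boundary-signalling next
    # tokens, so "this word ends here" is charged to this word. (The full
    # correction also removes the leading-whitespace mass that was already
    # credited to the previous word.)
    lp += torch.logsumexp(logprobs[0, -1, boundary_ids], dim=0).item()
    return lp
```

With this, `word_logprob("While the man hunted", "the")` includes the probability mass of every continuation that closes the word "the", not just the raw subword product.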
Related papers
- How to Compute the Probability of a Word [45.23856093235994]
This paper derives the correct methods for computing word probabilities.
We show that correcting the widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.
arXiv Detail & Related papers (2024-06-20T17:59:42Z)
- Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training [57.771940716189114]
We show that large language models (LLMs) suffer from the "reversal curse".
The root cause of the reversal curse lies in the different word order between the training and inference stages.
We propose Semantic-aware Permutation Training (SPT) to address this issue.
arXiv Detail & Related papers (2024-03-01T18:55:20Z)
- Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality.
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
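The "typical sampling" mentioned in that entry is a concrete decoding rule: keep the tokens whose surprisal lies closest to the entropy of the next-token distribution, up to a target mass. A toy NumPy version of that rule, as we read it from the paper's description (the threshold `tau` and the example distribution are illustrative):

```python
import numpy as np

def typical_filter(probs: np.ndarray, tau: float = 0.95) -> np.ndarray:
    """Toy locally typical sampling: keep tokens whose surprisal is closest
    to the entropy of the next-token distribution, up to mass tau."""
    logp = np.log(probs + 1e-12)
    entropy = -(probs * logp).sum()
    # Rank tokens by how far their surprisal is from the entropy.
    order = np.argsort(np.abs(-logp - entropy))
    keep = order[: np.searchsorted(np.cumsum(probs[order]), tau) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# The ranking starts from tokens whose surprisal is nearest the entropy,
# so the most probable token is not necessarily ranked first.
print(typical_filter(np.array([0.6, 0.2, 0.1, 0.05, 0.05])))
```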
- Studying word order through iterative shuffling [14.530986799844873]
We show that word order encodes meaning essential to performing NLP benchmark tasks.
We use IBIS, a novel, efficient procedure that finds the ordering of a bag of words having the highest likelihood under a fixed language model.
We discuss how shuffling inference procedures such as IBIS can benefit language modeling and constrained generation.
arXiv Detail & Related papers (2021-09-10T13:27:06Z)
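IBIS itself is more sophisticated than this, but the objective it optimizes can be sketched with a greedy stand-in: accept adjacent swaps whenever they raise the ordering's likelihood under a fixed language model. Model choice and swap schedule here are illustrative, not the authors' procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sent_logprob(words: list[str]) -> float:
    """Log-likelihood of a word sequence under the fixed language model,
    conditioning on end-of-text as a start symbol."""
    ids = tok(tok.eos_token + " " + " ".join(words), return_tensors="pt").input_ids
    logprobs = model(ids).logits.log_softmax(-1)
    # Score each token with the logits from the previous position.
    return sum(
        logprobs[0, i - 1, ids[0, i]].item() for i in range(1, ids.size(1))
    )

def reorder(bag: list[str], sweeps: int = 3) -> list[str]:
    """Greedy stand-in for iterative shuffling: accept adjacent swaps
    that raise the ordering's likelihood under the fixed model."""
    words = list(bag)
    for _ in range(sweeps):
        for i in range(len(words) - 1):
            swapped = words[:i] + [words[i + 1], words[i]] + words[i + 2:]
            if sent_logprob(swapped) > sent_logprob(words):
                words = swapped
    return words

print(reorder(["mat", "the", "sat", "cat", "on", "the"]))
```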
- MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct keywords from the rest of the words and to make low-confidence predictions when there is not enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z)
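The two regularizers can be pictured as extra loss terms. The sketch below is our reading of the summary above, with hypothetical module names (`token_head`, `cls_head`), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def masker_losses(model, input_ids, keyword_mask, mask_token_id, num_classes):
    """Sketch of MASKER-style regularization as described above:
    (1) reconstruct masked keywords from the surrounding context, and
    (2) push predictions toward uniform (low confidence) when the input
    is reduced to keywords only. Names and signatures are illustrative."""
    # (1) Masked keyword reconstruction.
    masked = input_ids.masked_fill(keyword_mask, mask_token_id)
    token_logits = model.token_head(masked)        # (B, T, V), hypothetical head
    recon_loss = F.cross_entropy(
        token_logits[keyword_mask], input_ids[keyword_mask]
    )
    # (2) Low confidence on keyword-only input: KL from the uniform
    # class distribution to the model's predictive distribution.
    keywords_only = input_ids.masked_fill(~keyword_mask, mask_token_id)
    cls_logits = model.cls_head(keywords_only)     # (B, C), hypothetical head
    uniform = torch.full_like(cls_logits, 1.0 / num_classes)
    entropy_loss = F.kl_div(
        cls_logits.log_softmax(-1), uniform, reduction="batchmean"
    )
    return recon_loss, entropy_loss
```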
- PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding [16.531103175919924]
We look into the task of generalizing word embeddings: given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words.
We propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embedding.
arXiv Detail & Related papers (2020-10-21T08:11:08Z)
- Exploring BERT's Sensitivity to Lexical Cues using Tests from Semantic Priming [8.08493736237816]
We present a case study analyzing the pre-trained BERT model with tests informed by semantic priming.
We find that BERT too shows "priming," predicting a word with greater probability when the context includes a related word versus an unrelated one.
Follow-up analysis shows BERT to be increasingly distracted by related prime words as context becomes more informative.
arXiv Detail & Related papers (2020-10-06T20:30:59Z)
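The priming measurement itself is straightforward to reproduce with a masked language model: compare the probability BERT assigns to a target word under a related versus an unrelated prime. A rough sketch, where the prime/target pair is a made-up example rather than an item from the paper's stimuli:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def target_logprob(context: str, target: str) -> float:
    """Log-probability BERT assigns to `target` at the [MASK] position."""
    ids = tok(context, return_tensors="pt").input_ids
    mask_pos = (ids == tok.mask_token_id).nonzero()[0, 1]
    logprobs = model(ids).logits.log_softmax(-1)
    return logprobs[0, mask_pos, tok.convert_tokens_to_ids(target)].item()

# Related vs. unrelated prime for the same target (illustrative pair).
related = target_logprob("The doctor examined the [MASK].", "patient")
unrelated = target_logprob("The gardener examined the [MASK].", "patient")
print(related - unrelated)  # positive difference suggests a priming effect
```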
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
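The entropy operationalization is simple to state; a toy version over a hand-specified sense distribution is shown below. The sense probabilities are illustrative numbers, since the paper estimates meaning distributions from data.

```python
import math

def meaning_entropy(sense_probs: dict[str, float]) -> float:
    """Entropy (in bits) of a word's distribution over meanings."""
    return -sum(p * math.log2(p) for p in sense_probs.values() if p > 0)

# Toy sense distributions: "bank" spreads its mass over two senses and so
# has higher entropy (more ambiguity) than a word dominated by one sense.
print(meaning_entropy({"river_bank": 0.4, "financial_bank": 0.6}))  # ~0.97 bits
print(meaning_entropy({"fruit": 0.95, "other": 0.05}))              # ~0.29 bits
```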
- Toward Better Storylines with Sentence-Level Language Models [54.91921545103256]
We propose a sentence-level language model which selects the next sentence in a story from a finite set of fluent alternatives.
We demonstrate the effectiveness of our approach with state-of-the-art accuracy on the unsupervised Story Cloze task.
arXiv Detail & Related papers (2020-05-11T16:54:19Z)
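Selecting from a finite set of candidate continuations can be approximated with any autoregressive model, even though the paper's model operates at the sentence level. A crude token-level sketch, where the story and candidates are made-up examples:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def continuation_logprob(story: str, candidate: str) -> float:
    """Log P(candidate | story), summed over the candidate's tokens.
    Note this total is length-sensitive; a fairer comparison across
    candidates of different lengths would normalize by token count."""
    ctx = tok(story, return_tensors="pt").input_ids
    cand = tok(" " + candidate, return_tensors="pt").input_ids
    ids = torch.cat([ctx, cand], dim=1)
    logprobs = model(ids).logits.log_softmax(-1)
    return sum(
        logprobs[0, ctx.size(1) + j - 1, cand[0, j]].item()
        for j in range(cand.size(1))
    )

story = "Sam trained for months before the marathon."
candidates = ["He finished the race exhausted but proud.",
              "The stock market closed higher on Tuesday."]
print(max(candidates, key=lambda c: continuation_logprob(story, c)))
```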