Should you marginalize over possible tokenizations?
- URL: http://arxiv.org/abs/2306.17757v1
- Date: Fri, 30 Jun 2023 16:09:01 GMT
- Title: Should you marginalize over possible tokenizations?
- Authors: Nadezhda Chirkova, Germán Kruszewski, Jos Rozen, Marc Dymetman
- Abstract summary: We show that the gap in log-likelihood is no larger than 0.5% in most cases.
- Score: 13.07994518230055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive language models (LMs) map token sequences to probabilities.
The usual practice for computing the probability of any character string (e.g.
English sentences) is to first transform it into a sequence of tokens that is
scored by the model. However, there are exponentially many token sequences that
represent any given string. To truly compute the probability of a string one
should marginalize over all tokenizations, which is typically intractable.
Here, we analyze whether the practice of ignoring the marginalization is
justified. To this end, we devise an importance-sampling-based algorithm that
allows us to compute estimates of the marginal probabilities and compare them
to the default procedure in a range of state-of-the-art models and datasets.
Our results show that the gap in log-likelihood is no larger than 0.5% in most
cases, but that it becomes more pronounced for data with long complex words.
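To make the marginalization concrete, the sketch below enumerates every tokenization of a short string under a toy vocabulary with made-up, context-independent log-probabilities (a real autoregressive LM scores each token given its prefix). The exponential growth of this enumeration with string length is what motivates the paper's importance-sampling estimator.

```python
import math

# Toy vocabulary with made-up, context-free log-probabilities; a real
# autoregressive LM would score each token conditioned on its prefix.
LOGPROB = {"c": -3.0, "a": -3.0, "t": -3.0, "ca": -2.0, "at": -2.0, "cat": -1.0}

def tokenizations(s):
    """Enumerate every segmentation of s into vocabulary tokens."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        if s[:i] in LOGPROB:
            for rest in tokenizations(s[i:]):
                yield [s[:i]] + rest

def marginal_logprob(s):
    """log p(s): sum the probability of every tokenization, in log space."""
    scores = [sum(LOGPROB[t] for t in toks) for toks in tokenizations(s)]
    m = max(scores)
    return m + math.log(sum(math.exp(x - m) for x in scores))

print(list(tokenizations("cat")))  # 4 tokenizations of a 3-character string
print(marginal_logprob("cat"))     # exceeds the score of any single tokenization
```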
Related papers
- Where is the signal in tokenization space? [31.016041295876864]
Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences.
In this paper, we study non-canonical tokenizations.
arXiv Detail & Related papers (2024-08-16T05:56:10Z)
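The entry above contrasts the canonical token sequence a tokenizer returns with the many non-canonical alternatives that decode to the same string. A minimal sketch, using greedy longest-match as a simplified stand-in for canonical encoding (real BPE applies learned merge rules) and a made-up vocabulary:

```python
# Greedy longest-match here is a simplified stand-in for a real tokenizer's
# canonical encoding; the vocabulary is illustrative, not from any model.
VOCAB = {"t", "h", "e", "r", "th", "er", "re", "the", "there"}

def canonical(s):
    """Return the single tokenization a greedy tokenizer would produce."""
    out = []
    while s:
        for i in range(len(s), 0, -1):
            if s[:i] in VOCAB:
                out.append(s[:i])
                s = s[i:]
                break
        else:
            raise ValueError("untokenizable string")
    return out

print(canonical("there"))     # ['there'], the canonical sequence
# Non-canonical alternatives decode to the same string, e.g.:
print(["th", "er", "e"])      # also spells "there"
print(["t", "h", "e", "re"])  # and so does this
```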
- Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z)
- How to Compute the Probability of a Word [45.23856093235994]
This paper derives the correct methods for computing word probabilities.
We show that correcting the widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.
arXiv Detail & Related papers (2024-06-20T17:59:42Z)
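For context, the sketch below shows the naive word-probability recipe that this line of work critiques: summing the log-probabilities of a word's subword tokens. The `tokenize` and `token_logprob` callables are hypothetical stand-ins for a real tokenizer and LM; the paper argues this recipe mishandles word boundaries (e.g., leading-space token conventions), so treat it as the baseline being corrected, not the fix.

```python
def naive_word_logprob(word, context_tokens, tokenize, token_logprob):
    """The common recipe: add up the log-probs of the word's subword tokens.
    Per the paper, this mishandles word boundaries; it is the baseline
    being corrected, not the corrected method."""
    logp = 0.0
    for tok in tokenize(word):
        logp += token_logprob(tok, context_tokens)  # log p(tok | context)
        context_tokens = context_tokens + [tok]
    return logp

# Dummy stand-ins so the sketch runs; swap in a real tokenizer and LM.
demo_tokenize = lambda w: [w[:2], w[2:]] if len(w) > 2 else [w]
demo_logprob = lambda tok, ctx: -2.0   # constant score, illustration only
print(naive_word_logprob("probability", [], demo_tokenize, demo_logprob))  # -4.0
```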
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- Closing the Curious Case of Neural Text Degeneration [91.22954750742183]
We provide a theoretical explanation for the effectiveness of truncation sampling.
We show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability.
Our evaluations show that our method outperforms its threshold-based counterparts for low-entropy text generation.
arXiv Detail & Related papers (2023-10-02T23:16:25Z)
- Compositional Generalization without Trees using Multiset Tagging and Latent Permutations [121.37328648951993]
We phrase semantic parsing as a two-step process: we first tag each input token with a multiset of output tokens.
Then we arrange the tokens into an output sequence using a new way of parameterizing and predicting permutations.
Our model outperforms pretrained seq2seq models and prior work on realistic semantic parsing tasks.
arXiv Detail & Related papers (2023-05-26T14:09:35Z)
- Truncation Sampling as Language Model Desmoothing [115.28983143361681]
Long samples of text from neural language models can be of poor quality.
Truncation sampling algorithms set some words' probabilities to zero at each step.
We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold.
arXiv Detail & Related papers (2022-10-27T05:52:35Z)
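A minimal sketch of one decoding step of entropy-dependent truncation in the spirit of $\eta$-sampling. The threshold form eta = min(eps, sqrt(eps) * exp(-H)), with H the entropy of the next-token distribution, is our reading of the paper's rule; the eps value and the toy distribution below are assumptions.

```python
import math
import random

def eta_sample(probs, eps=0.0009):
    """One decoding step of entropy-dependent truncation (eta-sampling sketch).
    Tokens with probability at or below eta = min(eps, sqrt(eps) * exp(-H))
    are zeroed out; H is the entropy of the distribution."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    eta = min(eps, math.sqrt(eps) * math.exp(-entropy))
    kept = [p if p > eta else 0.0 for p in probs]  # zero out the smoothed tail
    total = sum(kept)
    kept = [p / total for p in kept]               # renormalize surviving mass
    return random.choices(range(len(kept)), weights=kept)[0]

# Confident distribution: eta equals eps, so only the 0.0005 tail token is
# pruned; with high entropy the threshold drops and fewer tokens are cut.
print(eta_sample([0.90, 0.05, 0.03, 0.0195, 0.0005]))
```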
- Robust Multi-Object Tracking by Marginal Inference [92.48078680697311]
Multi-object tracking in videos requires solving the fundamental problem of one-to-one assignment between objects in adjacent frames.
We present an efficient approach to compute a marginal probability for each pair of objects in real time.
It achieves competitive results on MOT17 and MOT20 benchmarks.
arXiv Detail & Related papers (2022-08-07T14:04:45Z)
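To unpack what a pairwise marginal means in the tracking entry above, the brute-force sketch below sums over all one-to-one assignments (permutations) of a toy affinity matrix. The paper's contribution is computing such marginals efficiently in real time, which this O(n!) enumeration deliberately does not attempt; the matrix values are made up.

```python
import itertools
import math

# Toy 3x3 affinity matrix: score[i][j] is how well detection i in frame t
# matches detection j in frame t+1 (made-up numbers).
score = [[0.9, 0.1, 0.0],
         [0.2, 0.7, 0.1],
         [0.0, 0.2, 0.8]]

def pairwise_marginals(score):
    """Marginal probability that i is assigned to j, summing over all
    one-to-one assignments (permutations) weighted by their total score."""
    n = len(score)
    weight = lambda perm: math.prod(score[i][perm[i]] for i in range(n))
    z = sum(weight(p) for p in itertools.permutations(range(n)))
    marg = [[0.0] * n for _ in range(n)]
    for p in itertools.permutations(range(n)):
        w = weight(p) / z
        for i in range(n):
            marg[i][p[i]] += w
    return marg

m = pairwise_marginals(score)
print(round(m[0][0], 3))  # probability that detection 0 stays on track 0
```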
- You should evaluate your language model on marginal likelihood over tokenisations [5.824498637088864]
We argue that language models should be evaluated on their marginal likelihood over tokenisations.
We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
arXiv Detail & Related papers (2021-09-06T15:37:02Z)
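A hedged sketch of the generic importance-sampling estimator behind marginal-likelihood evaluation (and the main paper above): sample tokenizations t_i of a fixed string from a proposal q and combine the ratios p(t_i)/q(t_i) in log space. All callables and the demo numbers are hypothetical stand-ins.

```python
import math
import random

def marginal_logprob_is(sample_tok, model_logprob, proposal_logprob, n=1000):
    """Importance-sampling estimate of log p(s) = log sum_t p(t), where t
    ranges over tokenizations of a fixed string s. sample_tok draws t from a
    proposal q; estimator: logsumexp_i(log p(t_i) - log q(t_i)) - log n."""
    ratios = [model_logprob(t) - proposal_logprob(t)
              for t in (sample_tok() for _ in range(n))]
    m = max(ratios)
    return m + math.log(sum(math.exp(r - m) for r in ratios)) - math.log(n)

# Trivial demo: two tokenizations drawn uniformly, with made-up scores.
logp = {0: -4.0, 1: -5.0}  # log p(t) for tokenizations 0 and 1
est = marginal_logprob_is(
    sample_tok=lambda: random.randrange(2),
    model_logprob=lambda t: logp[t],
    proposal_logprob=lambda t: math.log(0.5),
)
print(est)  # converges to log(e**-4 + e**-5), approximately -3.687
```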
- AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of which utterances or tokens are dull, without any feature engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.