Should you marginalize over possible tokenizations?
- URL: http://arxiv.org/abs/2306.17757v1
- Date: Fri, 30 Jun 2023 16:09:01 GMT
- Title: Should you marginalize over possible tokenizations?
- Authors: Nadezhda Chirkova, Germán Kruszewski, Jos Rozen, Marc Dymetman
- Abstract summary: We show that the gap in log-likelihood is no larger than 0.5% in most cases.
- Score: 13.07994518230055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive language models (LMs) map token sequences to probabilities.
The usual practice for computing the probability of any character string (e.g.
English sentences) is to first transform it into a sequence of tokens that is
scored by the model. However, there are exponentially many token sequences that
represent any given string. To truly compute the probability of a string one
should marginalize over all tokenizations, which is typically intractable.
Here, we analyze whether the practice of ignoring the marginalization is
justified. To this end, we devise an importance-sampling-based algorithm that
allows us to compute estimates of the marginal probabilities and compare them
to the default procedure in a range of state-of-the-art models and datasets.
Our results show that the gap in log-likelihood is no larger than 0.5% in most
cases, but that it becomes more pronounced for data with long complex words.
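To make the estimator concrete, here is a toy sketch of the importance-sampling idea: sample alternative tokenizations of a string from a proposal distribution q, weight each sample by p(tokens)/q(tokens), and average to approximate the marginal probability. The vocabulary, the unigram "LM", and the naive uniform proposal below are all invented for illustration; this is not the authors' algorithm, only a sketch of the estimator E_q[p(t)/q(t)].

```python
import math
import random

# Toy subword vocabulary with unigram log-probabilities. Every single character
# is present so that any string can be segmented (as in BPE/unigram tokenizers).
VOCAB = {"u": -3.0, "n": -3.0, "d": -3.0, "o": -3.0, "un": -2.0,
         "do": -2.0, "undo": -1.5, "able": -1.5, "a": -3.0,
         "b": -3.0, "l": -3.0, "e": -3.0, "ab": -2.5, "le": -2.5}

def lm_logprob(tokens):
    """Stand-in for an autoregressive LM score: here just a unigram model."""
    return sum(VOCAB[t] for t in tokens)

def sample_tokenization(text, rng):
    """Proposal q: at each position, pick uniformly among matching vocab units.
    Returns the sampled token sequence and log q(tokens | text)."""
    tokens, logq, i = [], 0.0, 0
    while i < len(text):
        options = [v for v in VOCAB if text.startswith(v, i)]
        choice = rng.choice(options)
        tokens.append(choice)
        logq += -math.log(len(options))
        i += len(choice)
    return tokens, logq

def marginal_logprob(text, n_samples=2000, seed=0):
    """Importance-sampling estimate of log sum_t p(t), where t ranges over
    tokenizations of `text`: average the weights p(t)/q(t) under q."""
    rng = random.Random(seed)
    weights = []
    for _ in range(n_samples):
        tokens, logq = sample_tokenization(text, rng)
        weights.append(math.exp(lm_logprob(tokens) - logq))
    return math.log(sum(weights) / n_samples)

def canonical_logprob(text):
    """Default practice: score a single deterministic ("canonical")
    tokenization, here greedy longest match for simplicity."""
    tokens, i = [], 0
    while i < len(text):
        best = max((v for v in VOCAB if text.startswith(v, i)), key=len)
        tokens.append(best)
        i += len(best)
    return lm_logprob(tokens)

text = "undoable"
print("canonical log p:", canonical_logprob(text))
# The true marginal is at least the canonical probability, since the
# canonical tokenization is one term in the sum.
print("marginal  log p:", marginal_logprob(text))
```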
Related papers
- DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation [2.4555276449137042]
We propose a family of three new decoding methods by leveraging a mathematical analysis of the token probability distribution.
Our approach consistently performs at least as well as current alternatives in terms of quality and diversity.
arXiv Detail & Related papers (2025-02-19T19:00:02Z)
- Language Models Can Predict Their Own Behavior [28.80639362933004]
We show that the internal representation of input tokens alone can often precisely predict, not just the next token, but eventual behavior over the entire output sequence.
We leverage this capacity and learn probes on internal states to create early warning (and exit) systems.
Specifically, if the probes can confidently estimate the way the LM is going to behave, then the system will avoid generating tokens altogether and return the estimated behavior instead.
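A hypothetical illustration of this probe-and-exit idea follows; the probe weights, dimensions, and wrapper are invented for the sketch, whereas the paper learns probes on real LM hidden states.

```python
import numpy as np

# A linear probe reads one hidden state and predicts a label for the *whole*
# eventual output (e.g. "will refuse" vs "will answer"). If the probe is
# confident enough, we return its prediction instead of decoding.
rng = np.random.default_rng(0)
HIDDEN_DIM, NUM_LABELS = 16, 2
probe_W = rng.normal(size=(NUM_LABELS, HIDDEN_DIM))  # assumed already trained
probe_b = np.zeros(NUM_LABELS)

def probe_predict(hidden_state):
    logits = probe_W @ hidden_state + probe_b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(probs.argmax()), float(probs.max())

def answer(prompt_hidden_state, generate_fn, confidence_threshold=0.9):
    """Early-exit wrapper: skip generation when the probe is confident."""
    label, confidence = probe_predict(prompt_hidden_state)
    if confidence >= confidence_threshold:
        return {"predicted_behavior": label, "generated": None}
    return {"predicted_behavior": label, "generated": generate_fn()}

demo_hidden = rng.normal(size=HIDDEN_DIM)
print(answer(demo_hidden, generate_fn=lambda: "<decoded text here>"))
```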
arXiv Detail & Related papers (2025-02-18T23:13:16Z)
- Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification [50.717692060500696]
Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling.
Next-token prediction can be made robust so as to achieve $C = \tilde{O}(H)$, representing moderate error amplification.
No computationally efficient algorithm can achieve a sub-polynomial approximation factor $C = e^{(\log H)^{1-\Omega(1)}}$.
arXiv Detail & Related papers (2025-02-18T02:52:00Z)
- Where is the signal in tokenization space? [31.016041295876864]
Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences.
In this paper, we study non-canonical tokenizations.
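As a toy illustration of what "non-canonical tokenizations" means, the snippet below enumerates every token sequence that spells out a string under a made-up vocabulary; a deterministic tokenizer returns only one of these (the canonical one).

```python
def tokenizations(text, vocab):
    """Enumerate every token sequence over `vocab` that spells out `text`."""
    if not text:
        yield []
        return
    for v in vocab:
        if text.startswith(v):
            for rest in tokenizations(text[len(v):], vocab):
                yield [v] + rest

# Tiny made-up vocabulary, purely for illustration.
vocab = ["p", "r", "e", "t", "o", "k", "n", "pre", "to", "ken", "token"]
all_seqs = list(tokenizations("pretoken", vocab))
print(len(all_seqs))   # many non-canonical alternatives for one short string
print(all_seqs[:3])
```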
arXiv Detail & Related papers (2024-08-16T05:56:10Z)
- Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z)
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- Closing the Curious Case of Neural Text Degeneration [91.22954750742183]
We provide a theoretical explanation for the effectiveness of truncation sampling.
We show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability.
Our evaluations show that our method outperforms its threshold-based counterparts for low-entropy text generation.
arXiv Detail & Related papers (2023-10-02T23:16:25Z)
- Truncation Sampling as Language Model Desmoothing [115.28983143361681]
Long samples of text from neural language models can be of poor quality.
Truncation sampling algorithms set some words' probabilities to zero at each step.
We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold.
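A minimal sketch of an entropy-adaptive truncation rule in this spirit is given below; the threshold form and the epsilon value are illustrative only, so consult the paper for the exact definition and recommended hyperparameters.

```python
import numpy as np

def eta_truncate(probs, epsilon=0.1):
    """Entropy-dependent truncation sketch: drop tokens whose probability falls
    below a threshold that shrinks as the distribution's entropy grows, then
    renormalize. The large epsilon here just makes the pruning visible."""
    probs = np.asarray(probs, dtype=float)
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)))
    eta = min(epsilon, np.sqrt(epsilon) * np.exp(-entropy))
    kept = np.where(probs >= eta, probs, 0.0)
    return kept / kept.sum()

# Peaked (low-entropy) next-token distribution: the long tail is pruned hard.
print(eta_truncate([0.90, 0.05, 0.03, 0.01, 0.005, 0.005]))
# Flat (high-entropy) distribution: everything survives the cut.
print(eta_truncate([1 / 6] * 6))
```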
arXiv Detail & Related papers (2022-10-27T05:52:35Z)
- Robust Multi-Object Tracking by Marginal Inference [92.48078680697311]
Multi-object tracking in videos requires solving the fundamental problem of one-to-one assignment between objects in adjacent frames.
We present an efficient approach to compute a marginal probability for each pair of objects in real time.
It achieves competitive results on MOT17 and MOT20 benchmarks.
arXiv Detail & Related papers (2022-08-07T14:04:45Z)
- You should evaluate your language model on marginal likelihood over tokenisations [5.824498637088864]
We argue that language models should be evaluated on their marginal likelihood over tokenisations.
We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
arXiv Detail & Related papers (2021-09-06T15:37:02Z)
- AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.