You should evaluate your language model on marginal likelihood over tokenisations
- URL: http://arxiv.org/abs/2109.02550v1
- Date: Mon, 6 Sep 2021 15:37:02 GMT
- Title: You should evaluate your language model on marginal likelihood over tokenisations
- Authors: Kris Cao and Laura Rimell
- Abstract summary: We argue that language models should be evaluated on their marginal likelihood over tokenisations.
We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
- Score: 5.824498637088864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural language models typically tokenise input text into sub-word units to
achieve an open vocabulary. The standard approach is to use a single canonical
tokenisation at both train and test time. We suggest that this approach is
unsatisfactory and may bottleneck our evaluation of language model performance.
Using only the one-best tokenisation ignores tokeniser uncertainty over
alternative tokenisations, which may hurt model out-of-domain performance.
In this paper, we argue that instead, language models should be evaluated on
their marginal likelihood over tokenisations. We compare different estimators
for the marginal likelihood based on sampling, and show that it is feasible to
estimate the marginal likelihood with a manageable number of samples. We then
evaluate pretrained English and German language models on both the
one-best-tokenisation and marginal perplexities, and show that the marginal
perplexity can be significantly better than the one best, especially on
out-of-domain data. We link this difference in perplexity to the tokeniser
uncertainty as measured by tokeniser entropy. We discuss some implications of
our results for language model training and evaluation, particularly with
regard to tokenisation robustness.
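As a concrete illustration of the quantity being compared, the sketch below approximates the marginal log-likelihood log p(x) = log sum_t p(t), where the sum ranges over tokenisations t that decode to the string x, by importance sampling: tokenisations are drawn from a proposal q(t | x) (for instance a stochastic subword tokeniser) and the weights p(t)/q(t | x) are averaged in log space. This is a minimal sketch under those assumptions, not the authors' released code; sample_tokenisation, lm_log_prob, and the default sample count are hypothetical placeholders.

import math
from typing import List, Tuple

def sample_tokenisation(text: str) -> Tuple[List[int], float]:
    # Hypothetical helper: draw one tokenisation of `text` from a stochastic
    # tokeniser and return the token ids together with log q(t | x), the
    # proposal log-probability of that tokenisation.
    raise NotImplementedError

def lm_log_prob(token_ids: List[int]) -> float:
    # Hypothetical helper: score one tokenisation under the language model,
    # returning log p(t), the sum of per-token log-probabilities.
    raise NotImplementedError

def logsumexp(values: List[float]) -> float:
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def marginal_log_likelihood(text: str, num_samples: int = 32) -> float:
    # Importance-sampling estimate of log p(x):
    #   log p(x) ~= log( (1/K) * sum_k p(t_k) / q(t_k | x) ),  t_k ~ q(. | x)
    log_weights = []
    for _ in range(num_samples):
        token_ids, log_q = sample_tokenisation(text)
        log_weights.append(lm_log_prob(token_ids) - log_q)
    return logsumexp(log_weights) - math.log(num_samples)

def perplexity(total_log_likelihood: float, num_units: float) -> float:
    # Convert a corpus-level log-likelihood into perplexity per unit; the
    # unit count (e.g. characters) should not depend on the sampled
    # tokenisation so that one-best and marginal perplexities stay comparable.
    return math.exp(-total_log_likelihood / num_units)

With a single deterministic tokenisation and log q = 0, the same routine reduces to the usual one-best log-likelihood, so the two perplexities can be compared over the same unit count (for example, per character).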
Related papers
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs).
This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.
We develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
- On the Proper Treatment of Tokenization in Psycholinguistics [53.960910019072436]
The paper argues that token-level language models should be marginalized into character-level language models before they are used in psycholinguistic studies.
We find various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
arXiv Detail & Related papers (2024-10-03T17:18:03Z)
- Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z)
- A Probability--Quality Trade-off in Aligned Language Models and its Relation to Sampling Adaptors [50.046717886067555]
We show that when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood.
We provide a formal treatment of this phenomenon and demonstrate how a choice of sampling adaptor allows for a selection of how much likelihood we exchange for the reward.
arXiv Detail & Related papers (2024-06-14T17:38:21Z)
- Closing the Curious Case of Neural Text Degeneration [91.22954750742183]
We provide a theoretical explanation for the effectiveness of truncation sampling.
We show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability.
Our evaluations show that our method outperforms its threshold-based counterparts for low-entropy text generation.
arXiv Detail & Related papers (2023-10-02T23:16:25Z)
- Assessing Keyness using Permutation Tests [0.0]
We replace the token-by-token sampling model by a model where corpora are samples of documents rather than tokens.
We do not need any assumption on how the tokens are organized within or across documents, and the approach works with basically *any* keyness score.
arXiv Detail & Related papers (2023-08-25T13:52:57Z)
- Should you marginalize over possible tokenizations? [13.07994518230055]
We show that the gap in log-likelihood is no larger than 0.5% in most cases.
arXiv Detail & Related papers (2023-06-30T16:09:01Z)
- Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
arXiv Detail & Related papers (2022-12-02T18:10:42Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.