You should evaluate your language model on marginal likelihood over tokenisations
- URL: http://arxiv.org/abs/2109.02550v1
- Date: Mon, 6 Sep 2021 15:37:02 GMT
- Title: You should evaluate your language model on marginal likelihood over tokenisations
- Authors: Kris Cao and Laura Rimell
- Abstract summary: We argue that language models should be evaluated on their marginal likelihood over tokenisations.
We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
- Score: 5.824498637088864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural language models typically tokenise input text into sub-word units to
achieve an open vocabulary. The standard approach is to use a single canonical
tokenisation at both train and test time. We suggest that this approach is
unsatisfactory and may bottleneck our evaluation of language model performance.
Using only the one-best tokenisation ignores tokeniser uncertainty over
alternative tokenisations, which may hurt model out-of-domain performance.
In this paper, we argue that instead, language models should be evaluated on
their marginal likelihood over tokenisations. We compare different estimators
for the marginal likelihood based on sampling, and show that it is feasible to
estimate the marginal likelihood with a manageable number of samples. We then
evaluate pretrained English and German language models on both the
one-best-tokenisation and marginal perplexities, and show that the marginal
perplexity can be significantly better than the one best, especially on
out-of-domain data. We link this difference in perplexity to the tokeniser
uncertainty as measured by tokeniser entropy. We discuss some implications of
our results for language model training and evaluation, particularly with
regard to tokenisation robustness.
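The sampling-based estimation described in the abstract can be illustrated with a small importance-sampling sketch. The tokenisations, their joint log-probabilities, and the proposal distribution below are invented toy numbers, not figures from the paper: the marginal likelihood p(x) = Σ_t p(x, t) is estimated by sampling tokenisations t from a proposal q(t | x) (e.g. a stochastic tokeniser) and averaging the importance weights p(x, t) / q(t | x).

```python
import math
import random

# Toy example: three hypothetical tokenisations of the same string,
# with made-up joint log-probabilities log p(x, t) under the model.
log_p = {"to|ken|ise": -7.1, "tok|en|ise": -8.3, "token|ise": -6.4}
# Made-up proposal probabilities q(t | x), e.g. from a sampling tokeniser.
q = {"to|ken|ise": 0.2, "tok|en|ise": 0.1, "token|ise": 0.7}

def marginal_estimate(n_samples, rng):
    """Unbiased importance-sampling estimate of p(x) = sum_t p(x, t)."""
    toks = list(q)
    weights = [q[t] for t in toks]
    total = 0.0
    for _ in range(n_samples):
        t = rng.choices(toks, weights)[0]          # t ~ q(t | x)
        total += math.exp(log_p[t]) / q[t]          # weight p(x, t) / q(t | x)
    return total / n_samples

rng = random.Random(0)
est = marginal_estimate(10_000, rng)
# Exact marginal, computable here because the toy support is tiny.
exact = sum(math.exp(lp) for lp in log_p.values())
```

With a low-variance proposal like this one, a manageable number of samples already brings the estimate close to the exact marginal, mirroring the paper's feasibility claim.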
Related papers
- Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z) - Closing the Curious Case of Neural Text Degeneration [91.22954750742183]
We provide a theoretical explanation for the effectiveness of truncation sampling.
We show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability.
Our evaluations show that our method outperforms its threshold-based counterparts for low-entropy text generation.
arXiv Detail & Related papers (2023-10-02T23:16:25Z) - Assessing Keyness using Permutation Tests [0.0]
We replace the token-by-token sampling model by a model where corpora are samples of documents rather than tokens.
We do not need any assumption on how the tokens are organized within or across documents, and the approach works with basically *any* keyness score.
arXiv Detail & Related papers (2023-08-25T13:52:57Z) - Should you marginalize over possible tokenizations? [13.07994518230055]
We show that the gap in log-likelihood is no larger than 0.5% in most cases.
arXiv Detail & Related papers (2023-06-30T16:09:01Z) - Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
arXiv Detail & Related papers (2022-12-02T18:10:42Z) - On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation [86.19634542434711]
Mauve measures an information-theoretic divergence between two probability distributions over strings.
We show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance.
We conclude that -- by encoding syntactic- and coherence-level features of text, while ignoring surface-level features -- such cluster-based substitutes to string distributions may simply be better for evaluating state-of-the-art language generators.
arXiv Detail & Related papers (2022-05-31T17:58:49Z) - Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized and there is no need to further conduct a correction step.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
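The self-normalized estimator summarized above divides weighted samples by the sum of the weights, so only an unnormalized target density is needed and no separate correction step is required. The distributions below are toy choices for illustration, not the paper's speech-recognition setup:

```python
import math
import random

def snis(f, log_p_tilde, log_q, sample_q, n, rng):
    """Self-normalized importance sampling: estimate E_p[f] using
    samples from q and an *unnormalized* target p~.
    Returns sum_i w_i f(x_i) / sum_i w_i with w_i = p~(x_i) / q(x_i)."""
    xs = [sample_q(rng) for _ in range(n)]
    ws = [math.exp(log_p_tilde(x) - log_q(x)) for x in xs]
    return sum(w * f(x) for w, x in zip(ws, xs)) / sum(ws)

# Toy target: p~(x) proportional to exp(-0.5 x) on {0, ..., 9}.
def log_p_tilde(x): return -0.5 * x
def log_q(x): return math.log(1 / 10)        # uniform proposal
def sample_q(rng): return rng.randrange(10)

rng = random.Random(1)
mean_est = snis(lambda x: x, log_p_tilde, log_q, sample_q, 20_000, rng)

# Exact mean under the normalized target, for comparison.
Z = sum(math.exp(-0.5 * x) for x in range(10))
exact = sum(x * math.exp(-0.5 * x) for x in range(10)) / Z
```

The division by the summed weights is what makes the estimator self-normalizing: the unknown normalizer of p~ cancels in the ratio.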
arXiv Detail & Related papers (2021-11-11T16:57:53Z) - Token Drop mechanism for Neural Machine Translation [12.666468105300002]
We propose Token Drop to improve generalization and avoid overfitting for the NMT model.
Similar to word dropout, except that we replace each dropped token with a special token instead of zeroing the word embedding.
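A minimal sketch of that replacement scheme, with an assumed placeholder token name and drop rate chosen for illustration:

```python
import random

DROP_TOKEN = "<drop>"  # hypothetical special placeholder token

def token_drop(tokens, p_drop, rng):
    """Randomly replace tokens with a placeholder token, rather than
    zeroing their embeddings as in standard word dropout."""
    return [DROP_TOKEN if rng.random() < p_drop else t for t in tokens]

rng = random.Random(42)
sent = ["we", "propose", "token", "drop", "for", "nmt"]
augmented = token_drop(sent, 0.3, rng)
```

The sequence length is preserved; only the identities of the dropped positions change, which keeps the model's input shape intact while injecting noise.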
arXiv Detail & Related papers (2020-10-21T14:02:27Z) - Are Some Words Worth More than Others? [3.5598388686985354]
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z) - Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here (including all linked content) and is not responsible for any consequences of its use.