Evaluating Distributional Distortion in Neural Language Modeling
- URL: http://arxiv.org/abs/2203.12788v1
- Date: Thu, 24 Mar 2022 01:09:46 GMT
- Title: Evaluating Distributional Distortion in Neural Language Modeling
- Authors: Benjamin LeBrun, Alessandro Sordoni, Timothy J. O'Donnell
- Abstract summary: A heavy tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
- Score: 81.83408583979745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A fundamental characteristic of natural language is the high rate at which
speakers produce novel expressions. Because of this novelty, a heavy tail of
rare events accounts for a significant amount of the total probability mass of
distributions in language (Baayen, 2001). Standard language modeling metrics
such as perplexity quantify the performance of language models (LMs) in
aggregate. As a result, we have relatively little understanding of whether
neural LMs accurately estimate the probability of sequences in this heavy tail
of rare events. To address this gap, we develop a controlled evaluation scheme
which uses generative models trained on natural data as artificial languages
from which we can exactly compute sequence probabilities. Training LMs on
generations from these artificial languages, we compare the sequence-level
probability estimates given by LMs to the true probabilities in the target
language. Our experiments reveal that LSTM and Transformer language models (i)
systematically underestimate the probability of sequences drawn from the target
language, and (ii) do so more severely for less-probable sequences.
Investigating where this probability mass went, (iii) we find that LMs tend to
overestimate the probability of ill-formed (perturbed) sequences. In addition,
we find that this underestimation behaviour (iv) is weakened, but not
eliminated, by greater amounts of training data, and (v) is exacerbated for
target distributions with lower entropy.
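The evaluation scheme described in the abstract can be made concrete with a small sketch. The snippet below is a minimal illustration, not the authors' code: it treats a known generative model as the artificial language, so every sampled sequence has an exactly computable true probability, and compares a second model's estimates against those true values, binned by true probability. The toy bigram models, vocabulary size, and sequence length are assumptions for illustration only; the paper trains LSTM and Transformer LMs on samples from neural generative models fitted to natural data.

```python
# Minimal sketch (not the authors' code) of the controlled evaluation idea:
# a known generative model serves as the "artificial language", so every sampled
# sequence has an exactly computable probability, and a second model's estimates
# are compared against those true values. Both models here are toy bigram
# (Markov) models purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
V = 8   # toy vocabulary size (assumption for the sketch)
T = 12  # fixed sequence length (assumption for the sketch)

def random_bigram_model(concentration):
    """Initial distribution plus a row-stochastic transition matrix."""
    init = rng.dirichlet(np.full(V, concentration))
    trans = rng.dirichlet(np.full(V, concentration), size=V)
    return init, trans

def sample_sequence(model):
    init, trans = model
    seq = [rng.choice(V, p=init)]
    for _ in range(T - 1):
        seq.append(rng.choice(V, p=trans[seq[-1]]))
    return seq

def sequence_log_prob(model, seq):
    init, trans = model
    lp = np.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += np.log(trans[prev][cur])
    return lp

# "True" artificial language and an imperfect estimator standing in for a trained LM.
true_lang = random_bigram_model(concentration=0.3)
approx_lm = random_bigram_model(concentration=0.3)  # in the paper: an LM trained on samples

samples = [sample_sequence(true_lang) for _ in range(5000)]
true_lp = np.array([sequence_log_prob(true_lang, s) for s in samples])
est_lp = np.array([sequence_log_prob(approx_lm, s) for s in samples])

# Bin sequences by true log-probability and report the mean estimation error per
# bin; systematic underestimation in the low-probability bins is the pattern the
# paper reports for neural LMs.
bins = np.quantile(true_lp, np.linspace(0, 1, 5))
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (true_lp >= lo) & (true_lp <= hi)
    gap = (est_lp[mask] - true_lp[mask]).mean()
    print(f"true log-prob in [{lo:6.1f}, {hi:6.1f}]: mean (est - true) = {gap:6.2f}")
```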
Related papers
- Estimating the Probabilities of Rare Outputs in Language Models [8.585890569162267]
We study low probability estimation in the context of argmax sampling from small transformer language models.
We find that importance sampling outperforms activation extrapolation, but both outperform naive sampling (a toy illustration of this contrast appears after this list).
We argue that new methods for low probability estimation are needed to provide stronger guarantees about worst-case performance.
arXiv Detail & Related papers (2024-10-17T04:31:18Z)
- On Uncertainty In Natural Language Processing [2.5076643086429993]
This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective.
We propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction.
Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors.
arXiv Detail & Related papers (2024-10-04T14:08:02Z)
- What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages [78.1866280652834]
Large language models (LMs) are distributions over strings.
We investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs.
We find that the RLM rank is a strong and significant predictor of learnability for both RNNs and Transformers.
arXiv Detail & Related papers (2024-06-06T17:34:24Z)
- Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimization method.
We develop practical bounds to apply total variation distance (TVD) to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z)
- A Natural Bias for Language Generation Models [31.44752136404971]
We show that we can endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge.
We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and, perhaps most importantly, (iii) appears to disentangle strong frequency effects.
arXiv Detail & Related papers (2022-12-19T18:14:36Z)
- Out-of-Distribution Detection and Selective Generation for Conditional Language Models [40.15896981028647]
Conditional language models (CLMs) are predominantly trained to classify the next token in an output sequence.
We present a highly accurate and lightweight OOD detection method for CLMs.
We show how our method can be used under the common and realistic setting of distribution shift for selective generation of high-quality outputs.
arXiv Detail & Related papers (2022-09-30T16:17:11Z)
- On the probability-quality paradox in language generation [76.69397802617064]
We analyze language generation through an information-theoretic lens.
We posit that human-like language should contain an amount of information close to the entropy of the distribution over natural strings.
arXiv Detail & Related papers (2022-03-31T17:43:53Z)
- Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality.
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
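The contrast noted in the "Estimating the Probabilities of Rare Outputs in Language Models" entry above, naive sampling versus importance sampling for very small probabilities, can be seen in a generic toy example. The sketch below is not that paper's method (which targets rare argmax outputs of transformer LMs); it only shows why reweighted draws from a proposal concentrated on the rare region give a usable estimate where naive Monte Carlo typically returns zero. The Gaussian tail event and the threshold of 4 are assumptions for illustration.

```python
# Generic illustration (not that paper's setup) of importance sampling versus
# naive Monte Carlo for a rare event: a standard normal exceeding 4, whose true
# probability is about 3.2e-5.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
threshold, n = 4.0, 20_000

# Naive Monte Carlo: almost no draws reach the rare region, so the estimate is
# usually zero or wildly noisy at this sample size.
x = rng.standard_normal(n)
naive_est = np.mean(x > threshold)

# Importance sampling: draw from a proposal centred on the rare region and
# reweight each draw by the density ratio p(x)/q(x).
y = rng.normal(loc=threshold, scale=1.0, size=n)
weights = norm.pdf(y) / norm.pdf(y, loc=threshold, scale=1.0)
is_est = np.mean((y > threshold) * weights)

print(f"true       : {norm.sf(threshold):.3e}")
print(f"naive MC   : {naive_est:.3e}")
print(f"importance : {is_est:.3e}")
```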