Evaluating Distributional Distortion in Neural Language Modeling
- URL: http://arxiv.org/abs/2203.12788v1
- Date: Thu, 24 Mar 2022 01:09:46 GMT
- Title: Evaluating Distributional Distortion in Neural Language Modeling
- Authors: Benjamin LeBrun, Alessandro Sordoni, Timothy J. O'Donnell
- Abstract summary: A heavy tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
- Score: 81.83408583979745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A fundamental characteristic of natural language is the high rate at which
speakers produce novel expressions. Because of this novelty, a heavy tail of
rare events accounts for a significant amount of the total probability mass of
distributions in language (Baayen, 2001). Standard language modeling metrics
such as perplexity quantify the performance of language models (LMs) in
aggregate. As a result, we have relatively little understanding of whether
neural LMs accurately estimate the probability of sequences in this heavy tail
of rare events. To address this gap, we develop a controlled evaluation scheme
which uses generative models trained on natural data as artificial languages
from which we can exactly compute sequence probabilities. Training LMs on
generations from these artificial languages, we compare the sequence-level
probability estimates given by LMs to the true probabilities in the target
language. Our experiments reveal that LSTM and Transformer language models (i)
systematically underestimate the probability of sequences drawn from the target
language, and (ii) do so more severely for less-probable sequences.
Investigating where this probability mass went, (iii) we find that LMs tend to
overestimate the probability of ill-formed (perturbed) sequences. In addition,
we find that this underestimation behaviour (iv) is weakened, but not
eliminated, by greater amounts of training data, and (v) is exacerbated for
target distributions with lower entropy.
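The evaluation scheme described in the abstract can be made concrete with a small sketch. The snippet below is a minimal illustration, not the authors' code: it treats a known generative model as the artificial language, so every sampled sequence has an exactly computable true probability, and compares a second model's estimates against those true values, binned by true probability. The toy bigram models, vocabulary size, and sequence length are assumptions for illustration only; the paper trains LSTM and Transformer LMs on samples from neural generative models fitted to natural data.

```python
# Minimal sketch (not the authors' code) of the controlled evaluation idea:
# a known generative model serves as the "artificial language", so every sampled
# sequence has an exactly computable probability, and a second model's estimates
# are compared against those true values. Both models here are toy bigram
# (Markov) models purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
V = 8   # toy vocabulary size (assumption for the sketch)
T = 12  # fixed sequence length (assumption for the sketch)

def random_bigram_model(concentration):
    """Initial distribution plus a row-stochastic transition matrix."""
    init = rng.dirichlet(np.full(V, concentration))
    trans = rng.dirichlet(np.full(V, concentration), size=V)
    return init, trans

def sample_sequence(model):
    init, trans = model
    seq = [rng.choice(V, p=init)]
    for _ in range(T - 1):
        seq.append(rng.choice(V, p=trans[seq[-1]]))
    return seq

def sequence_log_prob(model, seq):
    init, trans = model
    lp = np.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += np.log(trans[prev][cur])
    return lp

# "True" artificial language and an imperfect estimator standing in for a trained LM.
true_lang = random_bigram_model(concentration=0.3)
approx_lm = random_bigram_model(concentration=0.3)  # in the paper: an LM trained on samples

samples = [sample_sequence(true_lang) for _ in range(5000)]
true_lp = np.array([sequence_log_prob(true_lang, s) for s in samples])
est_lp = np.array([sequence_log_prob(approx_lm, s) for s in samples])

# Bin sequences by true log-probability and report the mean estimation error per
# bin; systematic underestimation in the low-probability bins is the pattern the
# paper reports for neural LMs.
bins = np.quantile(true_lp, np.linspace(0, 1, 5))
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (true_lp >= lo) & (true_lp <= hi)
    gap = (est_lp[mask] - true_lp[mask]).mean()
    print(f"true log-prob in [{lo:6.1f}, {hi:6.1f}]: mean (est - true) = {gap:6.2f}")
```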
Related papers
- Estimating the Probabilities of Rare Outputs in Language Models [8.585890569162267]
We study low probability estimation in the context of argmax sampling from small transformer language models.
We find that importance sampling outperforms activation extrapolation, but both outperform naive sampling (a toy illustration of this contrast appears after this list).
We argue that new methods for low probability estimation are needed to provide stronger guarantees about worst-case performance.
arXiv Detail & Related papers (2024-10-17T04:31:18Z)
- On Uncertainty In Natural Language Processing [2.5076643086429993]
This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective.
We propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction.
Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors.
arXiv Detail & Related papers (2024-10-04T14:08:02Z)
- What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages [78.1866280652834]
Large language models (LMs) are distributions over strings.
We investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs.
We find that the RLM rank is a strong and significant predictor of learnability for both RNNs and Transformers.
arXiv Detail & Related papers (2024-06-06T17:34:24Z)
- Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimization method.
We develop practical bounds to apply total variation distance (TVD) to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z)
- A Natural Bias for Language Generation Models [31.44752136404971]
We show that we can endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge.
We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and, perhaps most importantly, (iii) appears to disentangle strong frequency effects.
arXiv Detail & Related papers (2022-12-19T18:14:36Z)
- Out-of-Distribution Detection and Selective Generation for Conditional Language Models [40.15896981028647]
Conditional language models (CLMs) are predominantly trained to classify the next token in an output sequence.
We present a highly accurate and lightweight OOD detection method for CLMs.
We show how our method can be used under the common and realistic setting of distribution shift for selective generation of high-quality outputs.
arXiv Detail & Related papers (2022-09-30T16:17:11Z)
- On the probability-quality paradox in language generation [76.69397802617064]
We analyze language generation through an information-theoretic lens.
We posit that human-like language should contain an amount of information close to the entropy of the distribution over natural strings.
arXiv Detail & Related papers (2022-03-31T17:43:53Z)
- Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality.
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
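The contrast noted in the "Estimating the Probabilities of Rare Outputs in Language Models" entry above, naive sampling versus importance sampling for very small probabilities, can be seen in a generic toy example. The sketch below is not that paper's method (which targets rare argmax outputs of transformer LMs); it only shows why reweighted draws from a proposal concentrated on the rare region give a usable estimate where naive Monte Carlo typically returns zero. The Gaussian tail event and the threshold of 4 are assumptions for illustration.

```python
# Generic illustration (not that paper's setup) of importance sampling versus
# naive Monte Carlo for a rare event: a standard normal exceeding 4, whose true
# probability is about 3.2e-5.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
threshold, n = 4.0, 20_000

# Naive Monte Carlo: almost no draws reach the rare region, so the estimate is
# usually zero or wildly noisy at this sample size.
x = rng.standard_normal(n)
naive_est = np.mean(x > threshold)

# Importance sampling: draw from a proposal centred on the rare region and
# reweight each draw by the density ratio p(x)/q(x).
y = rng.normal(loc=threshold, scale=1.0, size=n)
weights = norm.pdf(y) / norm.pdf(y, loc=threshold, scale=1.0)
is_est = np.mean((y > threshold) * weights)

print(f"true       : {norm.sf(threshold):.3e}")
print(f"naive MC   : {naive_est:.3e}")
print(f"importance : {is_est:.3e}")
```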