Why Does Surprisal From Larger Transformer-Based Language Models Provide
a Poorer Fit to Human Reading Times?
- URL: http://arxiv.org/abs/2212.12131v1
- Date: Fri, 23 Dec 2022 03:57:54 GMT
- Title: Why Does Surprisal From Larger Transformer-Based Language Models Provide
a Poorer Fit to Human Reading Times?
- Authors: Byung-Doh Oh, William Schuler
- Abstract summary: These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations.
- Score: 9.909170013118775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents a detailed linguistic analysis into why larger
Transformer-based pre-trained language models with more parameters and lower
perplexity nonetheless yield surprisal estimates that are less predictive of
human reading times. First, regression analyses show a strictly monotonic,
positive log-linear relationship between perplexity and fit to reading times
for the more recently released five GPT-Neo variants and eight OPT variants on
two separate datasets, replicating earlier results limited to just GPT-2 (Oh et
al., 2022). Subsequently, analysis of residual errors reveals a systematic
deviation of the larger variants, such as underpredicting reading times of
named entities and making compensatory overpredictions for reading times of
function words such as modals and conjunctions. These results suggest that the
propensity of larger Transformer-based models to 'memorize' sequences during
training makes their surprisal estimates diverge from humanlike expectations,
which warrants caution in using pre-trained language models to study human
language processing.
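As a rough illustration of the quantity the paper studies (this is not the authors' pipeline), the sketch below computes per-token surprisal from a small pre-trained causal language model via the Hugging Face transformers library. The model name and example sentence are placeholders; the paper's actual analyses enter word-level surprisal from GPT-Neo and OPT variants into regression models fit to self-paced reading and eye-tracking data.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125m"  # placeholder: any GPT-Neo or OPT variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def token_surprisals(text):
    """Return (token, surprisal in bits) for each token given its left context."""
    ids = tokenizer(text, return_tensors="pt").input_ids   # shape: [1, seq_len]
    with torch.no_grad():
        logits = model(ids).logits                          # shape: [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    pairs = []
    for i in range(1, ids.shape[1]):
        lp = log_probs[0, i - 1, ids[0, i]]                 # log P(w_i | w_<i), in nats
        pairs.append((tokenizer.decode(int(ids[0, i])), -lp.item() / math.log(2)))
    return pairs

if __name__ == "__main__":
    for tok, s in token_surprisals("The committee adjourned the meeting."):
        print(f"{tok!r}\t{s:.2f} bits")
```

Word-level surprisals (summing over sub-word pieces) would then serve as predictors of reading times in the regression analyses the abstract describes, with fit measured by the improvement in regression likelihood.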
Related papers
- Reverse-Engineering the Reader [43.26660964074272]
We introduce a novel alignment technique in which we fine-tune a language model to implicitly optimize the parameters of a linear regressor.
Using words' reading times as a test case, we evaluate our technique across multiple model sizes and datasets.
We find an inverse relationship between psychometric power and a model's performance on downstream NLP tasks as well as its perplexity on held-out test data.
arXiv Detail & Related papers (2024-10-16T23:05:01Z)
- Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times [15.738530737312335]
Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades.
This paper presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends.
The results indicate that Transformer-based language models' surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.
arXiv Detail & Related papers (2024-02-03T20:22:54Z)
- Humans and language models diverge when predicting repeating text [52.03471802608112]
We present a scenario in which the performance of humans and LMs diverges.
Human and GPT-2 LM predictions are strongly aligned in the first presentation of a text span, but their performance quickly diverges when memory begins to play a role.
We hope that this scenario will spur future work in bringing LMs closer to human behavior.
arXiv Detail & Related papers (2023-10-10T08:24:28Z)
- Structured Radial Basis Function Network: Modelling Diversity for Multiple Hypotheses Prediction [51.82628081279621]
Multi-modal regression is important for forecasting nonstationary processes or processes governed by a complex mixture of distributions.
A Structured Radial Basis Function Network is presented as an ensemble of multiple hypotheses predictors for regression problems.
It is proved that this structured model can efficiently interpolate the resulting tessellation and approximate the multiple hypotheses target distribution.
arXiv Detail & Related papers (2023-09-02T01:27:53Z)
- Token-wise Decomposition of Autoregressive Language Model Hidden States for Analyzing Model Predictions [9.909170013118775]
This work presents a linear decomposition of final hidden states from autoregressive language models based on each initial input token.
Using the change in next-word probability as a measure of importance, this work first examines which context words make the biggest contribution to language model predictions.
arXiv Detail & Related papers (2023-05-17T23:55:32Z)
- Transformer-Based Language Model Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens [17.80735287413141]
We evaluate surprisal estimates from Transformer-based language model variants on their ability to predict human reading times.
Results show that surprisal estimates from most variants with contemporary model capacities provide the best fit after seeing about two billion training tokens.
Newly-trained smaller model variants reveal a 'tipping point' at convergence, after which the decrease in language model perplexity begins to result in poorer fits to human reading times.
arXiv Detail & Related papers (2023-04-22T12:50:49Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Multilingual Language Models Predict Human Reading Behavior [8.830621849672108]
We compare the performance of language-specific and multilingual pretrained transformer models to predict reading time measures.
We find that BERT and XLM models successfully predict a range of eye tracking features.
In a series of experiments, we analyze the cross-domain and cross-language abilities of these models and show how they reflect human sentence processing.
arXiv Detail & Related papers (2021-04-12T13:03:49Z)
- Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z)