Frequency Explains the Inverse Correlation of Large Language Models'
Size, Training Data Amount, and Surprisal's Fit to Reading Times
- URL: http://arxiv.org/abs/2402.02255v1
- Date: Sat, 3 Feb 2024 20:22:54 GMT
- Title: Frequency Explains the Inverse Correlation of Large Language Models'
Size, Training Data Amount, and Surprisal's Fit to Reading Times
- Authors: Byung-Doh Oh, Shisen Yue, William Schuler
- Abstract summary: Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades.
This paper presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends.
The results indicate that Transformer-based language models' surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.
- Score: 15.738530737312335
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have shown that as Transformer-based language models become
larger and are trained on very large amounts of data, the fit of their
surprisal estimates to naturalistic human reading times degrades. The current
work presents a series of analyses showing that word frequency is a key
explanatory factor underlying these two trends. First, residual errors from
four language model families on four corpora show that the inverse correlation
between model size and fit to reading times is the strongest on the subset of
least frequent words, which is driven by excessively accurate predictions of
larger model variants. Additionally, training dynamics reveal that during later
training steps, all model variants learn to predict rare words and that larger
model variants do so more accurately, which explains the detrimental effect of
both training data amount and model size on fit to reading times. Finally, a
feature attribution analysis demonstrates that larger model variants are able
to accurately predict rare words based on both an effectively longer context
window size as well as stronger local associations compared to smaller model
variants. Taken together, these results indicate that Transformer-based
language models' surprisal estimates diverge from human-like expectations due
to the superhumanly complex associations they learn for predicting rare words.
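
For readers unfamiliar with the measure at the heart of these analyses: the surprisal of a word is its negative log probability given the preceding context, $-\log_2 P(w_t \mid w_{<t})$, and fit to reading times is typically assessed by regressing reading times on surprisal and inspecting the residual errors. The sketch below is a minimal illustration of that pipeline, not the paper's own setup: GPT-2 via the Hugging Face transformers library stands in for the four model families, and the frequency-binned residual analysis runs on synthetic data rather than the four reading-time corpora.

```python
# Minimal sketch (assumptions: GPT-2 as a stand-in model, synthetic reading
# times and frequencies): per-token surprisal from a causal LM, then residual
# errors from a surprisal-reading time regression binned by word frequency.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_surprisals(text):
    """Return (token, surprisal in bits) pairs for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t-1 score the token at position t, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    surprisal = -log_probs[torch.arange(targets.numel()), targets] / np.log(2)
    return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()),
                    surprisal.tolist()))

print(token_surprisals("The children saw the rare okapi at the zoo."))

# Toy frequency-binned residual analysis on synthetic data: regress reading
# times on surprisal, then compare mean squared residuals for rare vs.
# frequent words (the paper's corpora and regression models are not used here).
rng = np.random.default_rng(0)
surprisal = rng.uniform(2, 20, 500)        # hypothetical per-word surprisal
log_freq = rng.uniform(0, 7, 500)          # hypothetical log word frequency
reading_time = 200 + 8 * surprisal + rng.normal(0, 30, 500)

slope, intercept = np.polyfit(surprisal, reading_time, 1)
residuals = reading_time - (slope * surprisal + intercept)
is_rare = log_freq < np.median(log_freq)
print("MSE, rare words:    ", float(np.mean(residuals[is_rare] ** 2)))
print("MSE, frequent words:", float(np.mean(residuals[~is_rare] ** 2)))
```

In the paper's setting, the residuals come from regression models fit on the four reading-time corpora, and the rare-word bin is where the inverse correlation with model size is strongest, driven by the larger variants' excessively accurate predictions of infrequent words.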
Related papers
- Transformer-Based Language Model Surprisal Predicts Human Reading Times
Best with About Two Billion Training Tokens [17.80735287413141]
We evaluate surprisal estimates from Transformer-based language model variants on their ability to predict human reading times.
Results show that surprisal estimates from most variants with contemporary model capacities provide the best fit after seeing about two billion training tokens.
Newly trained smaller model variants reveal a 'tipping point' at convergence, after which further decreases in language model perplexity begin to result in poorer fits to human reading times (a minimal synthetic illustration of this pattern appears after this list).
arXiv Detail & Related papers (2023-04-22T12:50:49Z)
- Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times? [9.909170013118775]
These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations.
arXiv Detail & Related papers (2022-12-23T03:57:54Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers [0.6091702876917281]
We focus on 'few'-type quantifiers, as in 'few children like toys', which might pose a particular challenge for language models.
We present 960 English sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes.
arXiv Detail & Related papers (2022-12-16T20:01:22Z)
- Lexical Generalization Improves with Larger Models and Longer Training [42.024050065980845]
We analyze the use of lexical overlaps in natural language inference, paraphrase detection, and reading comprehension.
We find that larger models are much less susceptible to adopting lexical overlaps.
arXiv Detail & Related papers (2022-10-23T09:20:11Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) were devised long ago to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher [83.98181046650664]
We present an analysis of Transformer-based language model performance across a wide range of model scales.
Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language.
We discuss the application of language models to AI safety and the mitigation of downstream harms.
arXiv Detail & Related papers (2021-12-08T19:41:47Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z)
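
Two of the related papers above (the two-billion-token study and the training-trajectories analysis) track how reading-time fit evolves as perplexity falls during training. The sketch below, referenced from the two-billion-token entry, illustrates that comparison with entirely synthetic numbers; the checkpoint labels, simulated surprisals, and correlation-based fit measure are assumptions for illustration, not the setups or results of those papers.

```python
# Minimal sketch, synthetic data only: compare perplexity and reading-time fit
# across hypothetical training checkpoints. The "late" checkpoint predicts rare
# words "too well", mimicking the superhumanly accurate rare-word predictions
# described in the main abstract, so its fit degrades despite lower perplexity.
import numpy as np

rng = np.random.default_rng(1)
n_words = 400

# Shared word-level predictability signal (surprisal in bits) and a rare-word flag.
base = rng.normal(8.0, 2.5, n_words)
rare = rng.random(n_words) < 0.2            # pretend 20% of words are rare

checkpoints = {
    "early":  base + rng.normal(2.0, 2.0, n_words),   # noisy, high perplexity
    "middle": base,                                   # human-like by construction
    "late":   np.where(rare, 0.3 * base, base),       # rare words predicted too well
}

# Simulated reading times generated from the human-like (middle) surprisal.
reading_time = 200 + 8 * base + rng.normal(0, 25, n_words)

for name, surprisal in checkpoints.items():
    perplexity = 2 ** surprisal.mean()                # per-word perplexity
    fit = np.corrcoef(surprisal, reading_time)[0, 1]  # crude fit measure
    print(f"{name:>6}: perplexity = {perplexity:7.1f}   corr with RT = {fit:.3f}")
```

By construction, fit improves from the early to the middle checkpoint and then degrades at the late one even though perplexity keeps falling, which is the qualitative "tipping point" pattern those studies report.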
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.