Transformer-Based Language Model Surprisal Predicts Human Reading Times
Best with About Two Billion Training Tokens
- URL: http://arxiv.org/abs/2304.11389v2
- Date: Sun, 22 Oct 2023 20:03:54 GMT
- Title: Transformer-Based Language Model Surprisal Predicts Human Reading Times
Best with About Two Billion Training Tokens
- Authors: Byung-Doh Oh, William Schuler
- Abstract summary: We evaluate surprisal estimates from Transformer-based language model variants on their ability to predict human reading times.
Results show that surprisal estimates from most variants with contemporary model capacities provide the best fit after seeing about two billion training tokens.
Newly-trained smaller model variants reveal a 'tipping point' at convergence, after which the decrease in language model perplexity begins to result in poorer fits to human reading times.
- Score: 17.80735287413141
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent psycholinguistic studies have drawn conflicting conclusions about the
relationship between the quality of a language model and the ability of its
surprisal estimates to predict human reading times, which has been speculated
to be due to the large gap in both the amount of training data and model
capacity across studies. The current work aims to consolidate these findings by
evaluating surprisal estimates from Transformer-based language model variants
that vary systematically in the amount of training data and model capacity on
their ability to predict human reading times. The results show that surprisal
estimates from most variants with contemporary model capacities provide the
best fit after seeing about two billion training tokens, after which they begin
to diverge from humanlike expectations. Additionally, newly-trained smaller
model variants reveal a 'tipping point' at convergence, after which the
decrease in language model perplexity begins to result in poorer fits to human
reading times. These results suggest that the massive amount of training data
is mainly responsible for the poorer fit achieved by surprisal from larger
pre-trained language models, and that a certain degree of model capacity is
necessary for Transformer-based language models to capture humanlike
expectations.
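As a rough illustration of the two quantities the abstract relates: surprisal is conventionally the negative log probability a language model assigns to a word given its context, and perplexity is the exponentiated mean surprisal over a text. The sketch below uses made-up probabilities and hypothetical function names (`surprisal_bits`, `perplexity`). It is not the authors' evaluation pipeline, which fits regression models to reading-time data.

```python
import math


def surprisal_bits(prob):
    """Surprisal in bits: -log2 P(token | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def perplexity(probs):
    """Perplexity as 2 raised to the mean per-token surprisal (in bits)."""
    if not probs:
        raise ValueError("need at least one token probability")
    mean_surprisal = sum(surprisal_bits(p) for p in probs) / len(probs)
    return 2.0 ** mean_surprisal


# Made-up conditional probabilities for a four-token sequence (illustration only).
toy_probs = [0.5, 0.25, 0.125, 0.125]
toy_surprisals = [surprisal_bits(p) for p in toy_probs]
toy_ppl = perplexity(toy_probs)
```

Lower perplexity means the model predicts the text better on average. The abstract's finding is that, past a certain point, this improvement no longer translates into a better fit to human reading times.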
Related papers
- Reverse-Engineering the Reader [43.26660964074272]
We introduce a novel alignment technique in which we fine-tune a language model to implicitly optimize the parameters of a linear regressor.
Using words as a test case, we evaluate our technique across multiple model sizes and datasets.
We find an inverse relationship between psychometric power and a model's performance on downstream NLP tasks as well as its perplexity on held-out test data.
arXiv Detail & Related papers (2024-10-16T23:05:01Z)
- Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times [15.738530737312335]
Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades.
This paper presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends.
The results indicate that Transformer-based language models' surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.
arXiv Detail & Related papers (2024-02-03T20:22:54Z)
- Can training neural language models on a curriculum with developmentally plausible data improve alignment with human reading behavior? [0.2745342790938508]
This paper explores the extent to which the misalignment between empirical and model-predicted behavior can be minimized by training models on more developmentally plausible data.
We trained teacher language models on the BabyLM "strict-small" dataset and used sentence-level surprisal estimates from these teacher models to create a curriculum.
We found tentative evidence that our curriculum made it easier for models to acquire linguistic knowledge from the training data.
arXiv Detail & Related papers (2023-11-30T18:03:58Z)
- Humans and language models diverge when predicting repeating text [52.03471802608112]
We present a scenario in which the performance of humans and LMs diverges.
Human and GPT-2 LM predictions are strongly aligned in the first presentation of a text span, but their performance quickly diverges when memory begins to play a role.
We hope that this scenario will spur future work in bringing LMs closer to human behavior.
arXiv Detail & Related papers (2023-10-10T08:24:28Z)
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
- Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times? [9.909170013118775]
The propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations.
arXiv Detail & Related papers (2022-12-23T03:57:54Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Dependency-based Mixture Language Models [53.152011258252315]
We introduce the Dependency-based Mixture Language Models.
In detail, we first train neural language models with a novel dependency modeling objective.
We then formulate the next-token probability by mixing the previous dependency modeling probability distributions with self-attention.
arXiv Detail & Related papers (2022-03-19T06:28:30Z)
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher [83.98181046650664]
We present an analysis of Transformer-based language model performance across a wide range of model scales.
Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language.
We discuss the application of language models to AI safety and the mitigation of downstream harms.
arXiv Detail & Related papers (2021-12-08T19:41:47Z)
- Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
To our great astonishment, we uncover that pre-training on certain non-human language data gives GLUE performance close to that of models pre-trained on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
- Probabilistic Predictions of People Perusing: Evaluating Metrics of Language Model Performance for Psycholinguistic Modeling [0.8668211481067458]
We re-evaluate a claim due to Goodkind and Bicknell that a language model's ability to model reading times is a linear function of its perplexity.
We show that the proposed relation does not always hold for Long Short-Term Memory networks, Transformers, and pre-trained models.
arXiv Detail & Related papers (2020-09-08T19:12:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.