Related papers: The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage

The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage

URL: http://arxiv.org/abs/2506.01172v1
Date: Sun, 01 Jun 2025 21:12:36 GMT
Title: The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage
Authors: Byung-Doh Oh, Hongao Zhu, William Schuler,
Abstract summary: It has been speculated that data leakage may be due to data leakage that caused language models to see the text stimuli during training.<n>This paper presents two studies to address this concern at scale.<n>The first study reveals relatively little leakage of five naturalistic reading time corpora in two pre-training datasets in terms of length and frequency of token $n$-gram overlap.<n>The second study replicates the negative relationship between language model size and the fit of surprisal to reading times using models trained on 'leakage-free' data that overlaps only minimally with the reading time corpora.
Score: 13.411790077470222
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In psycholinguistic modeling, surprisal from larger pre-trained language models has been shown to be a poorer predictor of naturalistic human reading times. However, it has been speculated that this may be due to data leakage that caused language models to see the text stimuli during training. This paper presents two studies to address this concern at scale. The first study reveals relatively little leakage of five naturalistic reading time corpora in two pre-training datasets in terms of length and frequency of token $n$-gram overlap. The second study replicates the negative relationship between language model size and the fit of surprisal to reading times using models trained on 'leakage-free' data that overlaps only minimally with the reading time corpora. Taken together, this suggests that previous results using language models trained on these corpora are not driven by the effects of data leakage.

Related papers

Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly [9.45662351979314]
Recent work has observed a positive relationship between Transformer-based models' perplexity and the predictive power of their surprisal estimates on reading times.<n>This study evaluates the predictive power of surprisal estimates from 17 pre-trained Transformer-based models across three different language families on brain imaging data.
arXiv Detail & Related papers (2025-06-12T22:18:48Z)
Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times [15.738530737312335]
Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades. This paper presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends. The results indicate that Transformer-based language models' surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.
arXiv Detail & Related papers (2024-02-03T20:22:54Z)
Humans and language models diverge when predicting repeating text [52.03471802608112]
We present a scenario in which the performance of humans and LMs diverges. Human and GPT-2 LM predictions are strongly aligned in the first presentation of a text span, but their performance quickly diverges when memory begins to play a role. We hope that this scenario will spur future work in bringing LMs closer to human behavior.
arXiv Detail & Related papers (2023-10-10T08:24:28Z)
Transformer-Based Language Model Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens [17.80735287413141]
We evaluate surprisal estimates from Transformer-based language model variants on their ability to predict human reading times. Results show that surprisal estimates from most variants with contemporary model capacities provide the best fit after seeing about two billion training tokens. Newly-trained smaller model variants reveal a 'tipping point' at convergence, after which the decrease in language model perplexity begins to result in poorer fits to human reading times.
arXiv Detail & Related papers (2023-04-22T12:50:49Z)
Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times? [9.909170013118775]
The propensity of larger Transformer-based models to'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations. These results suggest that the propensity of larger Transformer-based models to'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations.
arXiv Detail & Related papers (2022-12-23T03:57:54Z)
Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
What do Large Language Models Learn beyond Language? [10.9650651784511]
We find that pretrained models significantly outperform comparable non-pretrained neural models. Experiments surprisingly reveal that the positive effects of pre-training persist even when pretraining on multi-lingual text or computer code. Our findings suggest a hitherto unexplored deep connection between pre-training and inductive learning abilities of language models.
arXiv Detail & Related papers (2022-10-21T23:43:13Z)
Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text. This leads to hundreds of millions of foreign language tokens in large-scale datasets. We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance. We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks. To our great astonishment, we uncover that pre-training on certain non-human language data gives GLUE performance close to performance pre-trained on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict. This work shows a comparison of a neural model and character language models with varying amounts on target language data. Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long. We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay. Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.