Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale
- URL: http://arxiv.org/abs/2510.24963v1
- Date: Tue, 28 Oct 2025 20:51:01 GMT
- Title: Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale
- Authors: James A. Michaelov, Roger P. Levy, Benjamin K. Bergen
- Abstract summary: We show that autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. We find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.
- Score: 6.372261626436675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the $n$-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words' $n$-gram probabilities for increasing $n$ over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.
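To make the word-level analysis described in the abstract concrete, the sketch below (not the authors' code; all inputs are placeholder data) shows one way per-word model log-probabilities could be regressed on the three heuristics named above: unigram probability, $n$-gram probability, and word-context semantic similarity.

```python
# Minimal sketch (assumed methodology, toy data): regress per-word LM
# log-probabilities on three simple predictors -- unigram log-probability,
# n-gram log-probability, and word-context semantic similarity -- and
# report the fraction of variance explained (R^2).
import numpy as np

rng = np.random.default_rng(0)
n_words = 1000  # number of evaluation tokens (placeholder)

# Hypothetical per-word predictors (log-probabilities and cosine similarities).
log_p_unigram = rng.normal(-8.0, 2.0, n_words)   # log unigram frequency
log_p_ngram = rng.normal(-6.0, 2.5, n_words)     # log n-gram probability
sem_sim = rng.uniform(-1.0, 1.0, n_words)        # word-context cosine similarity

# Hypothetical target: the LM's per-word log-probability at some checkpoint.
log_p_model = (0.3 * log_p_unigram + 0.6 * log_p_ngram + 1.5 * sem_sim
               + rng.normal(0.0, 0.5, n_words))

# Ordinary least squares with an intercept term.
X = np.column_stack([np.ones(n_words), log_p_unigram, log_p_ngram, sem_sim])
coef, *_ = np.linalg.lstsq(X, log_p_model, rcond=None)

pred = X @ coef
ss_res = np.sum((log_p_model - pred) ** 2)
ss_tot = np.sum((log_p_model - log_p_model.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot  # variance explained by the three heuristics

print(f"coefficients (intercept, unigram, n-gram, similarity): {coef.round(3)}")
print(f"R^2 = {r_squared:.3f}")
```

In the paper, the predictors would be computed from corpus statistics and embedding similarities for real checkpoints rather than drawn at random; the R^2 printed here reflects only the synthetic data.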
Related papers
- Pretraining Language Models to Ponder in Continuous Space [50.52734567589996]
We introduce a pondering process into language models by repeatedly invoking the forward process within a single token generation step. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
arXiv Detail & Related papers (2025-05-27T03:47:33Z) - A Probability--Quality Trade-off in Aligned Language Models and its Relation to Sampling Adaptors [50.046717886067555]
We show that when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood.
We provide a formal treatment of this phenomenon and demonstrate how a choice of sampling adaptor allows for a selection of how much likelihood we exchange for the reward.
arXiv Detail & Related papers (2024-06-14T17:38:21Z) - Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction [23.45902601618188]
Language models have demonstrated impressive ability in context understanding and generation.
We propose LMTraj (Language-based Multimodal Trajectory predictor), which recasts the trajectory prediction task into a sort of question-answering problem.
We show that the language-based model can be a powerful pedestrian trajectory predictor, and outperforms existing numerical-based predictor methods.
arXiv Detail & Related papers (2024-03-27T11:06:44Z) - Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation [52.270712965271656]
We propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective.
We find that the graph of our model resembles transformers, with correspondences between dependencies and self-attention.
Experiments show that our model performs competitively to transformers on small to medium sized datasets.
arXiv Detail & Related papers (2023-11-26T06:56:02Z) - PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications [9.782175445247127]
PETA trained language models with 14 different vocabulary sizes under three tokenization methods.
It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities.
Experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 detrimentally affect the model's representational performance.
arXiv Detail & Related papers (2023-10-26T14:20:44Z) - Meta predictive learning model of languages in neural circuits [2.5690340428649328]
We propose a mean-field learning model within the predictive coding framework.
Our model reveals that most of the connections become deterministic after learning.
Our model provides a starting point to investigate the connection among brain computation, next-token prediction and general intelligence.
arXiv Detail & Related papers (2023-09-08T03:58:05Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z) - Scaling Language Models: Methods, Analysis & Insights from Training Gopher [83.98181046650664]
We present an analysis of Transformer-based language model performance across a wide range of model scales.
Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language.
We discuss the application of language models to AI safety and the mitigation of downstream harms.
arXiv Detail & Related papers (2021-12-08T19:41:47Z) - Modeling the Unigram Distribution [39.153612297712655]
The unigram distribution is the non-contextual probability of finding a specific word form in a corpus.
We present a novel model for estimating it in a language.
arXiv Detail & Related papers (2021-06-04T07:02:49Z) - Implicit Representations of Meaning in Neural Language Models [31.71898809435222]
We identify contextual word representations that function as models of entities and situations as they evolve throughout a discourse.
Our results indicate that prediction in pretrained neural language models is supported, at least in part, by dynamic representations of meaning and implicit simulation of entity state.
arXiv Detail & Related papers (2021-06-01T19:23:20Z) - Are Some Words Worth More than Others? [3.5598388686985354]
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)