Evidence of Phase Transitions in Small Transformer-Based Language Models
- URL: http://arxiv.org/abs/2511.12768v1
- Date: Sun, 16 Nov 2025 20:37:12 GMT
- Title: Evidence of Phase Transitions in Small Transformer-Based Language Models
- Authors: Noah Hong, Tao Hong
- Abstract summary: Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs). We ask three complementary questions: Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, and occurring surprisingly early as coherence emerges.
- Score: 0.8081305315045554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs), where new capabilities appear abruptly once models surpass critical thresholds of scale. Prior work, such as that of Wei et al., demonstrated these phenomena under model and data scaling, with transitions revealed after applying a log scale to training compute. In this work, we ask three complementary questions: (1) Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? (2) Can such transitions be detected directly in linear training space, rather than only after log rescaling? and (3) Can these transitions emerge at early stages of training? To investigate, we train a small GPT-style transformer on a character-level corpus and analyze the evolution of vocabulary usage throughout training. We track the average word length, the number of correct versus incorrect words, and shifts in vocabulary diversity. Building on these measures, we apply Poisson and sub-Poisson statistics to quantify how words connect and reorganize. This combined analysis reveals a distinct transition point during training. Notably, these transitions are not apparent in standard loss or validation curves, but become visible through our vocabulary- and statistics-based probes. Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, detectable directly in linear training space, and occurring surprisingly early as coherence emerges. This perspective provides new insight into the nonlinear dynamics of language model training and underscores the importance of tailored metrics for uncovering phase transition behaviors.
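The abstract's Poisson versus sub-Poisson probe can be illustrated with a Fano factor, the variance-to-mean ratio of event counts per window: a value near 1 is consistent with Poisson statistics, while values below 1 indicate sub-Poisson (more regular than random) word occurrences. The sketch below is a minimal illustration of that idea, not the authors' actual procedure; the window size, the notion of a "valid word" as dictionary membership, and the function names are assumptions.

```python
# Hedged sketch: a Fano-factor probe for Poisson vs sub-Poisson word statistics.
# The window size and the dictionary-membership criterion are illustrative
# assumptions, not the paper's exact methodology.
from statistics import mean, variance

def count_valid_words_per_window(tokens, vocab, window=50):
    """Count dictionary-valid words in consecutive fixed-size windows
    of generated text (one count per window)."""
    return [
        sum(1 for w in tokens[i:i + window] if w in vocab)
        for i in range(0, len(tokens) - window + 1, window)
    ]

def fano_factor(counts):
    """Variance-to-mean ratio of per-window counts.
    F ~ 1 suggests Poisson statistics; F < 1 suggests sub-Poisson
    (sub-random) regularity in word usage."""
    m = mean(counts)
    return variance(counts) / m if m > 0 else float("nan")
```

Tracking such a statistic across training checkpoints, rather than only the loss curve, is the kind of tailored metric the abstract argues is needed to make the transition visible.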
Related papers
- Evolution of Concepts in Language Model Pre-Training [53.994470178155105]
We track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages.
arXiv Detail & Related papers (2025-09-21T18:53:12Z) - Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics [56.145578792496714]
Large language models (LLMs) struggle with cross-lingual knowledge transfer. We study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets.
arXiv Detail & Related papers (2025-08-14T18:44:13Z) - Hidden Breakthroughs in Language Model Training [9.183934538035562]
This paper argues that similar breakthroughs occur frequently throughout training but are obscured by a loss metric that collapses all variation into a single scalar. We introduce POLCA, a method for decomposing changes in loss along arbitrary bases of the low-rank training subspace. We validate our method on synthetic arithmetic and natural language tasks, showing that POLCA recovers clusters that represent interpretable breakthroughs in the model's capabilities.
arXiv Detail & Related papers (2025-06-18T20:40:16Z) - Echoes of BERT: Do Modern Language Models Rediscover the Classical NLP Pipeline? [4.991808275998526]
Building on classic BERTology work, we analyze 25 models spanning from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1). We find that hierarchical organization persists in modern models, with early layers capturing syntax, middle layers handling semantics and entity-level information, and later layers encoding discourse phenomena. We find that lexical information concentrates linearly in early layers but becomes increasingly nonlinear deeper in the network, while inflectional information remains linearly accessible throughout all layers.
arXiv Detail & Related papers (2025-06-02T18:01:56Z) - How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias [48.9399496805422]
We focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check'. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks.
arXiv Detail & Related papers (2025-05-02T00:07:35Z) - First numerical observation of the Berezinskii-Kosterlitz-Thouless transition in language models [1.4061979259370274]
We numerically demonstrate an unambiguous phase transition in the framework of a natural language model. We identify the phase transition as a variant of the Berezinskii-Kosterlitz-Thouless transition.
arXiv Detail & Related papers (2024-12-02T07:32:32Z) - Unsupervised Representation Learning from Sparse Transformation Analysis [79.94858534887801]
We propose to learn representations from sequence data by factorizing the transformations of the latent variables into sparse components.
Input data are first encoded as distributions of latent activations and subsequently transformed using a probability flow model.
arXiv Detail & Related papers (2024-10-07T23:53:25Z) - In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to a capability of pretrained large language models, which can learn a new task given a few examples during inference.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z) - Phase Transitions in the Output Distribution of Large Language Models [0.9374652839580183]
In a physical system, changing parameters such as temperature can induce a phase transition: an abrupt change from one state of matter to another.
The task of identifying phase transitions requires human analysis and some prior understanding of the system to narrow down which low-dimensional properties to monitor and analyze.
Statistical methods for the automated detection of phase transitions from data have recently been proposed within the physics community.
We quantify distributional changes in the generated output via statistical distances, which can be efficiently estimated with access to the probability distribution over next-tokens.
arXiv Detail & Related papers (2024-05-27T12:04:36Z) - Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training [57.771940716189114]
We show that large language models (LLMs) suffer from the "reversal curse".
The root cause of the reversal curse lies in the different word order between the training and inference stage.
We propose Semantic-aware Permutation Training (SPT) to address this issue.
arXiv Detail & Related papers (2024-03-01T18:55:20Z) - On the Effect of Pre-training for Transformer in Different Modality on Offline Reinforcement Learning [0.0]
We investigate how pre-training on data of different modalities, such as language and vision, affects fine-tuning of Transformer-based models to Mujoco offline reinforcement learning tasks.
arXiv Detail & Related papers (2022-11-17T13:34:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.