Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points
- URL: http://arxiv.org/abs/2508.12837v2
- Date: Tue, 19 Aug 2025 09:36:39 GMT
- Title: Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points
- Authors: Aditya Varre, Gizem Yüce, Nicolas Flammarion
- Abstract summary: We investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss.
- Score: 17.339704162468042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent $k$-gram estimators (for $k \leq n$), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: sub-$n$-grams are near-stationary points of the population cross-entropy loss, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of $n$-grams, characterized by discrete transitions between near-stationary solutions.
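To make the setup concrete, here is a minimal, illustrative sketch (not code from the paper): a sequence is sampled from a randomly drawn $n$-gram source, and add-$\alpha$-smoothed in-context $k$-gram counters, playing the role of the sub-$n$-gram estimators for $k \leq n$, predict each next token from the context observed so far. The vocabulary size, sequence length, Dirichlet prior, and smoothing constant are assumptions chosen only for illustration.

```python
# Minimal illustrative sketch (not the paper's code): in-context k-gram
# estimation on a sequence sampled from a random n-gram source, and the
# average cross-entropy of each sub-n-gram predictor. Vocabulary size,
# sequence length, Dirichlet prior, and smoothing are illustrative choices.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
V, n, T = 4, 3, 2000                      # vocab size, n-gram order, length

# Sample a random n-gram model: one categorical per (n-1)-token context.
contexts = [tuple(map(int, c)) for c in np.ndindex(*(V,) * (n - 1))]
model = {c: rng.dirichlet(np.ones(V)) for c in contexts}

# Generate a single in-context sequence from that model.
seq = [int(x) for x in rng.integers(V, size=n - 1)]
for _ in range(T):
    ctx = tuple(seq[-(n - 1):])
    seq.append(int(rng.choice(V, p=model[ctx])))

def kgram_cross_entropy(seq, k, alpha=1.0):
    """Average log loss of an add-alpha in-context k-gram estimator."""
    counts = defaultdict(lambda: np.full(V, alpha))
    losses = []
    for t in range(k - 1, len(seq)):
        ctx = tuple(seq[t - (k - 1):t])   # empty context when k == 1
        probs = counts[ctx] / counts[ctx].sum()
        losses.append(-np.log(probs[seq[t]]))
        counts[ctx][seq[t]] += 1          # update counts after predicting
    return float(np.mean(losses))

# Sub-n-gram estimators (k < n) level off above the n-gram cross-entropy.
for k in range(1, n + 1):
    print(f"k = {k}: in-context cross-entropy {kgram_cross_entropy(seq, k):.3f}")
```

In runs of this kind, the cross-entropy of the $k$-gram estimators with $k < n$ levels off above that of the full $n$-gram estimator, which mirrors the plateaus between near-stationary solutions described in the abstract.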
Related papers
- Scale-Consistent State-Space Dynamics via Fractal of Stationary Transformations [9.983526161001997]
Recent deep learning models increasingly rely on depth without structural guarantees on the validity of intermediate representations. We address this limitation by formulating a structural requirement for a state-space model's scale-consistent latent dynamics. We empirically verify the predicted scale-consistent behavior, showing that adaptive efficiency emerges from the aligned latent geometry.
arXiv Detail & Related papers (2026-01-27T12:44:20Z) - Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds [0.5729426778193398]
We study the emergence of multi-step reasoning in deep Transformer language models through a geometric and statistical-physics lens. We formalize the forward pass as a discrete coarse-graining map and relate the appearance of stable "concept basins" to fixed points of this renormalization-like dynamics. The resulting low-entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object-like structures in representation space.
arXiv Detail & Related papers (2026-01-16T23:11:02Z) - Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility [90.894232610821]
We analyze Transformers through the lens of rank structure. We show that time-series embeddings exhibit sharply decaying singular value spectra. We prove that the associated $Q/K/V$ projections admit accurate low-rank approximations.
arXiv Detail & Related papers (2025-10-02T23:56:17Z) - Tracing the Representation Geometry of Language Models from Pretraining to Post-training [22.18942718274405]
We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training. We uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. Post-training further transforms geometry: SFT and DPO drive "entropy-seeking" dynamics to integrate specific instructional or preferential data.
arXiv Detail & Related papers (2025-09-27T00:46:29Z) - Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope. We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps. This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models.
arXiv Detail & Related papers (2025-06-28T01:46:36Z) - On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the non-convexity of the optimization landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Relative Representations: Topological and Geometric Perspectives [50.85040046976025]
Relative representations are an established approach to zero-shot model stitching. First, we introduce a normalization procedure in the relative transformation, resulting in invariance to non-isotropic rescalings and permutations. Second, we propose to deploy topological densification when fine-tuning relative representations, a topological regularization loss encouraging clustering within classes.
arXiv Detail & Related papers (2024-09-17T08:09:22Z) - What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding [67.59552859593985]
Graph Transformers, which incorporate self-attention and positional encoding, have emerged as a powerful architecture for various graph learning tasks.
This paper introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised classification.
arXiv Detail & Related papers (2024-06-04T05:30:16Z) - Geometric Dynamics of Signal Propagation Predict Trainability of Transformers [22.25628914395565]
We investigate forward signal propagation and gradient backpropagation in deep, randomly initialized transformers.
Our approach treats the evolution of $n$ tokens as they propagate through the transformer layers.
We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by these two exponents.
arXiv Detail & Related papers (2024-03-05T01:30:34Z) - Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition [2.3249139042158853]
We investigate phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT).
We present supporting theory indicating that the local learning coefficient determines phase transitions in the Bayesian posterior as a function of training sample size.
The picture that emerges adds evidence to the conjecture that the SGD learning trajectory is subject to a sequential learning mechanism.
arXiv Detail & Related papers (2023-10-10T04:26:04Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Regularization, early-stopping and dreaming: a Hopfield-like setup to address generalization and overfitting [0.0]
We look for optimal network parameters by applying gradient descent to a regularized loss function.
Within this framework, the optimal neuron-interaction matrices correspond to Hebbian kernels revised by a reiterated unlearning protocol.
arXiv Detail & Related papers (2023-08-01T15:04:30Z) - Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z) - Segmentation of high dimensional means over multi-dimensional change points and connections to regression trees [1.0660480034605242]
This article provides a new analytically tractable and fully frequentist framework to characterize and implement regression trees.
The connection to regression trees is made by a high dimensional model with dynamic mean vectors over multi-dimensional change axes.
Results are obtained under a high dimensional scaling $s\log^2 p = o(T_wT_h)$, where $p$ is the response dimension, $s$ is a sparsity parameter, and $T_w, T_h$ are sampling periods along change axes.
arXiv Detail & Related papers (2021-05-20T20:29:48Z) - On Long-Tailed Phenomena in Neural Machine Translation [50.65273145888896]
State-of-the-art Neural Machine Translation (NMT) models struggle with generating low-frequency tokens.
We propose a new loss function, the Anti-Focal loss, to better adapt model training to the structural dependencies of conditional text generation.
We show the efficacy of the proposed technique on a number of Machine Translation (MT) datasets, demonstrating that it leads to significant gains over cross-entropy.
arXiv Detail & Related papers (2020-10-10T07:00:57Z) - Mean-field entanglement transitions in random tree tensor networks [0.0]
Entanglement phase transitions in quantum chaotic systems have emerged as a new class of critical points separating phases with different entanglement scaling.
We propose a mean-field theory of such transitions by studying the entanglement properties of random tree tensor networks.
arXiv Detail & Related papers (2020-03-02T19:00:19Z) - Phase Transitions for the Information Bottleneck in Representation
Learning [14.381429281068565]
In the Information Bottleneck (IB), when tuning the relative strength between compression and prediction terms, how do the two terms behave, and what's their relationship with the dataset and the learned representation?
We introduce a definition for IB phase transitions as a qualitative change of the IB loss landscape, and show that the transitions correspond to the onset of learning new classes.
Using second-order calculus of variations, we derive a formula that provides a practical condition for IB phase transitions, and draw its connection with the Fisher information matrix for parameterized models.
arXiv Detail & Related papers (2020-01-07T03:55:32Z)
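For orientation on the last entry above: the compression-prediction trade-off it tunes is the standard Information Bottleneck objective (recalled here as background, not taken from that paper), and the phase transitions are qualitative changes of its optimum as the trade-off parameter $\beta$ varies.

```latex
% Standard Information Bottleneck Lagrangian (background sketch): the encoder
% p(z|x) compresses X into Z while retaining information about Y; beta weighs
% the prediction term against the compression term.
\[
  \min_{p(z \mid x)} \ \mathcal{L}_{\mathrm{IB}} = I(X;Z) - \beta \, I(Z;Y)
\]
```

Small $\beta$ favors maximal compression (a trivial representation); as $\beta$ increases past critical values, new structure in $Y$ becomes worth encoding, which is the sense in which transitions correspond to the onset of learning new classes.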
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.