Tracing the Representation Geometry of Language Models from Pretraining to Post-training
- URL: http://arxiv.org/abs/2509.23024v1
- Date: Sat, 27 Sep 2025 00:46:29 GMT
- Title: Tracing the Representation Geometry of Language Models from Pretraining to Post-training
- Authors: Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Adam Santoro, Guillaume Lajoie, Blake A. Richards
- Abstract summary: We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training. We uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. Post-training further transforms geometry: SFT and DPO drive "entropy-seeking" dynamics to integrate specific instructional or preferential data.
- Score: 22.18942718274405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard training metrics like loss fail to explain the emergence of complex capabilities in large language models. We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training, measuring effective rank (RankMe) and eigenspectrum decay ($\alpha$-ReQ). With OLMo (1B-7B) and Pythia (160M-12B) models, we uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. The initial "warmup" phase exhibits rapid representational collapse. This is followed by an "entropy-seeking" phase, where the manifold's dimensionality expands substantially, coinciding with peak n-gram memorization. Subsequently, a "compression-seeking" phase imposes anisotropic consolidation, selectively preserving variance along dominant eigendirections while contracting others, a transition marked by a significant improvement in downstream task performance. We show these phases can emerge from a fundamental interplay of cross-entropy optimization under skewed token frequencies and representational bottlenecks ($d \ll |V|$). Post-training further transforms geometry: SFT and DPO drive "entropy-seeking" dynamics to integrate specific instructional or preferential data, improving in-distribution performance while degrading out-of-distribution robustness. Conversely, RLVR induces "compression-seeking", enhancing reward alignment but reducing generation diversity.
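The two diagnostics named in the abstract have standard definitions in the literature: RankMe is the exponentiated Shannon entropy of the normalized singular values of a feature matrix, and $\alpha$-ReQ fits a power law $\lambda_k \propto k^{-\alpha}$ to the eigenspectrum of the feature covariance. A minimal sketch of both, assuming centered activations and a plain log-log least-squares fit (the paper's exact preprocessing and fitting range are not specified here):

```python
# Sketch of the two spectral diagnostics: RankMe (effective rank via
# singular-value entropy) and an alpha-ReQ-style power-law fit to the
# covariance eigenspectrum. Which layer's activations to use, centering,
# and the fitting range are assumptions, not taken from the paper.
import numpy as np

def rankme(Z: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a (num_tokens, hidden_dim) feature matrix."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / (s.sum() + eps) + eps                  # normalized singular values
    return float(np.exp(-(p * np.log(p)).sum()))   # exp of Shannon entropy

def alpha_req(Z: np.ndarray, k_min: int = 1, k_max: int | None = None) -> float:
    """Decay exponent alpha of the eigenspectrum, lambda_k ~ k^(-alpha)."""
    Zc = Z - Z.mean(axis=0, keepdims=True)               # center features
    eig = np.linalg.eigvalsh(Zc.T @ Zc / len(Zc))[::-1]  # descending eigenvalues
    k_max = k_max or len(eig)
    ks = np.arange(k_min, k_max + 1)
    # least-squares fit in log-log space; -slope is the decay exponent
    slope, _ = np.polyfit(np.log(ks), np.log(eig[k_min - 1:k_max] + 1e-12), 1)
    return float(-slope)

# Toy usage: stand-in for hidden states sampled at a training checkpoint.
Z = np.random.randn(4096, 512)
print(rankme(Z), alpha_req(Z, k_max=256))
```

Tracking these two numbers across checkpoints is what surfaces the warmup, "entropy-seeking", and "compression-seeking" phases the abstract describes.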
Related papers
- Structural Action Transformer for 3D Dexterous Manipulation [80.07649565189035]
Cross-embodiment skill transfer is a challenge for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations. This paper proposes a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective.
arXiv Detail & Related papers (2026-03-04T11:38:12Z) - Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation [56.361076943802594]
CanonFlow achieves state-of-the-art performance on the challenging GEOM-DRUG dataset, and the advantage remains large in few-step generation.
arXiv Detail & Related papers (2026-02-16T18:58:55Z) - From Core to Detail: Unsupervised Disentanglement with Entropy-Ordered Flows [8.351253396371686]
Entropy-ordered flows (EOFlows) order latent dimensions by their explained entropy, analogous to PCA's explained variance. EOFlows build on insights from Independent Mechanism Analysis, Principal Component Flows and Manifold Entropic Metrics. We combine likelihood-based training with local Jacobian regularization and noise augmentation into a method that scales well to high-dimensional data such as images.
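For context on the analogy in this entry, PCA's explained-variance ordering looks like the sketch below; this illustrates only the classical baseline that EOFlows' entropy ordering is compared to, not EOFlows itself:

```python
# PCA explained variance: principal directions sorted by the fraction of
# total variance each explains. Toy data; not the EOFlows method.
import numpy as np

X = np.random.randn(1000, 16) @ np.random.randn(16, 16)  # correlated toy data
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(Xc))[::-1]  # descending variances
explained = eigvals / eigvals.sum()
print(explained[:4])  # leading dimensions carry most of the variance
```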
arXiv Detail & Related papers (2026-02-06T18:41:03Z) - Scale-Consistent State-Space Dynamics via Fractal of Stationary Transformations [9.983526161001997]
Recent deep learning models increasingly rely on depth without structural guarantees on the validity of intermediate representations. We address this limitation by formulating a structural requirement for a state-space model's scale-consistent latent dynamics. We empirically verify the predicted scale-consistent behavior, showing that adaptive efficiency emerges from the aligned latent geometry.
arXiv Detail & Related papers (2026-01-27T12:44:20Z) - Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds [0.5729426778193398]
We study the emergence of multi-step reasoning in deep Transformer language models through a geometric and statistical-physics lens. We formalize the forward pass as a discrete coarse-graining map and relate the appearance of stable "concept basins" to fixed points of this renormalization-like dynamics. The resulting low-entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object-like structures in representation space.
arXiv Detail & Related papers (2026-01-16T23:11:02Z) - Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds [0.4779196219827507]
We show how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an advantage-based routing law for attention scores. We show that this coupled specialization behaves like a two-timescale EM procedure.
arXiv Detail & Related papers (2025-12-27T05:31:44Z) - OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. The resulting discrete tokenization shortens the training sequence by $6.8\times$ and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z) - PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction [87.33016661440202]
Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. We propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions. Experiments on ShapeNet show that PointNSP establishes state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm.
arXiv Detail & Related papers (2025-10-07T06:31:02Z) - Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss [52.28880405119483]
Unsupervised online 3D instance segmentation is a fundamental yet challenging task. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation.
arXiv Detail & Related papers (2025-09-27T08:53:27Z) - Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points [17.339704162468042]
We investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss.
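As a rough illustration of the task family this entry studies (a toy construction, not the paper's formal setup), an in-context $n$-gram predictor estimates next-token probabilities from counts inside the prompt itself:

```python
# Toy in-context n-gram prediction: P(next token | last n-1 tokens),
# estimated purely from occurrences within the given context window.
# A minimal stand-in for the task analyzed above, not the paper's setup.
from collections import Counter

def in_context_ngram_probs(context, n=2):
    """Empirical next-token distribution given the trailing (n-1)-gram."""
    prefix = tuple(context[-(n - 1):])
    counts = Counter(
        context[i + n - 1]                         # token following each match
        for i in range(len(context) - n + 1)
        if tuple(context[i:i + n - 1]) == prefix
    )
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}

# Usage: the predictor must infer "b follows a" from the prompt alone.
print(in_context_ngram_probs(["a", "b", "a", "b", "a"], n=2))  # {'b': 1.0}
```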
arXiv Detail & Related papers (2025-08-18T11:24:30Z) - Memorization-Compression Cycles Improve Generalization [0.0]
We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. We propose Gated Phase Transition (GAPT), a training algorithm that switches between memorization and compression phases. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation.
arXiv Detail & Related papers (2025-05-13T16:37:54Z) - Shrink the longest: improving latent space isotropy with symplicial geometry [0.0]
We propose a novel regularization technique based on simplicial geometry to improve the isotropy of latent representations. We demonstrate that the method leads to an increase in downstream performance while significantly lowering the anisotropy during fine-tuning.
arXiv Detail & Related papers (2025-01-09T18:44:10Z) - Geometric Dynamics of Signal Propagation Predict Trainability of
Transformers [22.25628914395565]
We investigate forward signal propagation and gradient back-propagation in deep, randomly initialized transformers.
Our approach treats the evolution of $n$ tokens as they propagate through the transformer layers.
We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by these two exponents.
arXiv Detail & Related papers (2024-03-05T01:30:34Z) - Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression [63.56922682378755]
We focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding.
The proposed adaptive aggregation generates kernel offsets to capture valid information within a content-conditioned range, aiding the transform.
Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.
arXiv Detail & Related papers (2023-08-17T01:34:51Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
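For reference, $\sigma$Reparam reparameterizes each linear weight by its spectral norm with a learnable scale, $\hat{W} = (\gamma/\sigma(W))\,W$. A minimal sketch follows; the number of power-iteration steps and the initialization are simplifying assumptions, not the paper's exact recipe:

```python
# Sketch of sigma-Reparam: each weight matrix W is used as
# (gamma / spectral_norm(W)) * W, where gamma is a learnable scalar
# (held fixed here for illustration) and the spectral norm is estimated
# by power iteration.
import numpy as np

def spectral_norm(W, n_iters=8, eps=1e-12):
    """Largest singular value of W, estimated via power iteration."""
    u = np.random.randn(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + eps
        u = W @ v
        u /= np.linalg.norm(u) + eps
    return float(u @ W @ v)

def sigma_reparam_weight(W, gamma=1.0):
    """Reparameterized weight whose spectral norm is pinned to gamma."""
    return (gamma / (spectral_norm(W) + 1e-12)) * W

W = np.random.randn(64, 64)
W_hat = sigma_reparam_weight(W, gamma=1.0)
print(spectral_norm(W_hat))  # ~1.0: largest singular value equals gamma
```

Bounding the spectral norm this way keeps attention logits from growing unboundedly, which is how the technique prevents attention entropy collapse.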
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion [66.21290235237808]
We introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states.
We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs.
Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks.
arXiv Detail & Related papers (2023-01-23T15:18:54Z) - Emulating Spatio-Temporal Realizations of Three-Dimensional Isotropic Turbulence via Deep Sequence Learning Models [24.025975236316842]
We use a data-driven approach to model a three-dimensional turbulent flow using cutting-edge Deep Learning techniques.
The accuracy of the model is assessed using statistical and physics-based metrics.
arXiv Detail & Related papers (2021-12-07T03:33:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.