Hidden Dynamics of Massive Activations in Transformer Training
- URL: http://arxiv.org/abs/2508.03616v1
- Date: Tue, 05 Aug 2025 16:29:51 GMT
- Title: Hidden Dynamics of Massive Activations in Transformer Training
- Authors: Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos
- Abstract summary: Massive activations are scalar values in transformer hidden states that achieve values orders of magnitude larger than typical activations. We present the first comprehensive analysis of massive activation development throughout transformer training. We develop a machine learning framework to predict the mathematical parameters governing this emergence from architectural specifications alone.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Massive activations are scalar values in transformer hidden states that achieve values orders of magnitude larger than typical activations and have been shown to be critical for model functionality. While prior work has characterized these phenomena in fully trained models, the temporal dynamics of their emergence during training remain poorly understood. We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. We develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Our findings demonstrate that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins.
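To make the fitting procedure concrete, here is a minimal sketch of modeling per-checkpoint massive-activation magnitudes with a five-parameter, exponentially-modulated logarithmic curve. The abstract does not give the exact functional form, so the parameterization below (alpha, beta, tau, t0, c), the parameter names, and the synthetic data are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical five-parameter emergence curve: logarithmic growth whose
# amplitude is modulated by a decaying exponential. The exponential term
# shapes the early emergence transient, while alpha and c dominate the
# steady-state magnitude. This specific form is an assumption.
import numpy as np
from scipy.optimize import curve_fit

def emergence_curve(t, alpha, beta, tau, t0, c):
    return (alpha + beta * np.exp(-t / tau)) * np.log1p(t / t0) + c

# Synthetic stand-in for top-activation magnitudes measured at training
# checkpoints (real data would come from forward passes over Pythia
# checkpoints; ~143k steps matches the Pythia training run).
steps = np.linspace(1.0, 143_000.0, 200)
rng = np.random.default_rng(0)
observed = emergence_curve(steps, 3.0, 12.0, 8_000.0, 500.0, 1.0)
observed += rng.normal(0.0, 0.5, steps.size)

# Least-squares fit recovers the five parameters from noisy measurements.
popt, _ = curve_fit(
    emergence_curve, steps, observed,
    p0=(1.0, 1.0, 10_000.0, 1_000.0, 0.0), maxfev=20_000,
)
print("fitted (alpha, beta, tau, t0, c):", np.round(popt, 2))
```

Under this reading, the prediction framework described in the abstract would then map architectural specifications (depth, width, head count, and so on) to these fitted five-parameter vectors with a standard regression model, which is where the reported split between high steady-state accuracy and moderate timing/magnitude accuracy would show up.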
Related papers
- In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
- DCIts -- Deep Convolutional Interpreter for time series [0.0]
The model is designed so one can robustly determine the optimal window size that captures all necessary interactions within the smallest possible time frame. It effectively identifies the optimal model order, balancing complexity when incorporating higher-order terms. These advancements hold significant implications for modeling and understanding dynamic systems, making the model a valuable tool for applied and computational physicists.
arXiv Detail & Related papers (2025-01-08T08:21:58Z)
- Learning Elementary Cellular Automata with Transformers [3.7013865226473848]
We show that Transformers can learn to abstract and generalize the rules governing Elementary Cellular Automata. Our analysis reveals that including future states or rule prediction in the training loss enhances the models' ability to form internal representations of the rules.
arXiv Detail & Related papers (2024-12-02T11:57:49Z)
- Self-Supervised Learning with Generative Adversarial Networks for Electron Microscopy [0.0]
We show how self-supervised pretraining facilitates efficient fine-tuning for a spectrum of downstream tasks.
We demonstrate the versatility of self-supervised pretraining across various downstream tasks in the context of electron microscopy.
arXiv Detail & Related papers (2024-02-28T12:25:01Z)
- Enhancing Dynamical System Modeling through Interpretable Machine Learning Augmentations: A Case Study in Cathodic Electrophoretic Deposition [0.8796261172196743]
We introduce a comprehensive data-driven framework aimed at enhancing the modeling of physical systems.
As a demonstrative application, we pursue the modeling of cathodic electrophoretic deposition (EPD), commonly known as e-coating.
arXiv Detail & Related papers (2024-01-16T14:58:21Z)
- Exploring Model Transferability through the Lens of Potential Energy [78.60851825944212]
Transfer learning has become crucial in computer vision tasks due to the vast availability of pre-trained deep learning models.
Existing methods for measuring the transferability of pre-trained models rely on statistical correlations between encoded static features and task labels.
We present an insightful physics-inspired approach named PED to address these challenges.
arXiv Detail & Related papers (2023-08-29T07:15:57Z)
- Active Learning of Discrete-Time Dynamics for Uncertainty-Aware Model Predictive Control [46.81433026280051]
We present a self-supervised learning approach that actively models the dynamics of nonlinear robotic systems.
Our approach showcases high resilience and generalization capabilities by consistently adapting to unseen flight conditions.
arXiv Detail & Related papers (2022-10-23T00:45:05Z)
- Physics-Inspired Temporal Learning of Quadrotor Dynamics for Accurate Model Predictive Trajectory Tracking [76.27433308688592]
Accurately modeling a quadrotor's system dynamics is critical for guaranteeing agile, safe, and stable navigation.
We present a novel Physics-Inspired Temporal Convolutional Network (PI-TCN) approach to learning a quadrotor's system dynamics purely from robot experience.
Our approach combines the expressive power of sparse temporal convolutions and dense feed-forward connections to make accurate system predictions.
arXiv Detail & Related papers (2022-06-07T13:51:35Z)
- End-to-End Learning of Hybrid Inverse Dynamics Models for Precise and Compliant Impedance Control [16.88250694156719]
We present a novel hybrid model formulation that enables us to identify fully physically consistent inertial parameters of a rigid body dynamics model.
We compare our approach against state-of-the-art inverse dynamics models on a 7-degree-of-freedom manipulator.
arXiv Detail & Related papers (2022-05-27T07:39:28Z)
- Leveraging the structure of dynamical systems for data-driven modeling [111.45324708884813]
We consider the impact of the training set and its structure on the quality of the long-term prediction.
We show how an informed design of the training set, based on invariants of the system and the structure of the underlying attractor, significantly improves the resulting models.
arXiv Detail & Related papers (2021-12-15T20:09:20Z)
- GEM: Group Enhanced Model for Learning Dynamical Control Systems [78.56159072162103]
We build effective dynamical models that are amenable to sample-based learning.
We show that learning the dynamics on a Lie algebra vector space is more effective than learning a direct state transition model.
This work sheds light on a connection between learning of dynamics and Lie group properties, which opens doors for new research directions.
arXiv Detail & Related papers (2021-04-07T01:08:18Z)
- Physics-Integrated Variational Autoencoders for Robust and Interpretable Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models.
We propose a VAE architecture in which a part of the latent space is grounded by physics.
We demonstrate generative performance improvements over a set of synthetic and real-world datasets.
arXiv Detail & Related papers (2021-02-25T20:28:52Z)