Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability
- URL: http://arxiv.org/abs/2507.02559v1
- Date: Thu, 03 Jul 2025 12:09:04 GMT
- Title: Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability
- Authors: Luca Baroni, Galvin Khara, Joachim Schaeffer, Marat Subkhankulov, Stefan Heimersheim
- Abstract summary: Layer-wise normalization (LN) is an essential component of transformer-based large language models. LN layers hinder mechanistic interpretability by introducing additional nonlinearities and increasing the interconnectedness of individual model components. We show that all LN layers can be removed from every GPT-2 model with only a small increase in validation loss.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Layer-wise normalization (LN) is an essential component of virtually all transformer-based large language models. While its effects on training stability are well documented, its role at inference time is poorly understood. Additionally, LN layers hinder mechanistic interpretability by introducing additional nonlinearities and increasing the interconnectedness of individual model components. Here, we show that all LN layers can be removed from every GPT-2 model with only a small increase in validation loss (e.g. +0.03 cross-entropy loss for GPT-2 XL). Thus, LN cannot play a substantial role in language modeling. We find that the amount of fine-tuning data needed for LN removal grows sublinearly with model parameters, suggesting scaling to larger models is feasible. We release a suite of LN-free GPT-2 models on Hugging Face. Furthermore, we test interpretability techniques on LN-free models. Direct logit attribution now gives the exact direct effect of individual components, while the accuracy of attribution patching does not significantly improve. We also confirm that GPT-2's "confidence neurons" are inactive in the LN-free models. Our work clarifies the role of LN layers in language modeling, showing that GPT-2-class models can function without LN layers. We hope that our LN-free analogs of the GPT-2 family of models will enable more precise interpretability research and improve our understanding of language models.
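To make the structural change concrete, here is a minimal sketch (assuming a HuggingFace `transformers` GPT-2 checkpoint; the class and function names are illustrative, and this is not the authors' released code) that replaces every LayerNorm with its learned affine part only, after which the model would be fine-tuned on a small slice of pre-training data as described above:

```python
# Minimal sketch: swap each LayerNorm in GPT-2 for a purely affine stand-in,
# removing the per-token normalization nonlinearity. Fine-tuning afterwards is
# what recovers the small validation-loss gap reported in the abstract.
import torch.nn as nn
from transformers import GPT2LMHeadModel

class AffineOnlyLN(nn.Module):
    """Keeps LayerNorm's learned scale and bias, drops the normalization itself."""
    def __init__(self, ln: nn.LayerNorm):
        super().__init__()
        self.weight = nn.Parameter(ln.weight.detach().clone())
        self.bias = nn.Parameter(ln.bias.detach().clone())

    def forward(self, x):
        # A fixed per-layer scale estimated on calibration data would approximate
        # the original LN more closely; omitted here for brevity.
        return x * self.weight + self.bias

def strip_layernorm(model: nn.Module) -> nn.Module:
    # Collect the LayerNorm modules first, then swap them for affine stand-ins.
    targets = [
        (parent, name, child)
        for parent in model.modules()
        for name, child in parent.named_children()
        if isinstance(child, nn.LayerNorm)
    ]
    for parent, name, ln in targets:
        setattr(parent, name, AffineOnlyLN(ln))
    return model

ln_free = strip_layernorm(GPT2LMHeadModel.from_pretrained("gpt2"))
# ... fine-tune `ln_free` on a fraction of the pre-training data to close the
# small validation-loss gap (e.g. ~+0.03 cross-entropy for GPT-2 XL).
```

After the swap, ordinary fine-tuning code applies unchanged, and direct logit attribution on the resulting model no longer has to approximate a normalization step.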
Related papers
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z) - Dynamic Gradient Sparsification Training for Few-Shot Fine-tuning of CT Lymph Node Segmentation Foundation Model [9.229857825454365]
Lymph node (LN) segmentation is critical in radiotherapy treatment and prognosis analysis, but is limited by the need for large annotated datasets. In this work, we annotated 36,106 visible LNs from 3,346 publicly available head-and-neck CT scans to establish a robust LN segmentation model (nnUNetv2). We propose Dynamic Gradient Sparsification Training (DGST), a few-shot fine-tuning approach that preserves foundational knowledge while dynamically updating the most critical parameters of the LN segmentation model with few annotations.
arXiv Detail & Related papers (2025-03-02T06:02:34Z) - You can remove GPT2's LayerNorm by fine-tuning [0.0]
The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability.
LN is a crucial component required to stabilize the training of large language models.
We show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data.
arXiv Detail & Related papers (2024-09-06T16:17:06Z) - On the Nonlinearity of Layer Normalization [5.0464797863553414]
We investigate the representation capacity of a network with layerwise composition of linear and LN transformations, referred to as LN-Net.
We show that, given $m$ samples with any label assignment, an LN-Net with only 3 neurons in each layer and $O(m)$ LN layers can correctly classify them.
arXiv Detail & Related papers (2024-06-03T12:11:34Z) - Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective [32.01426831450348]
The two-stage fine-tuning (FT) method, linear probing (LP) then fine-tuning (LP-FT), outperforms linear probing and FT alone.
We analyze the training dynamics of LP-FT for classification tasks on the basis of the neural tangent kernel (NTK) theory.
Our study demonstrates the effectiveness of LP-FT for fine-tuning language models.
arXiv Detail & Related papers (2024-05-27T01:31:40Z) - When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models [61.363259848264725]
Inheritune is a simple and effective training recipe for building smaller, more efficient language models. We show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts.
arXiv Detail & Related papers (2024-04-12T17:53:34Z) - Locality Sensitive Sparse Encoding for Learning World Models Online [29.124825481348285]
Follow-The-Leader (FTL) world models are desirable for model-based reinforcement learning.
However, achieving FTL with neural network world models requires re-training on all accumulated data at every interaction step.
We show that our world models learned online using a single pass of trajectory data either surpass or match the performance of deep world models trained with replay.
arXiv Detail & Related papers (2024-01-23T19:00:02Z) - ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection [152.01257690637064]
We show that the coefficients of long skip connections (LSCs) in UNet have a large effect on the stability of forward and backward propagation and on the robustness of UNet.
We propose an effective coefficient scaling framework, ScaleLong, that scales the coefficients of LSCs in UNet and improves its training stability.
arXiv Detail & Related papers (2023-10-20T14:45:52Z) - Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters, showing perplexity comparable to the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z) - Towards Robust k-Nearest-Neighbor Machine Translation [72.9252395037097]
k-Nearest-Neighbor Machine Translation (kNN-MT) has become an important research direction in NMT in recent years.
Its main idea is to retrieve useful key-value pairs from an additional datastore to modify translations without updating the NMT model.
However, noisy retrieved pairs can dramatically degrade model performance.
We propose a confidence-enhanced kNN-MT model with robust training to alleviate the impact of noise (a minimal sketch of the retrieve-and-interpolate step is given after this list).
arXiv Detail & Related papers (2022-10-17T07:43:39Z) - A Kernel-Based View of Language Model Fine-Tuning [94.75146965041131]
We investigate whether the Neural Tangent Kernel (NTK) describes fine-tuning of pre-trained LMs.
We show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning.
arXiv Detail & Related papers (2022-10-11T17:34:32Z) - On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Symbolic Learning to Optimize: Towards Interpretability and Scalability [113.23813868412954]
Recent studies on Learning to Optimize (L2O) suggest a promising path to automating and accelerating the optimization procedure for complicated tasks.
Existing L2O models parameterize optimization rules by neural networks, and learn those numerical rules via meta-training.
In this paper, we establish a holistic symbolic representation and analysis framework for L2O.
We propose a lightweight L2O model that can be meta-trained on large-scale problems and outperforms human-designed and tuned optimizers.
arXiv Detail & Related papers (2022-03-13T06:04:25Z)
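As referenced in the kNN-MT entry above, the sketch below illustrates the retrieve-and-interpolate step that such methods build on (a brute-force stand-in; names like `datastore_keys`, `temperature`, and `lam` are illustrative and not taken from that paper):

```python
# Minimal sketch of a kNN-MT decoding step: look up the current decoder state in
# a datastore of (decoder_state, target_token_id) pairs, turn neighbor distances
# into a token distribution, and interpolate it with the base NMT distribution.
import torch
import torch.nn.functional as F

def knn_mt_step(hidden, nmt_logits, datastore_keys, datastore_values,
                vocab_size, k=8, temperature=10.0, lam=0.5):
    """hidden: (d,) decoder state; nmt_logits: (vocab_size,) base model logits;
    datastore_keys: (N, d) stored decoder states;
    datastore_values: (N,) target token ids (long) paired with each key."""
    # Squared L2 distance from the query to every datastore key (brute force).
    dists = torch.cdist(hidden[None, :], datastore_keys).squeeze(0) ** 2
    knn_dists, knn_idx = torch.topk(-dists, k)            # k nearest neighbors
    weights = F.softmax(knn_dists / temperature, dim=0)   # closer => larger weight

    # Scatter neighbor weights onto their target tokens to form p_kNN.
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, datastore_values[knn_idx], weights)

    p_nmt = F.softmax(nmt_logits, dim=0)
    return lam * p_knn + (1.0 - lam) * p_nmt
```

In practice the datastore is searched with an approximate-nearest-neighbor index (e.g. FAISS) rather than a dense distance computation; the interpolation itself is unchanged.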
This list is automatically generated from the titles and abstracts of the papers on this site.