You can remove GPT2's LayerNorm by fine-tuning
- URL: http://arxiv.org/abs/2409.13710v2
- Date: Sun, 17 Nov 2024 22:32:53 GMT
- Title: You can remove GPT2's LayerNorm by fine-tuning
- Authors: Stefan Heimersheim
- Abstract summary: The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability.
LN is a crucial component required to stabilize the training of large language models.
We show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data.
- Abstract: The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability. LN is a crucial component required to stabilize the training of large language models, and LN or the similar RMSNorm have been used in practically all large language models based on the transformer architecture. The non-linear nature of the LN layers is a hindrance for mechanistic interpretability as it hinders interpretation of the residual stream, and makes it difficult to decompose the model into circuits. Some researchers have gone so far as to name "reasons interpretability researchers hate layer norm." In this paper we show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data. We demonstrate that this LN-free model achieves similar performance to the original model on the OpenWebText and ThePile datasets (-0.05 cross-entropy loss), and the Hellaswag benchmark (-0.5% accuracy). We provide our implementation at https://github.com/ApolloResearch/gpt2_noLN, and fine-tuned GPT2-small models at https://huggingface.co/apollo-research/gpt2_noLN. Our work not only provides a simplified model for mechanistic interpretability research, but also provides evidence that the LN layers, at inference time, do not play a crucial role in transformer models.
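As a rough illustration (not the paper's exact procedure, which removes the normalizations gradually during fine-tuning), the sketch below swaps each LayerNorm in the HuggingFace `transformers` GPT-2 implementation for a module that keeps only the learned affine transform; the subsequent fine-tuning loop is omitted.

```python
import torch.nn as nn
from transformers import GPT2LMHeadModel

class LNFree(nn.Module):
    """Keeps LayerNorm's learned scale and bias but drops the normalization itself."""
    def __init__(self, ln: nn.LayerNorm):
        super().__init__()
        self.weight = nn.Parameter(ln.weight.data.clone())
        self.bias = nn.Parameter(ln.bias.data.clone())

    def forward(self, x):
        return x * self.weight + self.bias  # no mean/variance normalization

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Swap ln_1 and ln_2 in every block, plus the final ln_f.
for block in model.transformer.h:
    block.ln_1 = LNFree(block.ln_1)
    block.ln_2 = LNFree(block.ln_2)
model.transformer.ln_f = LNFree(model.transformer.ln_f)

# ...then fine-tune on a few hundred million tokens with the usual causal-LM loss
# so the remaining weights absorb the effect of the missing normalization.
```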
Related papers
- Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective [32.01426831450348]
The two-stage fine-tuning (FT) method, linear probing (LP) then fine-tuning (LP-FT), outperforms linear probing and FT alone.
We analyze the training dynamics of LP-FT for classification tasks on the basis of the neural tangent kernel (NTK) theory.
Our study demonstrates the effectiveness of LP-FT for fine-tuning language models.
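A minimal sketch of the two-stage LP-FT recipe itself (the checkpoint name, learning rates, and the placeholder training loop below are illustrative assumptions, not taken from the paper):

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def train(model, params, lr):
    optimizer = torch.optim.AdamW(params, lr=lr)
    ...  # placeholder: iterate over the task data, compute loss, backprop, step

# Stage 1: linear probing (LP) -- freeze the backbone, train only the classifier head.
for p in model.base_model.parameters():
    p.requires_grad = False
train(model, model.classifier.parameters(), lr=1e-3)

# Stage 2: fine-tuning (FT) -- unfreeze everything and continue from the probed head,
# which keeps the backbone features closer to their pre-trained values.
for p in model.parameters():
    p.requires_grad = True
train(model, model.parameters(), lr=2e-5)
```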
arXiv Detail & Related papers (2024-05-27T01:31:40Z)
- Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance [87.19164603145056]
We propose LoRAT, a method that unveils the power of large ViT models for tracking within laboratory-level resources.
The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency.
We design an anchor-free head based solely on MLPs to adapt PETR, enabling better performance with less computational overhead.
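For context, a generic sketch of the LoRA building block being adapted: a frozen linear layer plus a trainable low-rank update. The rank, scaling, and sizes below are illustrative, and this is not the LoRAT tracker itself.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen dense weight W plus a trainable low-rank update B @ A (rank r)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # original weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 10, 768))   # only A and B receive gradients
# After training, W + scaling * B @ A can be merged back, adding no inference latency.
```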
arXiv Detail & Related papers (2024-03-08T11:41:48Z)
- HuRef: HUman-REadable Fingerprint for Large Language Models [44.9820558213721]
HuRef is a human-readable fingerprint for large language models.
It uniquely identifies the base model without interfering with training or exposing model parameters to the public.
arXiv Detail & Related papers (2023-12-08T05:01:47Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in deployment and prohibitive costs for training from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of the fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters, showing perplexity comparable to the original model.
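A toy sketch of the Tensor Train Matrix idea: the dense weight of a fully-connected layer is stored as small cores whose contraction reconstructs it. The two-core factorization, mode sizes, and rank below are invented for illustration; the paper uses the general multi-core TTM format.

```python
import torch

# Illustrative factorization of a 768x768 fully-connected weight:
# 768 = 32 * 24 on both the input and output side, TT-rank 8 (all sizes assumed).
O1, O2, I1, I2, R = 32, 24, 32, 24, 8

core1 = torch.randn(O1, I1, R) * 0.02   # (o1, i1, rank)
core2 = torch.randn(R, O2, I2) * 0.02   # (rank, o2, i2)

def ttm_weight(core1, core2):
    # W[(o1, o2), (i1, i2)] = sum_r core1[o1, i1, r] * core2[r, o2, i2]
    W = torch.einsum("air,rbj->abij", core1, core2)   # (o1, o2, i1, i2)
    return W.reshape(O1 * O2, I1 * I2)                # back to (768, 768)

x = torch.randn(4, I1 * I2)
y = x @ ttm_weight(core1, core2).T                    # behaves like a linear layer

dense_params = (O1 * O2) * (I1 * I2)                  # 589,824
tt_params = core1.numel() + core2.numel()             # 12,800 -- a large reduction
print(y.shape, dense_params, tt_params)
```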
arXiv Detail & Related papers (2023-06-05T08:38:25Z)
- Adaptive Sparsity Level during Training for Efficient Time Series Forecasting with Transformers [20.23085795744602]
We propose Adaptive Sparsity Level (PALS) to automatically seek a decent balance between loss and sparsity.
PALS draws inspiration from sparse training and during-training methods.
It introduces the novel "expand" mechanism in training sparse neural networks, allowing the model to dynamically shrink, expand, or remain stable to find a proper sparsity level.
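A deliberately simplified, hypothetical illustration of the shrink/expand mechanism (not the actual PALS criterion from the paper): raise the sparsity level while the loss keeps improving, lower it when pruning starts to hurt, and otherwise leave it unchanged.

```python
def adjust_sparsity(sparsity, prev_loss, curr_loss, step=0.05, tol=1e-3):
    """Toy shrink/expand rule, not the exact PALS criterion:
    prune more aggressively while the loss still improves,
    grow connections back when pruning starts to hurt."""
    if curr_loss < prev_loss - tol:        # still improving -> shrink the network
        sparsity = min(sparsity + step, 0.95)
    elif curr_loss > prev_loss + tol:      # degrading -> expand (recover weights)
        sparsity = max(sparsity - step, 0.0)
    return sparsity                        # otherwise remain stable

# e.g. called once per epoch, with the chosen level applied via magnitude pruning
```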
arXiv Detail & Related papers (2023-05-28T06:57:27Z)
- Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers [98.30298332661323]
This paper explores the effectiveness of model-generated signals in improving zero-shot generalization of text-to-text Transformers such as T5.
We develop a new model, METRO-T0, which is pretrained using the redesigned ELECTRA-Style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks.
Our analysis of the model's neural activation and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from a more balanced contribution of parameters and better utilization of their capacity.
arXiv Detail & Related papers (2023-05-21T21:06:23Z)
- Towards Robust k-Nearest-Neighbor Machine Translation [72.9252395037097]
k-Nearest-Neighbor Machine Translation (kNN-MT) has become an important research direction in NMT in recent years.
Its main idea is to retrieve useful key-value pairs from an additional datastore to modify translations without updating the NMT model.
However, noisy retrieved pairs can dramatically deteriorate model performance.
We propose a confidence-enhanced kNN-MT model with robust training to alleviate the impact of noise.
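A minimal sketch of the vanilla kNN-MT step that the robust variant builds on: the decoder state queries a datastore of (hidden state, target token) pairs, and the retrieved distribution is interpolated with the NMT model's. The datastore tensors, temperature, and interpolation weight are placeholders.

```python
import torch
import torch.nn.functional as F

def knn_mt_probs(hidden, p_nmt, keys, values, vocab_size, k=8, temp=10.0, lam=0.5):
    """hidden: (d,) decoder state; keys: (N, d) datastore keys;
    values: (N,) long tensor of target-token ids; p_nmt: (vocab,) model distribution."""
    dists = torch.cdist(hidden[None], keys).squeeze(0)        # L2 distance to each key
    knn = dists.topk(k, largest=False)                        # k nearest neighbours
    weights = F.softmax(-knn.values / temp, dim=0)            # closer -> more weight
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, values[knn.indices], weights)       # aggregate per token id
    return lam * p_knn + (1.0 - lam) * p_nmt                  # interpolate the two

# The robust variant additionally learns a confidence that down-weights noisy pairs.
```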
arXiv Detail & Related papers (2022-10-17T07:43:39Z)
- Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation [18.782750537161615]
We propose to share parameters across all layers, leading to a recurrently stacked neural network model.
We empirically demonstrate that the translation quality of a model that recurrently stacks a single layer 6 times, despite having significantly fewer parameters, approaches that of a model that stacks 6 layers where each layer has different parameters.
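A minimal sketch of the idea with a standard PyTorch encoder layer: a single set of layer parameters applied six times in place of six distinct layers (width, heads, and depth are illustrative).

```python
import torch
import torch.nn as nn

class RecurrentlyStackedEncoder(nn.Module):
    """One set of layer parameters, applied `depth` times (vs. depth distinct layers)."""
    def __init__(self, d_model=512, nhead=8, depth=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):   # reuse the same parameters at every depth
            x = self.layer(x)
        return x

model = RecurrentlyStackedEncoder()
out = model(torch.randn(2, 16, 512))
# Parameter count equals a single layer's, roughly 1/6 of a 6-layer stack.
```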
arXiv Detail & Related papers (2021-06-18T08:48:01Z)
- Learning to Encode Position for Transformer with Continuous Dynamical Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of the encoded results along the position index with a continuous dynamical system.
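A toy sketch of the idea: position encodings are generated as the trajectory of a learned dynamical system along the position index. The paper formulates this continuously; the plain Euler integration and the small MLP used for the dynamics below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DynamicalPositionEncoding(nn.Module):
    """p(i+1) = p(i) + h * f(p(i)): positions are states of a learned dynamical system."""
    def __init__(self, d_model=512, step=0.1):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh(),
                               nn.Linear(d_model, d_model))   # learned dynamics
        self.p0 = nn.Parameter(torch.zeros(d_model))           # initial state
        self.step = step

    def forward(self, seq_len):
        p, encodings = self.p0, []
        for _ in range(seq_len):          # Euler integration along the position index
            encodings.append(p)
            p = p + self.step * self.f(p)
        return torch.stack(encodings)     # (seq_len, d_model), added to token embeddings

pe = DynamicalPositionEncoding()(seq_len=16)
```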
arXiv Detail & Related papers (2020-03-13T00:41:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.