Effects of Parameter Norm Growth During Transformer Training: Inductive
Bias from Gradient Descent
- URL: http://arxiv.org/abs/2010.09697v4
- Date: Wed, 29 Sep 2021 18:48:40 GMT
- Title: Effects of Parameter Norm Growth During Transformer Training: Inductive
Bias from Gradient Descent
- Authors: William Merrill and Vivek Ramanujan and Yoav Goldberg and Roy Schwartz
and Noah Smith
- Abstract summary: We study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training and its implications for the emergent representations within self-attention layers.
As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions.
Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP.
- Score: 44.44543743806831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The capacity of neural networks like the widely adopted transformer is known
to be very high. Evidence is emerging that they learn successfully due to
inductive bias in the training routine, typically a variant of gradient descent
(GD). To better understand this bias, we study the tendency for transformer
parameters to grow in magnitude ($\ell_2$ norm) during training, and its
implications for the emergent representations within self attention layers.
Empirically, we document norm growth in the training of transformer language
models, including T5 during its pretraining. As the parameters grow in
magnitude, we prove that the network approximates a discretized network with
saturated activation functions. Such "saturated" networks are known to have a
reduced capacity compared to the full network family that can be described in
terms of formal languages and automata. Our results suggest saturation is a new
characterization of an inductive bias implicit in GD of particular interest for
NLP. We leverage the emergent discrete structure in a saturated transformer to
analyze the role of different attention heads, finding that some focus locally
on a small number of positions, while other heads compute global averages,
allowing counting. We believe understanding the interplay between these two
capabilities may shed further light on the structure of computation within
large transformers.
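To make the saturation claim concrete, below is a minimal NumPy sketch (not from the paper's codebase; the function names and the single `scale` factor standing in for growing parameter magnitude are illustrative assumptions). Scaling the query-key scores, as growing parameter norms would, drives the softmax weights toward a one-hot distribution over the highest-scoring position, while a head with tied scores stays uniform, i.e. computes a global average of the kind the abstract links to counting.
```python
# Illustrative sketch of attention saturation (not the paper's code).
# As attention parameters grow in magnitude, query-key scores are scaled up
# and softmax attention approaches "hard" attention: a uniform average over
# the argmax-scoring positions.
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention_weights(query, keys, scale=1.0):
    # `scale` stands in for the effect of growing parameter norm on the scores.
    scores = scale * (keys @ query)
    return softmax(scores)


rng = np.random.default_rng(0)
query = rng.normal(size=8)
keys = rng.normal(size=(5, 8))

for c in (1.0, 10.0, 100.0):
    print(f"scale={c:>6}: {np.round(attention_weights(query, keys, scale=c), 3)}")
# The weights trend toward a one-hot vector: the head focuses on one position.

# A head whose scores are tied across positions stays uniform even when
# saturated: it computes a global average over all positions, enabling counting.
tied_keys = np.tile(keys[0], (5, 1))
print(np.round(attention_weights(query, tied_keys, scale=100.0), 3))
```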
Related papers
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z) - Trained Transformers Learn Linear Models In-Context [39.56636898650966]
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL).
We show that when transformers are trained over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares.
arXiv Detail & Related papers (2023-06-16T15:50:03Z) - 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z) - The Information Pathways Hypothesis: Transformers are Dynamic
Self-Ensembles [24.52890377175555]
We propose a general-purpose training strategy for transformers that can reduce both the memory and computational cost of self-attention by 4 to 8 times during training.
We show that an ensemble of sub-models can be formed from the subsampled pathways within a network, which can achieve better performance than its densely attended counterpart.
arXiv Detail & Related papers (2023-06-02T17:28:46Z) - Deep Transformers without Shortcuts: Modifying Self-attention for
Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous for the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - Characterization of anomalous diffusion through convolutional
transformers [0.8984888893275713]
We propose a new transformer-based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from our diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
arXiv Detail & Related papers (2022-10-10T18:53:13Z) - The Nuts and Bolts of Adopting Transformer in GANs [124.30856952272913]
We investigate the properties of Transformer in the generative adversarial network (GAN) framework for high-fidelity image synthesis.
Our study leads to a new alternative design of Transformers in GAN, a convolutional neural network (CNN)-free generator termed STrans-G.
arXiv Detail & Related papers (2021-10-25T17:01:29Z)