The Information Pathways Hypothesis: Transformers are Dynamic
Self-Ensembles
- URL: http://arxiv.org/abs/2306.01705v1
- Date: Fri, 2 Jun 2023 17:28:46 GMT
- Title: The Information Pathways Hypothesis: Transformers are Dynamic
Self-Ensembles
- Authors: Md Shamim Hussain, Mohammed J. Zaki and Dharmashankar Subramanian
- Abstract summary: We propose a general-purpose training strategy for transformers that can reduce both the memory and computational cost of self-attention by 4 to 8 times during training.
We show that an ensemble of sub-models can be formed from the subsampled pathways within a network, which can achieve better performance than its densely attended counterpart.
- Score: 24.52890377175555
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers use a dense self-attention mechanism, which gives them
considerable flexibility for long-range connectivity. Over multiple layers of a deep
transformer, the number of possible connectivity patterns increases
exponentially. However, very few of these contribute to the performance of the
network, and even fewer are essential. We hypothesize that there are sparsely
connected sub-networks within a transformer, called information pathways, which
can be trained independently. However, the dynamic (i.e., input-dependent)
nature of these pathways makes it difficult to prune dense self-attention
during training. But the overall distribution of these pathways is often
predictable. We take advantage of this fact to propose Stochastically
Subsampled self-Attention (SSA) - a general-purpose training strategy for
transformers that can reduce both the memory and computational cost of
self-attention by 4 to 8 times during training while also serving as a
regularization method - improving generalization over dense training. We show
that an ensemble of sub-models can be formed from the subsampled pathways
within a network, which can achieve better performance than its densely
attended counterpart. We perform experiments on a variety of NLP, computer
vision and graph learning tasks in both generative and discriminative settings
to provide empirical evidence for our claims and show the effectiveness of the
proposed method.
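As a concrete (unofficial) illustration of the idea, the sketch below models SSA as uniform random subsampling of key/value positions during training, and forms a self-ensemble at inference by averaging several independently subsampled forward passes. The function names, the uniform sampling scheme, and the 25% keep ratio are assumptions made for illustration; the paper's actual sampling distribution over information pathways may differ.

```python
# Minimal sketch of Stochastically Subsampled self-Attention (SSA).
# Assumption: subsampling is modeled as uniform random selection of a fraction
# of key/value positions per forward pass; the paper's actual sampling
# distribution over information pathways may differ.
import torch
import torch.nn.functional as F


def ssa(q, k, v, keep_ratio=0.25, subsample=True):
    """q, k, v: (batch, seq_len, dim). Subsample keys/values when training."""
    if subsample and keep_ratio < 1.0:
        n = k.shape[1]
        n_keep = max(1, int(n * keep_ratio))
        idx = torch.randperm(n, device=k.device)[:n_keep]
        k, v = k[:, idx], v[:, idx]          # attend to a random subset only
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v


def ssa_ensemble(q, k, v, keep_ratio=0.25, n_members=4):
    """Average several independently subsampled passes to form a self-ensemble."""
    outs = [ssa(q, k, v, keep_ratio, subsample=True) for _ in range(n_members)]
    return torch.stack(outs).mean(dim=0)
```

A full attention layer would wrap `ssa` in the usual query/key/value projections; dense attention is recovered by calling it with `subsample=False` (or `keep_ratio=1.0`) at inference.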
Related papers
- A Theory for Compressibility of Graph Transformers for Transductive Learning [6.298115235439078]
Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks.
All train/test/validation samples are present during training, making the problem more akin to a semi-supervised task.
We establish some theoretical bounds on how and under what conditions the hidden dimension of these networks can be compressed.
arXiv Detail & Related papers (2024-11-20T04:20:17Z)
- Does learning the right latent variables necessarily improve in-context learning? [13.828665019247444]
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights.
In this paper, we investigate the effect of explicitly inferring task latents.
We find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance.
arXiv Detail & Related papers (2024-05-29T15:06:10Z)
- Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique.
In particular, memory consumption during training is up to one order of magnitude lower than with conventional training.
arXiv Detail & Related papers (2024-01-23T14:53:20Z)
- Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
arXiv Detail & Related papers (2023-06-26T17:58:50Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z)
- The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task.
We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality.
This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass (a toy numerical check of this equivalence appears after this list).
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performances are sub-optimal or even lag far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
- Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent [44.44543743806831]
We study the tendency for transformer parameters to grow in magnitude (norm) during training.
As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions.
Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP.
arXiv Detail & Related papers (2020-10-19T17:40:38Z)
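As a toy check of the mesa-optimization claim from "Transformers learn in-context by gradient descent" above, the sketch below verifies numerically that a softmax-free (linear) attention readout over in-context (x_i, y_i) pairs coincides with one gradient-descent step on least-squares regression starting from zero weights. The dimensions, data, and learning rate are arbitrary illustrative choices, not the paper's exact construction.

```python
# Toy numerical check: a linear self-attention readout over in-context
# (x_i, y_i) examples equals one gradient-descent step on least-squares
# regression starting from zero weights.
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 16
X = rng.normal(size=(n, d))            # in-context inputs x_i
y = X @ rng.normal(size=d)             # in-context targets y_i
x_q = rng.normal(size=d)               # query input
eta = 0.1                              # learning rate of the implicit GD step

# One GD step on L(w) = 0.5 * sum_i (w @ x_i - y_i)**2, starting from w = 0.
grad_at_zero = -(y[:, None] * X).sum(axis=0)
w_after_step = -eta * grad_at_zero
pred_gd = w_after_step @ x_q

# Linear (softmax-free) attention: keys = x_i, values = y_i, query = x_q.
pred_attn = eta * sum(y_i * (x_i @ x_q) for x_i, y_i in zip(X, y))

assert np.allclose(pred_gd, pred_attn)
```

The equivalence holds because, at w = 0, the negative gradient of the squared loss is exactly the sum of y_i x_i, so the one-step prediction reduces to a key-query dot-product weighted sum of the values.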
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.