The Information Pathways Hypothesis: Transformers are Dynamic
Self-Ensembles
- URL: http://arxiv.org/abs/2306.01705v1
- Date: Fri, 2 Jun 2023 17:28:46 GMT
- Title: The Information Pathways Hypothesis: Transformers are Dynamic
Self-Ensembles
- Authors: Md Shamim Hussain, Mohammed J. Zaki and Dharmashankar Subramanian
- Abstract summary: We propose a general-purpose training strategy for transformers that can reduce both the memory and computational cost of self-attention by 4 to 8 times during training.
We show that an ensemble of sub-models can be formed from the subsampled pathways within a network, which can achieve better performance than its densely attended counterpart.
- Score: 24.52890377175555
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers use a dense self-attention mechanism, which gives them
considerable flexibility for long-range connectivity. Over multiple layers of a deep
transformer, the number of possible connectivity patterns increases
exponentially. However, very few of these contribute to the performance of the
network, and even fewer are essential. We hypothesize that there are sparsely
connected sub-networks within a transformer, called information pathways, which
can be trained independently. However, the dynamic (i.e., input-dependent)
nature of these pathways makes it difficult to prune dense self-attention
during training. But the overall distribution of these pathways is often
predictable. We take advantage of this fact to propose Stochastically
Subsampled self-Attention (SSA) - a general-purpose training strategy for
transformers that can reduce both the memory and computational cost of
self-attention by 4 to 8 times during training while also serving as a
regularization method - improving generalization over dense training. We show
that an ensemble of sub-models can be formed from the subsampled pathways
within a network, which can achieve better performance than its densely
attended counterpart. We perform experiments on a variety of NLP, computer
vision and graph learning tasks in both generative and discriminative settings
to provide empirical evidence for our claims and show the effectiveness of the
proposed method.
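As a concrete (unofficial) illustration of the idea, the sketch below models SSA as uniform random subsampling of key/value positions during training, and forms a self-ensemble at inference by averaging several independently subsampled forward passes. The function names, the uniform sampling scheme, and the 25% keep ratio are assumptions made for illustration; the paper's actual sampling distribution over information pathways may differ.

```python
# Minimal sketch of Stochastically Subsampled self-Attention (SSA).
# Assumption: subsampling is modeled as uniform random selection of a fraction
# of key/value positions per forward pass; the paper's actual sampling
# distribution over information pathways may differ.
import torch
import torch.nn.functional as F


def ssa(q, k, v, keep_ratio=0.25, subsample=True):
    """q, k, v: (batch, seq_len, dim). Subsample keys/values when training."""
    if subsample and keep_ratio < 1.0:
        n = k.shape[1]
        n_keep = max(1, int(n * keep_ratio))
        idx = torch.randperm(n, device=k.device)[:n_keep]
        k, v = k[:, idx], v[:, idx]          # attend to a random subset only
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v


def ssa_ensemble(q, k, v, keep_ratio=0.25, n_members=4):
    """Average several independently subsampled passes to form a self-ensemble."""
    outs = [ssa(q, k, v, keep_ratio, subsample=True) for _ in range(n_members)]
    return torch.stack(outs).mean(dim=0)
```

A full attention layer would wrap `ssa` in the usual query/key/value projections; dense attention is recovered by calling it with `subsample=False` (or `keep_ratio=1.0`) at inference.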
Related papers
- A Theory for Compressibility of Graph Transformers for Transductive Learning [6.298115235439078]
Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks.
All train/test/validation samples are present during training, making the problem more akin to a semi-supervised task.
We establish some theoretical bounds on how and under what conditions the hidden dimension of these networks can be compressed.
arXiv Detail & Related papers (2024-11-20T04:20:17Z)
- Does learning the right latent variables necessarily improve in-context learning? [13.828665019247444]
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights.
In this paper, we investigate the effect of explicitly inferring task latents.
We find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance.
arXiv Detail & Related papers (2024-05-29T15:06:10Z)
- Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique.
In particular, memory consumption during training is up to one order of magnitude lower than with conventional training.
arXiv Detail & Related papers (2024-01-23T14:53:20Z)
- Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
arXiv Detail & Related papers (2023-06-26T17:58:50Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z)
- The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task.
We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality.
This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass (a toy numerical check of this equivalence appears after this list).
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performances are sub-optimal or even lag far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
- Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent [44.44543743806831]
We study the tendency for transformer parameters to grow in magnitude (norm) during training.
As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions.
Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP.
arXiv Detail & Related papers (2020-10-19T17:40:38Z)
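As a toy check of the mesa-optimization claim from "Transformers learn in-context by gradient descent" above, the sketch below verifies numerically that a softmax-free (linear) attention readout over in-context (x_i, y_i) pairs coincides with one gradient-descent step on least-squares regression starting from zero weights. The dimensions, data, and learning rate are arbitrary illustrative choices, not the paper's exact construction.

```python
# Toy numerical check: a linear self-attention readout over in-context
# (x_i, y_i) examples equals one gradient-descent step on least-squares
# regression starting from zero weights.
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 16
X = rng.normal(size=(n, d))            # in-context inputs x_i
y = X @ rng.normal(size=d)             # in-context targets y_i
x_q = rng.normal(size=d)               # query input
eta = 0.1                              # learning rate of the implicit GD step

# One GD step on L(w) = 0.5 * sum_i (w @ x_i - y_i)**2, starting from w = 0.
grad_at_zero = -(y[:, None] * X).sum(axis=0)
w_after_step = -eta * grad_at_zero
pred_gd = w_after_step @ x_q

# Linear (softmax-free) attention: keys = x_i, values = y_i, query = x_q.
pred_attn = eta * sum(y_i * (x_i @ x_q) for x_i, y_i in zip(X, y))

assert np.allclose(pred_gd, pred_attn)
```

The equivalence holds because, at w = 0, the negative gradient of the squared loss is exactly the sum of y_i x_i, so the one-step prediction reduces to a key-query dot-product weighted sum of the values.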
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.