Mnemosyne: Learning to Train Transformers with Transformers
- URL: http://arxiv.org/abs/2302.01128v3
- Date: Fri, 16 Jun 2023 20:15:43 GMT
- Title: Mnemosyne: Learning to Train Transformers with Transformers
- Authors: Deepali Jain, Krzysztof Marcin Choromanski, Avinava Dubey, Sumeet
Singh, Vikas Sindhwani, Tingnan Zhang, Jie Tan
- Abstract summary: We show that Mnemosyne can successfully train Transformers while using simple meta-training strategies that require minimal computational resources.
Mnemosyne provides space complexity comparable to that of its hand-designed first-order counterparts, which allows it to scale to training larger sets of parameters.
- Score: 18.36543176998175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose a new class of learnable optimizers, called
\textit{Mnemosyne}. It is based on the novel spatio-temporal low-rank implicit
attention Transformers that can learn to train entire neural network
architectures, including other Transformers, without any task-specific
optimizer tuning. We show that Mnemosyne: (a) outperforms popular LSTM
optimizers (also with new feature engineering to mitigate catastrophic
forgetting of LSTMs), (b) can successfully train Transformers while using
simple meta-training strategies that require minimal computational resources,
(c) matches the accuracy of SOTA hand-designed optimizers with carefully tuned
hyper-parameters (often producing top-performing models). Furthermore,
Mnemosyne provides space complexity comparable to that of its hand-designed
first-order counterparts, which allows it to scale to training larger sets of
parameters. We conduct an extensive empirical evaluation of Mnemosyne on: (a)
fine-tuning a wide range of Vision Transformers (ViTs) from medium-size
architectures to massive ViT-Hs (36 layers, 16 heads), (b) pre-training BERT
models and (c) soft prompt-tuning large 11B+ T5XXL models. We complement our
results with a comprehensive theoretical analysis of the compact associative
memory used by Mnemosyne, which we believe has not been done before.
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- On Limitation of Transformer for Learning HMMs [31.128172929754058]
This paper investigates the performance of Transformers in learning Hidden Markov Models (HMMs).
We show that Transformers consistently underperform Recurrent Neural Networks (RNNs) in both training speed and testing accuracy across all tested HMM models.
Our experiments further reveal a relation between the depth of Transformers and the longest sequence length they can effectively learn, depending on the type and complexity of the HMM.
arXiv Detail & Related papers (2024-06-06T13:59:51Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter-count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We develop a global convergence theory for encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
- End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTMs.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
- Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers [4.899818550820576]
We construct a Legendre Memory Unit (LMU) based model that introduces a general prior for sequence processing.
We show that our new architecture attains the same accuracy as transformers with 10x fewer tokens.
arXiv Detail & Related papers (2021-10-05T23:20:37Z)
- Transformer Networks for Trajectory Forecasting [11.802437934289062]
We propose the novel use of Transformer Networks for trajectory forecasting.
This is a fundamental switch from the sequential step-by-step processing of LSTMs to the only-attention-based memory mechanisms of Transformers.
arXiv Detail & Related papers (2020-03-18T09:17:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.