Language Modeling using LMUs: 10x Better Data Efficiency or Improved
Scaling Compared to Transformers
- URL: http://arxiv.org/abs/2110.02402v1
- Date: Tue, 5 Oct 2021 23:20:37 GMT
- Title: Language Modeling using LMUs: 10x Better Data Efficiency or Improved
Scaling Compared to Transformers
- Authors: Narsimha Chilkuri, Eric Hunsberger, Aaron Voelker, Gurshaant Malik,
Chris Eliasmith
- Abstract summary: We construct a Legendre Memory Unit based model that introduces a general prior for sequence processing.
We show that our new architecture attains the same accuracy as transformers with 10x fewer tokens.
- Score: 4.899818550820576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have demonstrated that the performance of transformers on the
task of language modeling obeys a power-law relationship with model size over
six orders of magnitude. While transformers exhibit impressive scaling, their
performance hinges on processing large amounts of data, and their computational
and memory requirements grow quadratically with sequence length. Motivated by
these considerations, we construct a Legendre Memory Unit based model that
introduces a general prior for sequence processing and exhibits an $O(n)$ and
$O(n \ln n)$ (or better) dependency for memory and computation respectively.
Over three orders of magnitude, we show that our new architecture attains the
same accuracy as transformers with 10x fewer tokens. We also show that for the
same amount of training our model improves the loss over transformers about as
much as transformers improve over LSTMs. Additionally, we demonstrate that
adding global self-attention complements our architecture and the augmented
model improves performance even further.
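As a rough illustration of where the $O(n)$ and $O(n \ln n)$ figures come from, the sketch below builds the standard LMU state-space matrices (following the original Legendre Memory Unit formulation) and evaluates the same linear memory either as a sequential recurrence or as an FFT convolution with its impulse response. The dimensions, $\theta$, and discretization choice here are illustrative assumptions, not the configuration used in this paper's language models.

```python
# Minimal LMU memory sketch: O(n) recurrence vs. O(n log n) FFT convolution.
# Hyperparameters (d, theta, dt) are illustrative, not the paper's settings.
import numpy as np
from scipy.linalg import expm

def lmu_matrices(d, theta):
    """Continuous-time LMU memory matrices: theta * m'(t) = A m(t) + B u(t)."""
    q = np.arange(d, dtype=np.float64)
    r = (2.0 * q + 1.0)[:, None] / theta
    i, j = np.meshgrid(q, q, indexing="ij")
    A = r * np.where(i < j, -1.0, (-1.0) ** (np.abs(i - j) + 1.0))
    B = r * ((-1.0) ** q)[:, None]
    return A, B

def discretize_zoh(A, B, dt=1.0):
    """Zero-order-hold discretization: m[t] = Ad @ m[t-1] + Bd * u[t]."""
    d = A.shape[0]
    M = expm(dt * np.block([[A, B], [np.zeros((1, d + 1))]]))
    return M[:d, :d], M[:d, d:]

def lmu_recurrent(u, Ad, Bd):
    """O(n) sequential evaluation of the memory states."""
    m, states = np.zeros(Ad.shape[0]), []
    for u_t in u:
        m = Ad @ m + Bd[:, 0] * u_t
        states.append(m)
    return np.stack(states)

def lmu_fft(u, Ad, Bd):
    """O(n log n) evaluation: convolve u with the impulse response H[t] = Ad^t Bd."""
    n, d = len(u), Ad.shape[0]
    H, h = np.zeros((n, d)), Bd[:, 0].copy()
    for t in range(n):            # impulse response, precomputable per sequence length
        H[t], h = h, Ad @ h
    U = np.fft.rfft(u, 2 * n)
    return np.fft.irfft(U[:, None] * np.fft.rfft(H, 2 * n, axis=0), 2 * n, axis=0)[:n]

u = np.sin(np.linspace(0, 8 * np.pi, 256))                  # toy scalar input sequence
Ad, Bd = discretize_zoh(*lmu_matrices(d=32, theta=64.0))
assert np.allclose(lmu_recurrent(u, Ad, Bd), lmu_fft(u, Ad, Bd), atol=1e-8)
```

The equivalence check at the end is the point: because the memory is a fixed linear system, training can use the parallel FFT form while inference can fall back to the cheap recurrence.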
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture (Phi-Mamba) using only 3B tokens, and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
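The summary above is result-level; for orientation only, the snippet below shows a generic logit-matching distillation loss of the kind such Transformer-to-SSM pipelines end with. It is a simplified sketch, not MOHAWK's actual multi-stage alignment procedure, and the logit tensors are hypothetical.

```python
# Generic knowledge-distillation loss (teacher -> student logit matching).
# Simplified sketch only; NOT MOHAWK's full multi-stage procedure.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student predictions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```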
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging [34.643717080240584]
We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size.
Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations.
Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity as much deeper transformer models.
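A minimal sketch of such a depth-weighted averaging step, assuming a generic `block_factory` that returns standard transformer blocks; the initialization and wiring details are placeholders rather than the paper's exact recipe.

```python
# Depth-weighted averaging after each block, in the spirit of DenseFormer.
import torch
import torch.nn as nn

class DepthWeightedAverage(nn.Module):
    """Learned weighted average over the embedding and all block outputs so far."""
    def __init__(self, n_inputs):
        super().__init__()
        alpha = torch.zeros(n_inputs)
        alpha[-1] = 1.0                      # start as identity: only the newest output
        self.alpha = nn.Parameter(alpha)

    def forward(self, history):              # history: list of (B, T, D) tensors
        stacked = torch.stack(history)       # (n_inputs, B, T, D)
        return torch.einsum("k,kbtd->btd", self.alpha, stacked)

class DenseFormerStack(nn.Module):
    def __init__(self, block_factory, n_layers):
        super().__init__()
        self.blocks = nn.ModuleList(block_factory() for _ in range(n_layers))
        self.dwa = nn.ModuleList(DepthWeightedAverage(i + 2) for i in range(n_layers))

    def forward(self, x):
        history, current = [x], x            # x: output of the embedding layer
        for block, dwa in zip(self.blocks, self.dwa):
            history.append(block(current))
            current = dwa(history)           # weighted average of all representations so far
        return current
```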
arXiv Detail & Related papers (2024-02-04T21:44:09Z)
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require a large amount of memory and can be quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
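To show why RNN-style decoding is attractive here, the toy below keeps a fixed-size state per step regardless of sequence length. It is a generic decayed linear-attention update used purely as an illustration of constant-memory inference, not RWKV's actual time-mixing formula, and all dimensions and parameters are made up.

```python
# Toy recurrence: O(1) state per generated token, independent of context length.
import torch

def recurrent_step(state, k, v, decay):
    """state: (D, D) running sum of outer(v, k); k, v, decay: (D,)."""
    return decay[None, :] * state + torch.outer(v, k)   # exponential forgetting

def readout(state, q):
    return state @ q                                     # (D,)

D = 8
state = torch.zeros(D, D)
decay = torch.sigmoid(torch.randn(D))                    # hypothetical per-channel decay
for _ in range(16):                                      # memory stays constant per token
    k, v, q = torch.randn(D), torch.randn(D), torch.randn(D)
    state = recurrent_step(state, k, v, decay)
    y = readout(state, q)
```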
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- N-Grammer: Augmenting Transformers with latent n-grams [35.39961549040385]
We propose a simple yet effective modification to the Transformer architecture inspired by the literature in statistical language modeling, by augmenting the model with n-grams that are constructed from a discrete latent representation of the text sequence.
We evaluate our model, the N-Grammer, on language modeling on the C4 data-set as well as text classification on the SuperGLUE data-set, and find that it outperforms several strong baselines such as the Transformer and the Primer.
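A rough sketch of the idea follows; the cluster count, hash-table size, hashing scheme, and projection are assumptions for illustration, and the paper's actual recipe differs in details such as how the latent codes are learned and how the n-gram embeddings are combined.

```python
# Augment token representations with hashed bigram embeddings built from a
# discrete latent code (illustrative simplification, not the exact N-Grammer).
import torch
import torch.nn as nn

class LatentBigramAugment(nn.Module):
    def __init__(self, d_model, n_clusters=1024, table_size=2**18, d_ngram=64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_clusters, d_model))  # discretizer
        self.ngram_table = nn.Embedding(table_size, d_ngram)
        self.proj = nn.Linear(d_model + d_ngram, d_model)
        self.n_clusters, self.table_size = n_clusters, table_size

    def forward(self, x):                                    # x: (B, T, D)
        B, T, D = x.shape
        # Nearest codebook entry gives each position a discrete latent ID.
        ids = torch.cdist(x.reshape(-1, D), self.codebook).argmin(-1).view(B, T)
        prev = torch.roll(ids, shifts=1, dims=1)
        prev[:, 0] = 0                                       # no left context at position 0
        bigram = (prev * self.n_clusters + ids) % self.table_size   # hashed bigram ID
        return self.proj(torch.cat([x, self.ngram_table(bigram)], dim=-1))
```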
arXiv Detail & Related papers (2022-07-13T17:18:02Z)
- Hierarchical Transformers Are More Efficient Language Models [19.061388006885686]
Transformer models yield impressive results on many NLP and sequence modeling tasks.
Remarkably, Transformers can handle long sequences, which allows them to produce long, coherent outputs.
We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences.
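To make the hierarchy concrete, here is a generic illustration of shortening the sequence between transformer stages and upsampling back to full resolution; the pooling choice, shortening factor, and residual combination are assumptions, not the paper's exact mechanism.

```python
# Generic sequence shortening/upsampling between transformer stages.
import torch

def shorten(x: torch.Tensor, factor: int) -> torch.Tensor:
    """(B, T, D) -> (B, T // factor, D) by average pooling over time."""
    B, T, D = x.shape
    return x.view(B, T // factor, factor, D).mean(dim=2)

def upsample(x: torch.Tensor, factor: int) -> torch.Tensor:
    """(B, T', D) -> (B, T' * factor, D) by repeating each shortened position."""
    return x.repeat_interleave(factor, dim=1)

x = torch.randn(2, 16, 8)
inner = shorten(x, 4)            # middle layers attend over 4x fewer positions
y = upsample(inner, 4) + x       # residual from the full-resolution stream
```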
arXiv Detail & Related papers (2021-10-26T14:00:49Z)
- Primer: Searching for Efficient Transformers for Language Modeling [79.2677566332444]
Training and inference costs of large Transformer models have grown rapidly and become expensive.
Here we aim to reduce the costs of Transformers by searching for a more efficient variant.
We identify an architecture, named Primer, that has a smaller training cost than the original Transformer.
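The summary does not say what the search found; as a point of reference drawn from the Primer paper itself rather than this summary, one of its simplest discovered modifications is squaring the ReLU activation in the feed-forward block, sketched below.

```python
# Squared ReLU, one of the modifications identified by the Primer search.
import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(x) ** 2
```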
arXiv Detail & Related papers (2021-09-17T17:50:39Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
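A minimal sketch of pairing a grouped/depthwise temporal convolution (local interactions) with standard self-attention (global interactions); the kernel size, ordering, and normalization here are illustrative assumptions rather than GroupBERT's exact layer layout.

```python
# Local (grouped convolution) and global (self-attention) mixing in one layer.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, d_model, kernel_size=7, groups=None):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2,
                              groups=groups or d_model)      # depthwise by default
        self.act = nn.GELU()

    def forward(self, x):                    # x: (B, T, D)
        y = self.norm(x).transpose(1, 2)     # (B, D, T) for Conv1d
        y = self.act(self.conv(y)).transpose(1, 2)
        return x + y                         # residual: local token mixing

class LocalGlobalLayer(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.local = ConvModule(d_model)
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        x = self.local(x)                               # local interactions via convolution
        h = self.norm(x)
        y, _ = self.attn(h, h, h, need_weights=False)   # global interactions via attention
        return x + y
```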
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all generated summaries) and is not responsible for any consequences of its use.