Shatter: An Efficient Transformer Encoder with Single-Headed
Self-Attention and Relative Sequence Partitioning
- URL: http://arxiv.org/abs/2108.13032v1
- Date: Mon, 30 Aug 2021 07:42:12 GMT
- Authors: Ran Tian, Joshua Maynez, Ankur P. Parikh
- Abstract summary: The Transformer architecture, based on self-attention, is the foundation of large pretrained models such as BERT.
We present an alternative self-attention architecture, Shatter, that more efficiently encodes sequence information.
We conduct extensive experiments showing that Shatter achieves better performance than BERT.
- Score: 14.164984597158501
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The highly popular Transformer architecture, based on self-attention, is the
foundation of large pretrained models such as BERT, which have become an
enduring paradigm in NLP. While powerful, the computational resources and time
required to pretrain such models can be prohibitive. In this work, we present
an alternative self-attention architecture, Shatter, that more efficiently
encodes sequence information by softly partitioning the space of relative
positions and applying different value matrices to different parts of the
sequence. This mechanism further allows us to simplify the multi-headed
attention in Transformer to single-headed. We conduct extensive experiments
showing that Shatter achieves better performance than BERT, with pretraining
being faster per step (15% on TPU), converging in fewer steps, and offering
considerable memory savings (>50%). Put together, Shatter can be pretrained on
8 V100 GPUs in 7 days, and match the performance of BERT_Base -- making the
cost of pretraining much more affordable.
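
The mechanism described in the abstract (one attention head, with different value matrices applied to softly partitioned ranges of relative position) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the partition function, so the softmax over Gaussian-shaped bumps, the `centers` and `width` parameters, and all function names below are hypothetical choices made only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_partition(rel, centers, width):
    """Soft membership of relative offset `rel` in each part (weights sum to 1).

    Assumption: Gaussian-shaped bumps normalized by a softmax; the paper's
    actual partition function may differ.
    """
    return softmax([-(((rel - c) / width) ** 2) for c in centers])


def single_head_partitioned_attention(x, w_q, w_k, w_vs, centers, width=2.0):
    """Single-headed attention with one value matrix per relative-position part."""
    n = len(x)
    d_k = len(w_q[0])
    q, k = matmul(x, w_q), matmul(x, w_k)
    values = [matmul(x, w_v) for w_v in w_vs]  # one value projection per part
    d_out = len(w_vs[0][0])
    out = []
    for i in range(n):
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k) for j in range(n)
        ]
        attn = softmax(scores)  # a single attention distribution per query
        row = [0.0] * d_out
        for j in range(n):
            parts = soft_partition(j - i, centers, width)
            for weight, v in zip(parts, values):
                for c in range(d_out):
                    # Each part contributes its own value vector, scaled by
                    # how strongly offset (j - i) belongs to that part.
                    row[c] += attn[j] * weight * v[j][c]
        out.append(row)
    return out


if __name__ == "__main__":
    rng = random.Random(0)

    def rand(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    x = rand(5, 4)  # 5 tokens, model width 4
    w_vs = [rand(4, 3) for _ in range(3)]  # parts: left, near, right context
    out = single_head_partitioned_attention(
        x, rand(4, 3), rand(4, 3), w_vs, centers=[-3, 0, 3]
    )
    print(len(out), len(out[0]))  # 5 3
```

Because the partition weights sum to one for every offset, using the same value matrix for all parts reduces this sketch to ordinary single-head attention; the parts only matter once their value matrices differ.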
Related papers
- Symmetric Dot-Product Attention for Efficient Training of BERT Language Models [5.838117137253223]
We propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture.
When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation.
arXiv Detail & Related papers (2024-06-10T15:24:15Z) - Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z) - Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) uses blockwise computation of self-attention, fused with the feedforward network, to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - A Time Series is Worth 64 Words: Long-term Forecasting with Transformers [4.635547236305835]
We propose an efficient design of Transformer-based models for time series forecasting and self-supervised representation learning.
It is based on two key components: (i) segmentation of time series into subseries-level patches which serve as input tokens to the Transformer.
PatchTST can improve the long-term forecasting accuracy significantly when compared with that of SOTA Transformer-based models.
arXiv Detail & Related papers (2022-11-27T05:15:42Z) - Primer: Searching for Efficient Transformers for Language Modeling [79.2677566332444]
Training and inference costs of large Transformer models have grown rapidly and become expensive.
Here we aim to reduce the costs of Transformers by searching for a more efficient variant.
We identify an architecture, named Primer, that has a smaller training cost than the original Transformer.
arXiv Detail & Related papers (2021-09-17T17:50:39Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
arXiv Detail & Related papers (2021-05-09T03:32:48Z) - Easy and Efficient Transformer : Scalable Inference Solution For large
NLP mode [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) is proposed.
EET achieves a 1.5-15x speedup over the state of the art, varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z) - Funnel-Transformer: Filtering out Sequential Redundancy for Efficient
Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)