Hierarchical Transformers Are More Efficient Language Models
- URL: http://arxiv.org/abs/2110.13711v1
- Date: Tue, 26 Oct 2021 14:00:49 GMT
- Title: Hierarchical Transformers Are More Efficient Language Models
- Authors: Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser,
Yuhuai Wu, Christian Szegedy, Henryk Michalewski
- Abstract summary: Transformer models yield impressive results on many NLP and sequence modeling tasks.
Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs.
We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences.
- Score: 19.061388006885686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models yield impressive results on many NLP and sequence modeling
tasks. Remarkably, Transformers can handle long sequences which allows them to
produce long coherent outputs: full paragraphs produced by GPT-3 or
well-structured images produced by DALL-E. These large language models are
impressive but also very inefficient and costly, which limits their
applications and accessibility. We postulate that having an explicit
hierarchical architecture is the key to Transformers that efficiently handle
long sequences. To verify this claim, we first study different ways to
downsample and upsample activations in Transformers so as to make them
hierarchical. We use the best performing upsampling and downsampling layers to
create Hourglass - a hierarchical Transformer language model. Hourglass
improves upon the Transformer baseline given the same amount of computation and
can yield the same results as Transformers more efficiently. In particular,
Hourglass sets new state-of-the-art for Transformer models on the ImageNet32
generation task and improves language modeling efficiency on the widely studied
enwik8 benchmark.
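The hierarchy described in the abstract can be made concrete with a minimal sketch: a few Transformer layers run at full resolution, the middle of the stack runs on a k-times shorter sequence obtained by pooling activations, and the result is upsampled back to full length and merged with a residual from the full-resolution stream. The PyTorch sketch below is illustrative only and not the paper's exact layers: the class name HourglassSketch, mean-pool downsampling, repeat-interleave upsampling, the layer counts, and the omission of positional encodings and of the paper's causality-preserving shift are all simplifying assumptions made here.

```python
import torch
import torch.nn as nn


def causal_mask(seq_len: int) -> torch.Tensor:
    # Additive float mask: -inf above the diagonal blocks attention to future positions.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)


class HourglassSketch(nn.Module):
    """Illustrative hourglass-shaped Transformer LM (hypothetical layer choices).

    Stack shape: n_pre full-resolution layers -> n_mid layers on a k-times
    shorter sequence -> n_post full-resolution layers, with a residual
    connection bridging the shortened middle.
    """

    def __init__(self, vocab_size=256, d_model=128, n_pre=1, n_mid=4, n_post=1, k=4):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab_size, d_model)

        def make_layer() -> nn.TransformerEncoderLayer:
            return nn.TransformerEncoderLayer(
                d_model, nhead=4, dim_feedforward=4 * d_model, batch_first=True)

        self.pre = nn.ModuleList([make_layer() for _ in range(n_pre)])
        self.mid = nn.ModuleList([make_layer() for _ in range(n_mid)])
        self.post = nn.ModuleList([make_layer() for _ in range(n_post)])
        self.to_logits = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n = tokens.shape                      # n is assumed divisible by k
        x = self.embed(tokens)                   # positional encodings omitted for brevity
        full_mask = causal_mask(n)
        for layer in self.pre:                   # full-resolution layers
            x = layer(x, src_mask=full_mask)
        skip = x                                 # residual around the shortened middle

        # Downsample: mean-pool every k consecutive activations (one simple option;
        # the paper studies several downsampling and upsampling variants).
        short = x.reshape(b, n // self.k, self.k, -1).mean(dim=2)
        short_mask = causal_mask(n // self.k)
        for layer in self.mid:                   # most compute runs on n/k positions
            short = layer(short, src_mask=short_mask)

        # Upsample: repeat each shortened position k times, then add the skip.
        # (A real autoregressive model must also prevent pooled positions from
        # leaking future tokens; the paper handles this by shifting activations,
        # a detail omitted in this sketch.)
        x = torch.repeat_interleave(short, self.k, dim=1) + skip
        for layer in self.post:
            x = layer(x, src_mask=full_mask)
        return self.to_logits(x)                 # (batch, n, vocab_size)


if __name__ == "__main__":
    model = HourglassSketch()
    logits = model(torch.randint(0, 256, (2, 64)))
    print(logits.shape)                          # torch.Size([2, 64, 256])
```

Because self-attention cost grows roughly quadratically with sequence length, running the bulk of the layers on n/k positions reduces their attention cost by about a factor of k^2, which is where the efficiency gain claimed in the abstract comes from.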
Related papers
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- I3D: Transformer architectures with input-dependent dynamic depth for speech recognition [41.35563331283372]
We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs.
We also present interesting analysis on the gate probabilities and the input-dependency, which helps us better understand deep encoders.
arXiv Detail & Related papers (2023-03-14T04:47:00Z)
- Foundation Transformers [105.06915886136524]
We call for the development of Foundation Transformer for true general-purpose modeling.
In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal.
arXiv Detail & Related papers (2022-10-12T17:16:27Z)
- SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
Swin Transformer set a new record in various vision tasks by using hierarchical architecture and shifted windows.
We design a lightweight yet effective transformer model, called SSformer.
Experimental results show the proposed SSformer yields comparable mIoU performance with state-of-the-art models.
arXiv Detail & Related papers (2022-08-03T12:57:00Z)
- Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study become out of reach.
We propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z)
- Vis-TOP: Visual Transformer Overlay Processor [9.80151619872144]
The Transformer has achieved good results in Natural Language Processing (NLP) and has also started to expand into Computer Vision (CV).
We propose Vis-TOP, an overlay processor for various visual Transformer models.
Vis-TOP summarizes the characteristics of all visual Transformer models and implements a three-layer and two-level transformation structure.
arXiv Detail & Related papers (2021-10-21T08:11:12Z)
- Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers [4.899818550820576]
We construct a Legendre Memory Unit based model that introduces a general prior for sequence processing.
We show that our new architecture attains the same accuracy as transformers with 10x fewer tokens.
arXiv Detail & Related papers (2021-10-05T23:20:37Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long-Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- Efficient Transformers: A Survey [98.23264445730645]
Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning.
This paper characterizes a large and thoughtful selection of recent efficiency-flavored "X-former" models.
arXiv Detail & Related papers (2020-09-14T20:38:14Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)