DeLighT: Deep and Light-weight Transformer
- URL: http://arxiv.org/abs/2008.00623v2
- Date: Thu, 11 Feb 2021 21:30:28 GMT
- Title: DeLighT: Deep and Light-weight Transformer
- Authors: Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi
- Abstract summary: We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters.
DeLighT more efficiently allocates parameters both (1) within each Transformer block using the DeLighT transformation, a deep and light-weight transformation, and (2) across blocks using block-wise scaling.
- Score: 116.9850555964728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a deep and light-weight transformer, DeLighT, that delivers
similar or better performance than standard transformer-based models with
significantly fewer parameters. DeLighT more efficiently allocates parameters
both (1) within each Transformer block using the DeLighT transformation, a deep
and light-weight transformation, and (2) across blocks using block-wise
scaling, which allows for shallower and narrower DeLighT blocks near the input
and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks
are 2.5 to 4 times deeper than standard transformer models and yet have fewer
parameters and operations. Experiments on benchmark machine translation and
language modeling tasks show that DeLighT matches or improves the performance
of baseline Transformers with 2 to 3 times fewer parameters on average. Our
source code is available at: https://github.com/sacmehta/delight
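As a rough sketch of the block-wise scaling idea described in the abstract, the snippet below allocates a per-block depth that grows from the input side to the output side. The linear schedule and the parameter names (num_blocks, n_min, n_max) are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of block-wise scaling: give DeLighT blocks near the output
# more depth than blocks near the input. The linear schedule and the names
# n_min, n_max, num_blocks are illustrative assumptions, not the paper's code.

def blockwise_depths(num_blocks: int, n_min: int = 4, n_max: int = 8) -> list[int]:
    """Number of layers in each block's DeLighT transformation, growing linearly
    from n_min (near the input) to n_max (near the output)."""
    if num_blocks == 1:
        return [n_max]
    return [
        round(n_min + (n_max - n_min) * b / (num_blocks - 1))
        for b in range(num_blocks)
    ]

if __name__ == "__main__":
    # Shallow, narrow blocks near the input; deeper, wider blocks near the output.
    print(blockwise_depths(12))  # -> [4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 8, 8]
```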
Related papers
- Efficient Visual Transformer by Learnable Token Merging [8.905020033545643]
We propose a novel transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer.
LTM-Transformer is compatible with many popular and compact transformer networks.
It yields compact and efficient visual transformers with prediction accuracy comparable to or much better than that of the original visual transformers.
arXiv Detail & Related papers (2024-07-21T17:09:19Z) - Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed Mixture-of-Modules (MoM).
MoM is motivated by the intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z) - MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter-count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z) - ByteTransformer: A High-Performance Transformer Boosted for
Variable-Length Inputs [6.9136984255301]
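A toy parameter count makes the layer-sharing claim concrete. The sketch below approximates a standard Transformer layer as 12·d_model² weights (attention plus feed-forward, biases and embeddings ignored); the dimensions are arbitrary and the numbers are not from the MoEUT paper.

```python
# Toy illustration of why layer-sharing shrinks the parameter count: a standard
# transformer layer has roughly 12 * d_model**2 weights (4*d^2 for attention plus
# 8*d^2 for the feed-forward block, biases and embeddings ignored). Sharing one
# layer's weights across all depths keeps the depth but pays for weights only once.
# Numbers are illustrative assumptions, not figures from the MoEUT paper.

def layer_params(d_model: int) -> int:
    return 12 * d_model ** 2

d_model, depth = 1024, 24
unshared = depth * layer_params(d_model)  # independent weights per layer
shared = layer_params(d_model)            # one set of weights reused `depth` times
print(f"unshared: {unshared / 1e6:.0f}M params, shared: {shared / 1e6:.0f}M params")
# -> unshared: 302M params, shared: 13M params (approximate)
```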
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs [6.9136984255301]
We present ByteTransformer, a high-performance transformer boosted for variable-length inputs.
ByteTransformer surpasses the state-of-the-art Transformer frameworks, such as PyTorch JIT, XLA, Tencent TurboTransformer and NVIDIA FasterTransformer.
arXiv Detail & Related papers (2022-10-06T16:57:23Z) - Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z) - DeepNet: Scaling Transformers to 1,000 Layers [106.33669415337135]
- DeepNet: Scaling Transformers to 1,000 Layers [106.33669415337135]
We introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer.
In-depth theoretical analysis shows that model updates can be bounded in a stable way.
We successfully scale Transformers up to 1,000 layers without difficulty, which is one order of magnitude deeper than previous deep Transformers.
arXiv Detail & Related papers (2022-03-01T15:36:38Z) - Sparse is Enough in Scaling Transformers [12.561317511514469]
- Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study become out of reach.
We propose Scaling Transformers, a family of next-generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z) - Scalable Transformers for Neural Machine Translation [86.4530299266897]
The Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallelizable training for sequence generation.
We propose novel Scalable Transformers, which naturally contain sub-Transformers of different scales and share parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z) - Lite Transformer with Long-Short Range Attention [31.946796118788285]
- Lite Transformer with Long-Short Range Attention [31.946796118788285]
We present an efficient mobile NLP architecture, Lite Transformer, to facilitate deploying mobile NLP applications on edge devices.
Lite Transformer outperforms the standard transformer on WMT'14 English-French by 1.2/1.7 BLEU under constrained resources.
Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU for the mobile NLP setting without the costly architecture search that requires more than 250 GPU years.
arXiv Detail & Related papers (2020-04-24T17:52:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.