Sparse Universal Transformer
- URL: http://arxiv.org/abs/2310.07096v1
- Date: Wed, 11 Oct 2023 00:38:57 GMT
- Title: Sparse Universal Transformer
- Authors: Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan
- Abstract summary: The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.
This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism.
- Score: 64.78045820484299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Universal Transformer (UT) is a variant of the Transformer that shares
parameters across its layers. Empirical evidence shows that UTs have better
compositional generalization than Vanilla Transformers (VTs) in formal language
tasks, and parameter sharing also gives them better parameter efficiency than
VTs. Despite these advantages, scaling up UT parameters is much more compute-
and memory-intensive than scaling up a VT. This paper proposes the Sparse
Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE)
and a new stick-breaking-based dynamic halting mechanism to reduce the UT's
computational complexity while retaining its parameter efficiency and
generalization ability. Experiments show that SUT matches strong baseline
models on WMT'14 while using only half the computation and parameters, and
achieves strong generalization results on formal language tasks (logical
inference and CFQ). The new halting mechanism also enables around a 50%
reduction in computation during inference with very little performance decrease
on formal language tasks.
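As a concrete illustration of the two ingredients named above, the following is a minimal PyTorch sketch of (a) a top-k Sparse Mixture-of-Experts feed-forward block and (b) a stick-breaking halting distribution over the shared computation steps of a UT. All module names, dimensions, and routing details here are illustrative assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch only: a top-k SMoE feed-forward block and a
# stick-breaking halting distribution. Names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoEFFN(nn.Module):
    """Feed-forward block that routes each token to its top-k experts."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (batch, seq, d_model)
        gate = F.softmax(self.router(x), dim=-1)   # (batch, seq, n_experts)
        top_w, top_i = gate.topk(self.k, dim=-1)   # keep k expert weights per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[..., slot] == e       # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += top_w[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


def stick_breaking_halting(halt_logits):
    """Map per-step halting logits (batch, seq, n_steps) to a distribution
    over steps via stick breaking: p_t = a_t * prod_{i<t} (1 - a_i)."""
    alpha = torch.sigmoid(halt_logits)                     # a_t in (0, 1)
    survive = torch.cumprod(1.0 - alpha, dim=-1)           # prod_{i<=t} (1 - a_i)
    survive = F.pad(survive[..., :-1], (1, 0), value=1.0)  # shift: product over i < t only
    return alpha * survive                                 # sums to at most 1 over steps


# Usage sketch: one shared SMoE block and halting weights over 6 repeated steps.
block = TopKMoEFFN()
x = torch.randn(2, 16, 512)
y = block(x)                                        # (2, 16, 512)
p = stick_breaking_halting(torch.randn(2, 16, 6))   # per-token weights over steps
```

In a UT-style model, the same shared block would be applied repeatedly, with the stick-breaking weights deciding per token how much each repeated step contributes and when computation can stop during inference; the exact way SUT combines these pieces is described in the paper.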
Related papers
- HUT: A More Computation Efficient Fine-Tuning Method With Hadamard Updated Transformation [6.954348219088321]
Fine-tuning pre-trained language models for downstream tasks has achieved impressive results in NLP.
Fine-tuning all parameters becomes impractical due to the rapidly growing number of model parameters.
We propose the direct Updated Transformation (UT) paradigm, which constructs a transformation directly from the original to the updated parameters.
arXiv Detail & Related papers (2024-09-20T13:42:17Z)
- ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections [59.839926875976225]
We propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections.
In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters.
arXiv Detail & Related papers (2024-05-30T17:26:02Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while using only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- Attention Is Not All You Need Anymore [3.9693969407364427]
We propose a family of drop-in replacements for the self-attention mechanism in the Transformer.
Experimental results show that replacing the self-attention mechanism with the SHE, one of the proposed Extractors, clearly improves the performance of the Transformer.
The proposed Extractors either already run faster than the self-attention mechanism or have the potential to do so.
arXiv Detail & Related papers (2023-08-15T09:24:38Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
- Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers [16.88840622945725]
We develop the Subformer, a parameter efficient Transformer-based model.
Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
arXiv Detail & Related papers (2021-01-01T13:53:22Z)