Sparse Universal Transformer
- URL: http://arxiv.org/abs/2310.07096v1
- Date: Wed, 11 Oct 2023 00:38:57 GMT
- Title: Sparse Universal Transformer
- Authors: Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan
- Abstract summary: The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.
This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism.
- Score: 64.78045820484299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Universal Transformer (UT) is a variant of the Transformer that shares
parameters across its layers. Empirical evidence shows that UTs have better
compositional generalization than Vanilla Transformers (VTs) in formal language
tasks. The parameter-sharing also affords it better parameter efficiency than
VTs. Despite its many advantages, scaling UT parameters is much more compute
and memory intensive than scaling up a VT. This paper proposes the Sparse
Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE)
and a new stick-breaking-based dynamic halting mechanism to reduce UT's
computation complexity while retaining its parameter efficiency and
generalization ability. Experiments show that SUT achieves the same performance
as strong baseline models while using only half the computation and parameters on
WMT'14, and strong generalization results on formal language tasks (Logical
inference and CFQ). The new halting mechanism also enables around a 50%
reduction in computation during inference with very little performance decrease
on formal language tasks.
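Code sketch (not from the paper): a minimal PyTorch illustration of the two ingredients the abstract names, a Sparse Mixture-of-Experts feed-forward block with top-k routing and a stick-breaking dynamic halting rule wrapped around one shared Transformer block. The module names, dimensions, top-2 router, and 0.99 halting threshold below are assumptions for illustration only.

```python
# A minimal sketch, assuming PyTorch, of the two ideas named in the abstract:
# (1) a Sparse Mixture-of-Experts (SMoE) feed-forward block with top-k routing,
# (2) stick-breaking dynamic halting around a single shared (Universal
#     Transformer style) block.
# All names, sizes, and thresholds are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SMoEFeedForward(nn.Module):
    """Each token is routed to its top-k expert FFNs; only those experts run."""

    def __init__(self, d_model=256, d_hidden=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (batch, seq, d_model)
        gate = F.softmax(self.router(x), dim=-1)            # routing distribution
        top_w, top_i = gate.topk(self.k, dim=-1)            # keep the top-k experts
        top_w = top_w / top_w.sum(-1, keepdim=True)         # renormalise their gates
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[..., slot] == e                # tokens sent to expert e
                if mask.any():
                    out[mask] += top_w[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


class StickBreakingUT(nn.Module):
    """Apply one shared block repeatedly; weight step outputs by stick-breaking."""

    def __init__(self, d_model=256, max_steps=6, halt_threshold=0.99):
        super().__init__()
        self.max_steps = max_steps
        self.halt_threshold = halt_threshold
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.ffn = SMoEFeedForward(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.halt = nn.Linear(d_model, 1)                   # per-token halting logit

    def forward(self, x):                                   # x: (batch, seq, d_model)
        remaining = x.new_ones(x.shape[:2])                 # unbroken part of the stick
        output = torch.zeros_like(x)
        for t in range(self.max_steps):
            h, _ = self.attn(x, x, x)
            x = self.norm1(x + h)
            x = self.norm2(x + self.ffn(x))
            alpha = torch.sigmoid(self.halt(x)).squeeze(-1)  # halting prob. at step t
            # Stick-breaking weights: p_t = alpha_t * prod_{s<t} (1 - alpha_s);
            # the final step absorbs the leftover mass so the weights sum to 1.
            p_t = alpha * remaining if t < self.max_steps - 1 else remaining
            output = output + p_t.unsqueeze(-1) * x
            remaining = remaining * (1 - alpha)
            # Early exit once (almost) all of the halting mass has been assigned.
            if remaining.max() < 1 - self.halt_threshold:
                break
        return output


if __name__ == "__main__":
    model = StickBreakingUT()
    print(model(torch.randn(2, 10, 256)).shape)             # torch.Size([2, 10, 256])
```

At inference, this kind of early exit is where the abstract's roughly 50% computation saving would come from; SUT's actual halting parameterization and expert routing may differ from this sketch.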
Related papers
- Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking [51.154226183713405]
We propose Inner Thinking Transformer, which reimagines layer computations as implicit thinking steps.
ITT achieves 96.5% of the performance of a 466M-parameter Transformer using only 162M parameters, reduces training data requirements by 43.2%, and outperforms Transformer/Loop variants on 11 benchmarks.
arXiv Detail & Related papers (2025-02-19T16:02:23Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- HUT: A More Computation Efficient Fine-Tuning Method With Hadamard Updated Transformation [6.954348219088321]
Fine-tuning pre-trained language models for downstream tasks has achieved impressive results in NLP.
Fine-tuning all parameters becomes impractical due to the rapidly increasing number of model parameters.
We propose the direct Updated Transformation (UT) paradigm, which constructs a transformation directly from the original to the updated parameters.
arXiv Detail & Related papers (2024-09-20T13:42:17Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Attention Is Not All You Need Anymore [3.9693969407364427]
We propose a family of drop-in replacements for the self-attention mechanism in the Transformer.
Experimental results show that replacing the self-attention mechanism with the SHE improves the performance of the Transformer.
The proposed Extractors either run faster than the self-attention mechanism or have the potential to do so.
arXiv Detail & Related papers (2023-08-15T09:24:38Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
- Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers [16.88840622945725]
We develop the Subformer, a parameter efficient Transformer-based model.
Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
arXiv Detail & Related papers (2021-01-01T13:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.