PairConnect: A Compute-Efficient MLP Alternative to Attention
- URL: http://arxiv.org/abs/2106.08235v1
- Date: Tue, 15 Jun 2021 15:39:45 GMT
- Title: PairConnect: A Compute-Efficient MLP Alternative to Attention
- Authors: Zhaozhuo Xu, Minghao Yan, Junyan Zhang, Anshumali Shrivastava
- Abstract summary: We show a memory-heavy but significantly more compute-efficient alternative to the Transformer.
Our proposal, denoted PairConnect, models the pairwise interaction between words with explicit pairwise word embeddings.
Our experiments on language modeling suggest that PairConnect can achieve results comparable to the Transformer while significantly reducing the computational cost of inference.
- Score: 31.659580535552184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have demonstrated superior performance in natural language
processing. The dot product self-attention in the Transformer allows us to model
interactions between words. However, this modeling comes with significant
computational overhead. In this work, we revisit the memory-compute trade-off
associated with the Transformer, particularly multi-head attention, and show a
memory-heavy but significantly more compute-efficient alternative to the
Transformer. Our proposal, denoted PairConnect, is a multilayer perceptron (MLP)
that models the pairwise interaction between words with explicit pairwise word
embeddings. As a result, PairConnect replaces the self-attention dot product
with a simple embedding lookup. We show mathematically that, despite being an
MLP, our compute-efficient PairConnect is strictly more expressive than the
Transformer. Our experiments on language modeling tasks suggest that PairConnect
can achieve results comparable to the Transformer while significantly reducing
the computational cost of inference.
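To make the pairwise-embedding-lookup idea concrete, here is a minimal sketch. It is an illustrative rendering under our own assumptions (the class name `PairConnectSketch`, the dense V x V table, and the mean-pooling aggregation are ours, not the paper's exact architecture): no query-key dot products are computed; each ordered token pair simply indexes an explicit embedding.

```python
# Minimal sketch of the PairConnect idea: replace the dot-product attention
# score between tokens (i, j) with a direct lookup into an explicit pairwise
# embedding table. Names, sizes, and the mean-pooling aggregation below are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class PairConnectSketch(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.vocab_size = vocab_size
        # One embedding per ordered token pair: O(V^2) rows, the memory-heavy
        # side of the trade-off, in exchange for O(1) "attention" per pair.
        self.pair_emb = nn.Embedding(vocab_size * vocab_size, d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) of token ids
        # Index of the ordered pair (i, j) in the flattened V x V table.
        pair_idx = tokens.unsqueeze(2) * self.vocab_size + tokens.unsqueeze(1)  # (b, n, n)
        pair_vecs = self.pair_emb(pair_idx)   # (b, n, n, d): pure lookup, no dot products
        context = pair_vecs.mean(dim=2)       # aggregate each token's pairwise interactions
        return self.mlp(context)              # (b, n, d)

# Usage sketch:
# out = PairConnectSketch(vocab_size=1000, d_model=64)(torch.randint(0, 1000, (2, 16)))
```

The table grows as O(V^2) in the vocabulary size, which is the memory-heavy side of the trade-off described in the abstract; inference, however, reduces to embedding lookups followed by an MLP.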
Related papers
- Comateformer: Combined Attention Transformer for Semantic Sentence Matching [11.746010399185437]
We propose a novel semantic sentence matching model named Combined Attention Network based on the Transformer model (Comateformer).
In the Comateformer model, we design a novel transformer-based quasi-attention mechanism with compositional properties.
Our proposed approach builds on the intuition of similarity and dissimilarity (negative affinity) when calculating dual affinity scores.
arXiv Detail & Related papers (2024-12-10T06:18:07Z)
- Understanding Factual Recall in Transformers via Associative Memories [55.93756571457904]
We show that shallow transformers can use a combination of associative memories to obtain near-optimal storage capacity (a generic associative-memory sketch follows the list below).
We show that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on a factual recall task.
arXiv Detail & Related papers (2024-12-09T14:48:14Z)
- MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers [43.39466934693055]
We present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective.
This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers.
We conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.
arXiv Detail & Related papers (2024-11-20T02:41:53Z)
- ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition [5.311735227179715]
We explore and devise a novel ConvMixFormer architecture for dynamic hand gestures.
The proposed method is evaluated on NVidia Dynamic Hand Gesture and Briareo datasets.
Our model has achieved state-of-the-art results on single and multimodal inputs.
arXiv Detail & Related papers (2024-11-11T16:45:18Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind on downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions (a generic MoE routing sketch follows the list below).
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z)
- Fastformer: Additive Attention Can Be All You Need [51.79399904527525]
We propose Fastformer, an efficient Transformer model based on additive attention.
In Fastformer, instead of modeling the pairwise interactions between tokens, we first use an additive attention mechanism to model global contexts (a minimal sketch of this step follows the list below).
In this way, Fastformer can achieve effective context modeling with linear complexity.
arXiv Detail & Related papers (2021-08-20T09:44:44Z)
- Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers [16.88840622945725]
We develop the Subformer, a parameter efficient Transformer-based model.
Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
arXiv Detail & Related papers (2021-01-01T13:53:22Z)
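For the entry on factual recall via associative memories above, the following is a generic, textbook linear associative memory (outer-product storage with a linear readout). It is offered only as intuition for the term; it is not the paper's construction, and all names are illustrative.

```python
# Generic linear associative memory: store (key, value) pairs as a sum of
# outer products, recall a value by multiplying the memory matrix with a key.
# This is a textbook illustration, not the construction analysed in the paper.
import torch

def store(keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    # keys, values: (num_facts, d). W accumulates one outer product per fact.
    return values.T @ keys                      # (d, d)

def recall(W: torch.Tensor, key: torch.Tensor) -> torch.Tensor:
    # With near-orthogonal unit keys, W @ key is approximately the stored value.
    return W @ key

d, num_facts = 256, 32
keys = torch.nn.functional.normalize(torch.randn(num_facts, d), dim=-1)
values = torch.randn(num_facts, d)
W = store(keys, values)
approx = recall(W, keys[0])                     # roughly values[0], plus crosstalk
```

With roughly orthogonal keys the readout recovers each stored value up to crosstalk from the other pairs, which is what limits storage capacity.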
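For the sparse all-MLP entry above, the sketch below shows a generic top-1 routed mixture-of-experts layer over small MLP experts. It illustrates the token-routing idea only; the paper applies MoE in both the feature and input (token) dimensions, which a single layer like this does not capture, and every name and size here is an assumption.

```python
# Generic top-1 routed mixture-of-experts layer over MLP experts, for intuition
# about sparse all-MLP models. Illustrative only; not the paper's architecture.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)      # router over experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token is sent to its single best expert.
        scores = torch.softmax(self.gate(x), dim=-1)     # (tokens, experts)
        best = scores.argmax(dim=-1)                     # (tokens,)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = best == e
            if mask.any():
                # Scale by the gate probability so routing stays differentiable.
                out[mask] = expert(x[mask]) * scores[mask, e].unsqueeze(-1)
        return out

# Usage sketch:
# y = Top1MoE(d_model=64, num_experts=4, d_hidden=256)(torch.randn(32, 64))
```

Only one expert runs per token, so compute per token stays roughly constant while total parameter count grows with the number of experts.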
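For the Fastformer entry above, the sketch below shows the core additive-attention step that summarizes a sequence into a single global context vector in linear time. It covers only that step, not the full Fastformer block, and the names are illustrative.

```python
# Core of additive attention for global context modelling: each token gets a
# scalar score from a learned vector, a softmax over the sequence gives
# weights, and the weighted sum is one global vector -- O(n) cost rather than
# the O(n^2) of pairwise dot-product attention. Illustrative sketch only.
import torch
import torch.nn as nn

class AdditiveGlobalContext(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)      # learned scoring vector

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        alpha = torch.softmax(self.score(x).squeeze(-1), dim=-1)  # (b, n)
        return torch.einsum("bn,bnd->bd", alpha, x)               # global context (b, d)

# Usage sketch:
# g = AdditiveGlobalContext(64)(torch.randn(2, 128, 64))
```

Because each token contributes one scalar score and one weighted vector, the cost grows linearly with sequence length.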
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.