Efficient Language Modeling with Sparse all-MLP
- URL: http://arxiv.org/abs/2203.06850v2
- Date: Wed, 16 Mar 2022 21:44:06 GMT
- Title: Efficient Language Modeling with Sparse all-MLP
- Authors: Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves
Stoyanov, Xian Li
- Abstract summary: All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparsely activated all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
- Score: 53.81435968051093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: All-MLP architectures have attracted increasing interest as an alternative to
attention-based models. In NLP, recent work like gMLP shows that all-MLPs can
match Transformers in language modeling, but still lag behind in downstream
tasks. In this work, we analyze the limitations of MLPs in expressiveness, and
propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature
and input (token) dimensions. Such sparse all-MLPs significantly increase model
capacity and expressiveness while keeping the compute constant. We address
critical challenges in incorporating conditional computation with two routing
strategies. The proposed sparse all-MLP improves language modeling perplexity
and obtains up to 2$\times$ improvement in training efficiency compared to both
Transformer-based MoEs (GShard, Switch Transformer, Base Layers and HASH
Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its
zero-shot in-context learning performance on six downstream tasks, and find
that it surpasses Transformer-based MoEs and dense Transformers.
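The abstract describes the core mechanism, conditional computation with expert MLPs and learned routing, only in prose. As a rough illustration of that general idea (not the authors' sMLP implementation, whose routing strategies and routed dimensions differ in detail), the PyTorch sketch below routes each token to one of several expert MLPs with a top-1 gate; all class names, shapes, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch only: a sparsely activated MLP block with top-1 token routing
# over expert MLPs. Per-token compute stays roughly constant while capacity
# grows with the number of experts. Names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """One feed-forward expert; the hidden width is a free choice."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class SparseMoEMLP(nn.Module):
    """Generic top-1 MoE routing over expert MLPs (not the paper's exact layer)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [ExpertMLP(d_model, d_hidden) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); flatten tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        gate_probs = F.softmax(self.gate(tokens), dim=-1)  # (num_tokens, num_experts)
        weight, expert_idx = gate_probs.max(dim=-1)        # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Scale by the gate probability so routing remains differentiable.
                out[mask] = weight[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)


# Example: 8 experts, only one evaluated per token.
layer = SparseMoEMLP(d_model=512, d_hidden=2048, num_experts=8)
y = layer(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```

A production MoE layer would also need expert-capacity limits and a load-balancing objective; both are omitted here for brevity.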
Related papers
- MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z)
- NTK-approximating MLP Fusion for Efficient Language Model Fine-tuning [40.994306592119266]
Fine-tuning a pre-trained language model (PLM) emerges as the predominant strategy in many natural language processing applications.
Some general approaches (e.g. quantization and distillation) have been widely studied to reduce the compute/memory of PLM fine-tuning.
We propose to coin a lightweight PLM through NTK-approximating MLP fusion.
arXiv Detail & Related papers (2023-07-18T03:12:51Z)
- MLP Architectures for Vision-and-Language Modeling: An Empirical Study [91.6393550858739]
We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion.
We find that without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers.
Instead of heavy multi-head attention, adding tiny one-head attention to the MLP encoders is sufficient to achieve comparable performance to transformers.
arXiv Detail & Related papers (2021-12-08T18:26:19Z)
- Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, with parameters shared among rows or columns (a minimal sketch of this axial mixing appears after the related-papers list below).
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z)
- ConvMLP: Hierarchical Convolutional MLPs for Vision [7.874749885641495]
We propose ConvMLP: a hierarchical convolutional MLP for visual recognition, built as a light-weight, stage-wise co-design of convolution layers and MLPs.
We show that ConvMLP can be seamlessly transferred and achieve competitive results with fewer parameters.
arXiv Detail & Related papers (2021-09-09T17:52:57Z)
- Sparse-MLP: A Fully-MLP Architecture with Conditional Computation [7.901786481399378]
Mixture-of-Experts (MoE) with sparse conditional computation has proven to be an effective architecture for scaling attention-based models to more parameters with comparable computation cost.
We propose Sparse-MLP, scaling the recent MLP-Mixer model with MoE, to achieve a more compute-efficient architecture.
arXiv Detail & Related papers (2021-09-05T06:43:08Z)
- Pay Attention to MLPs [84.54729425918164]
We show that gMLP can perform as well as Transformers in key language and vision applications.
Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy.
In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
arXiv Detail & Related papers (2021-05-17T17:55:04Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
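As referenced in the sMLPNet entry above, here is a minimal, hypothetical sketch of axial token mixing for 2D image tokens: one linear layer mixes tokens along the width axis (weights shared across rows) and another along the height axis (weights shared across columns). The module name, the residual sum used to fuse the two branches, and the omission of channel mixing are simplifying assumptions, not the published sMLP block.

```python
# Hedged sketch only: axial 1D token mixing for 2D image tokens, with MLP
# weights shared across rows (width mixing) and across columns (height mixing).
# The residual sum used to fuse the branches is a simplifying assumption.
import torch
import torch.nn as nn


class AxialTokenMixer(nn.Module):
    def __init__(self, height: int, width: int):
        super().__init__()
        self.mix_w = nn.Linear(width, width)    # same weights for every row
        self.mix_h = nn.Linear(height, height)  # same weights for every column

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, channels)
        # Mix tokens along the width axis.
        horizontal = self.mix_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)
        # Mix tokens along the height axis.
        vertical = self.mix_h(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x + horizontal + vertical


# Example: a 14x14 grid of 96-dimensional image tokens.
mixer = AxialTokenMixer(height=14, width=14)
out = mixer(torch.randn(2, 14, 14, 96))  # -> (2, 14, 14, 96)
```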