HyperMixer: An MLP-based Low Cost Alternative to Transformers
- URL: http://arxiv.org/abs/2203.03691v3
- Date: Mon, 13 Nov 2023 16:39:55 GMT
- Title: HyperMixer: An MLP-based Low Cost Alternative to Transformers
- Authors: Florian Mai, Arnaud Pannatier, Fabio Fehr, Haolin Chen, Francois
Marelli, Francois Fleuret, James Henderson
- Abstract summary: We propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks.
In contrast to Transformers, HyperMixer achieves these results at substantially lower costs in terms of processing time, training data, and hyperparameter tuning.
- Score: 12.785548869229052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based architectures are the model of choice for natural language
understanding, but they come at a significant cost, as they have quadratic
complexity in the input length, require a lot of training data, and can be
difficult to tune. In the pursuit of lower costs, we investigate simple
MLP-based architectures. We find that existing architectures such as MLPMixer,
which achieves token mixing through a static MLP applied to each feature
independently, are too detached from the inductive biases required for natural
language understanding. In this paper, we propose a simple variant, HyperMixer,
which forms the token mixing MLP dynamically using hypernetworks. Empirically,
we demonstrate that our model performs better than alternative MLP-based
models, and on par with Transformers. In contrast to Transformers, HyperMixer
achieves these results at substantially lower costs in terms of processing
time, training data, and hyperparameter tuning.
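For intuition, the block below is a minimal PyTorch sketch of hypernetwork-generated token mixing in the spirit of the idea described above. It is an illustrative reconstruction, not the authors' released code: the module name `HyperTokenMixer`, the hypernetwork layout, and all sizes are assumptions. Unlike a static token-mixing MLP, the mixing weights here are produced from the tokens themselves, so they adapt to the input content and to its length.

```python
# Minimal, illustrative sketch of hypernetwork-based token mixing
# (names and sizes are assumptions, not the authors' code).
import torch
import torch.nn as nn


class HyperTokenMixer(nn.Module):
    """Token mixing whose MLP weights are generated from the tokens themselves."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Hypernetworks: small MLPs applied to each token independently,
        # producing one row of the token-mixing weight matrices per token.
        self.hyper_in = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_hidden)
        )
        self.hyper_out = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_hidden)
        )
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, d_model)
        w_in = self.hyper_in(x)    # (batch, num_tokens, d_hidden)
        w_out = self.hyper_out(x)  # (batch, num_tokens, d_hidden)
        # Mix across the token dimension with the generated weights,
        # apply a nonlinearity, then project back to one vector per token.
        hidden = self.act(torch.einsum("bnh,bnd->bhd", w_in, x))  # (batch, d_hidden, d_model)
        return torch.einsum("bnh,bhd->bnd", w_out, hidden)        # (batch, num_tokens, d_model)


tokens = torch.randn(2, 128, 256)                 # 2 sequences, 128 tokens, d_model=256
mixer = HyperTokenMixer(d_model=256, d_hidden=512)
print(mixer(tokens).shape)                        # torch.Size([2, 128, 256])
```

Because the generated weight matrices have one row per token, the cost of this mixing step grows linearly in the number of tokens rather than quadratically as in self-attention.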
Related papers
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z)
- TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting [13.410217680999459]
Transformers have gained popularity in time series forecasting for their ability to capture long-sequence interactions.
High memory and computing requirements pose a critical bottleneck for long-term forecasting.
We propose TSMixer, a lightweight neural architecture composed of multi-layer perceptron (MLP) modules.
arXiv Detail & Related papers (2023-06-14T06:26:23Z)
- iMixer: hierarchical Hopfield network implies an invertible, implicit and iterative MLP-Mixer [2.5782420501870296]
We generalize studies on Hopfield networks and Transformer-like architectures to iMixer.
iMixer is a generalization of MLP-Mixer whose MLP layers propagate forward from the output side to the input side.
We evaluate the model performance with various datasets on image classification tasks.
The results imply that the correspondence between the Hopfield networks and the Mixer models serves as a principle for understanding a broader class of Transformer-like architecture designs.
arXiv Detail & Related papers (2023-04-25T18:00:08Z)
- Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z)
- MLP Architectures for Vision-and-Language Modeling: An Empirical Study [91.6393550858739]
We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion.
We find that without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers.
Instead of heavy multi-head attention, adding tiny one-head attention to MLP encoders is sufficient to achieve comparable performance to transformers.
arXiv Detail & Related papers (2021-12-08T18:26:19Z)
- Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, and the parameters are shared among rows or columns.
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z)
- RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision? [0.0]
CNNs have reigned supreme in the world of computer vision for the past ten years, but recently the Transformer has been on the rise.
In particular, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias.
The proposed model, named RaftMLP, has a good balance of computational complexity, the number of parameters, and actual memory usage.
arXiv Detail & Related papers (2021-08-09T23:55:24Z)
- Pay Attention to MLPs [84.54729425918164]
We show that gMLP can perform as well as Transformers in key language and vision applications.
Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy.
In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
arXiv Detail & Related papers (2021-05-17T17:55:04Z)
- MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models (see the sketch below).
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
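For contrast with the hypernetwork-based token mixing sketched earlier, here is a minimal Mixer-style block with a static token-mixing MLP, roughly in the spirit of MLP-Mixer. It is a sketch under stated assumptions (module name `MixerBlock`, layer sizes, GELU activations), not a released implementation. Because the token-mixing weights are tied to a fixed number of tokens and do not depend on the input, the mixing cannot adapt to content or to variable sequence lengths, which is the gap HyperMixer's hypernetworks aim to close.

```python
# Minimal sketch of a Mixer-style block with a *static* token-mixing MLP
# (names and sizes are illustrative assumptions).
import torch
import torch.nn as nn


class MixerBlock(nn.Module):
    def __init__(self, num_tokens: int, d_model: int, d_token_hidden: int, d_channel_hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Token mixing: a fixed MLP applied across the token dimension,
        # shared by all feature channels; its weights depend on num_tokens.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, d_token_hidden), nn.GELU(), nn.Linear(d_token_hidden, num_tokens)
        )
        # Channel mixing: a fixed MLP applied to each token independently.
        self.channel_mlp = nn.Sequential(
            nn.Linear(d_model, d_channel_hidden), nn.GELU(), nn.Linear(d_channel_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, d_model)
        y = self.norm1(x).transpose(1, 2)          # (batch, d_model, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)  # mix information across tokens
        x = x + self.channel_mlp(self.norm2(x))    # mix information across channels
        return x


block = MixerBlock(num_tokens=128, d_model=256, d_token_hidden=64, d_channel_hidden=512)
print(block(torch.randn(2, 128, 256)).shape)       # torch.Size([2, 128, 256])
```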
This list is automatically generated from the titles and abstracts of the papers on this site.