Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems
- URL: http://arxiv.org/abs/2208.08191v1
- Date: Wed, 17 Aug 2022 09:59:22 GMT
- Title: Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems
- Authors: Dan Navon, Alex M. Bronstein
- Abstract summary: We analyze the expressive power of MLP-based architectures in modeling dependencies between multiple inputs simultaneously.
We show an exponential gap between the attention and MLP-based mechanisms.
Our results suggest a theoretical explanation for the inability of MLPs to compete with attention-based mechanisms in NLP problems.
- Score: 8.486025595883117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Transformers are widely used in various vision tasks. Meanwhile, another
line of work, starting with the MLP-Mixer, tries to achieve similar performance
using MLP-based architectures. Interestingly, until now none have been reported
for NLP tasks, and none of these MLP-based architectures has claimed to achieve
state-of-the-art results in vision tasks. In this paper, we analyze the
expressive power of MLP-based architectures in modeling dependencies between
multiple different inputs simultaneously, and show an exponential gap between
the attention and the MLP-based mechanisms. Our results suggest a theoretical
explanation for the inability of MLPs to compete with attention-based mechanisms
in NLP problems. They also suggest that the performance gap in vision tasks may
be due to the relative weakness of MLPs in modeling dependencies between
multiple different locations, and that combining smart input permutations with
MLP architectures may not alone suffice to close the performance gap.
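The contrast the abstract draws can be illustrated with a minimal sketch (illustrative only, not the authors' construction): attention computes input-dependent mixing weights across tokens, while an MLP-Mixer-style token-mixing layer applies one fixed weight matrix regardless of content, so its mixing is the same linear map for every input.

```python
import numpy as np

def attention_mix(X):
    """Single-head self-attention: mixing weights depend on the input X."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # (n, n), content-dependent
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X

def mlp_token_mix(X, W):
    """MLP-Mixer token mixing: W is fixed, independent of the content of X."""
    return W @ X

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
W = rng.normal(size=(4, 4))

# Fixed token mixing is linear in the input; attention is not, because the
# softmax weights themselves change with the content of X.
```

The linearity of the fixed mixing matrix is one concrete face of the expressive gap: a single fixed `W` cannot adapt which tokens it attends to based on what the tokens contain.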
Related papers
- Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion [20.437172251393257]
Inspired by the lateralization of the human brain, we propose a new simple but effective architecture called the Lateralization MLP (L-MLP).
arXiv Detail & Related papers (2024-05-25T07:10:02Z) - Improved Baselines for Data-efficient Perceptual Augmentation of LLMs [66.05826802808177]
In computer vision, large language models (LLMs) can be used to prime vision-language tasks such as image captioning and visual question answering.
We present an experimental evaluation of different interfacing mechanisms, across multiple tasks.
We identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a 4x reduction in training time.
arXiv Detail & Related papers (2024-03-20T10:57:17Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z) - SpArX: Sparse Argumentative Explanations for Neural Networks [Technical Report] [14.787292425343527]
We exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of neural networks (NNs).
Our SpArX method first sparsifies the NN while maintaining as much of the original structure as possible. It then translates the sparsified NN into a QAF, producing global and/or local explanations.
We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of neural networks.
arXiv Detail & Related papers (2023-01-23T17:20:25Z) - Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
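The sparse routing described here can be sketched as a toy top-1 mixture-of-experts layer (an illustrative sketch, not the paper's sMLP implementation; all names are made up for this example): a learned gate assigns each token to one expert MLP, so each token only pays for one expert's compute.

```python
import numpy as np

def moe_layer(x, gate_W, experts):
    """Toy top-1 mixture-of-experts: route each token to one expert MLP."""
    logits = x @ gate_W                      # (n_tokens, n_experts)
    choice = logits.argmax(axis=-1)          # hard top-1 routing per token
    out = np.empty_like(x)
    for e, expert_W in enumerate(experts):
        mask = choice == e
        # only the tokens routed to expert e pass through it (sparse compute)
        out[mask] = np.maximum(x[mask] @ expert_W, 0.0)  # ReLU expert
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))                      # 6 tokens, width 4
gate_W = rng.normal(size=(4, 3))                 # gate over 3 experts
experts = [rng.normal(size=(4, 4)) for _ in range(3)]
y = moe_layer(x, gate_W, experts)
```

Total parameter count grows with the number of experts while per-token FLOPs stay roughly constant, which is the appeal of the sparse all-MLP design.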
arXiv Detail & Related papers (2022-03-14T04:32:19Z) - MAXIM: Multi-Axis MLP for Image Processing [19.192826213493838]
We present a multi-axis based architecture, called MAXIM, that can serve as an efficient general-purpose vision backbone for image processing tasks.
MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs.
Results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks.
arXiv Detail & Related papers (2022-01-09T09:59:32Z) - Rethinking Token-Mixing MLP for MLP-based Vision Backbone [34.47616917228978]
We propose an improved structure termed the Circulant Channel-Specific (CCS) token-mixing MLP, which is spatially invariant and channel-specific.
It takes fewer parameters but achieves higher classification accuracy on ImageNet1K.
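The parameter saving comes from the circulant structure, which can be sketched as follows (an illustrative reconstruction, not the paper's code): a circulant mixing matrix is fully determined by its first row, so it needs n parameters instead of n², and it commutes with cyclic shifts of the token sequence.

```python
import numpy as np

def circulant(first_row):
    """Build an n x n circulant matrix from its first row (n parameters)."""
    n = len(first_row)
    return np.stack([np.roll(first_row, k) for k in range(n)])

def ccs_token_mix(X, rows):
    """Channel-specific circulant token mixing: one circulant per channel.

    X: (n_tokens, n_channels); rows: (n_channels, n_tokens)
    """
    return np.stack([circulant(rows[c]) @ X[:, c]
                     for c in range(X.shape[1])], axis=1)

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))       # 5 tokens, 3 channels
rows = rng.normal(size=(3, 5))    # 3 * 5 parameters instead of 3 * 25
Y = ccs_token_mix(X, rows)
```

Shift-equivariance (mixing a cyclically shifted sequence gives the shifted output) is what "spatially invariant" means here, while using a different circulant per channel keeps the layer channel-specific.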
arXiv Detail & Related papers (2021-06-28T17:59:57Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference costs comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z)