Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems
- URL: http://arxiv.org/abs/2208.08191v1
- Date: Wed, 17 Aug 2022 09:59:22 GMT
- Title: Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems
- Authors: Dan Navon, Alex M. Bronstein
- Abstract summary: We analyze the expressive power of MLP-based architectures in modeling dependencies between multiple inputs simultaneously.
We show an exponential gap between the attention-based and MLP-based mechanisms.
Our results suggest a theoretical explanation for the inability of MLPs to compete with attention-based mechanisms in NLP problems.
- Score: 8.486025595883117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Transformers are widely used in various vision tasks. Meanwhile, another line of work, starting with the MLP-Mixer, tries to achieve similar performance using MLP-based architectures. Interestingly, none of these architectures has so far been reported for NLP tasks, and none has claimed state-of-the-art results in vision tasks. In this paper, we analyze the expressive power of MLP-based architectures in modeling dependencies between multiple different inputs simultaneously, and show an exponential gap between the attention-based and MLP-based mechanisms. Our results suggest a theoretical explanation for the inability of MLPs to compete with attention-based mechanisms in NLP problems. They also suggest that the performance gap in vision tasks may be due to the relative weakness of MLPs in modeling dependencies between multiple different locations, and that combining smart input permutations with MLP architectures may not suffice on its own to close the performance gap.
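The contrast the abstract draws between the two mechanisms can be made concrete: attention computes input-dependent mixing weights across tokens, whereas a Mixer-style token-mixing layer applies one fixed learned matrix over positions. Below is a minimal NumPy sketch of that difference; it is an illustration only (single head, toy dimensions, no normalization or skip connections), not code or notation from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy setting: a sequence of n tokens, each a d-dimensional vector.
n, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))

# Attention-style token mixing (single head, no masking): the (n, n) mixing
# weights are recomputed from X itself, so they are input-dependent and can
# route information differently for every input.
Wq, Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d))
A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))   # (n, n), depends on X
attn_out = A @ (X @ Wv)

# Mixer-style token mixing (a single token-mixing layer): the (n, n) mixing
# matrix is a fixed learned parameter, applied identically to every input.
W_tok = rng.normal(size=(n, n))
mixer_out = W_tok @ X

print(attn_out.shape, mixer_out.shape)  # (8, 16) (8, 16)
```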
Related papers
- evMLP: An Efficient Event-Driven MLP Architecture for Vision [0.0]
We present evMLP, accompanied by an event-driven local update mechanism.
evMLP can independently process patches on images or feature maps via events.
It attains accuracy competitive with state-of-the-art models.
arXiv Detail & Related papers (2025-07-02T17:36:50Z) - FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities [76.46448367752944]
Multimodal large language models (MLLMs) unify visual understanding and image generation within a single framework.
Most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development.
We introduce FUDOKI, a unified multimodal model purely based on discrete flow matching.
arXiv Detail & Related papers (2025-05-26T15:46:53Z) - SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories [52.57696897619189]
We introduce the Human-Like Mask Modeling Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools.
HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens.
HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task.
arXiv Detail & Related papers (2025-03-11T17:08:54Z) - SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation.
We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding.
Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z) - EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment [39.870809905905325]
We propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA) to extract fine-grained visual information.
Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference.
arXiv Detail & Related papers (2024-10-08T11:41:55Z) - Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion [20.437172251393257]
Inspired by the lateralization of the human brain, we propose a new simple but effective architecture called the Lateralization MLP (L-MLP).
arXiv Detail & Related papers (2024-05-25T07:10:02Z) - MLPs Learn In-Context on Regression and Classification Tasks [28.13046236900491]
In-context learning (ICL) is often assumed to be a unique hallmark of Transformer models.
We demonstrate that multi-layer perceptrons (MLPs) can also learn in-context.
Results highlight the unexpected competence of MLPs in a synthetic setting.
arXiv Detail & Related papers (2024-05-24T15:04:36Z) - Improved Baselines for Data-efficient Perceptual Augmentation of LLMs [66.05826802808177]
In computer vision, large language models (LLMs) can be used to prime vision-language tasks such as image captioning and visual question answering.
We present an experimental evaluation of different interfacing mechanisms, across multiple tasks.
We identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a 4x reduction in training time.
arXiv Detail & Related papers (2024-03-20T10:57:17Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z) - SpArX: Sparse Argumentative Explanations for Neural Networks [Technical Report] [14.787292425343527]
We exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of neural networks (NNs).
Our SpArX method first sparsifies the MLP while maintaining as much of the original structure as possible. It then translates the sparsified MLP into an equivalent QAF, producing global and/or local explanations.
We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of neural networks.
arXiv Detail & Related papers (2023-01-23T17:20:25Z) - Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z) - MAXIM: Multi-Axis MLP for Image Processing [19.192826213493838]
We present a multi-axis based architecture, called MAXIM, that can serve as an efficient general-purpose vision backbone for image processing tasks.
MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs.
Results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks.
arXiv Detail & Related papers (2022-01-09T09:59:32Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
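For reference, the MLP-Mixer layer summarized in the entry above alternates a token-mixing MLP over positions with a channel-mixing MLP over features, each wrapped in layer normalization and a skip connection. The sketch below is a simplified NumPy rendering under assumed toy dimensions; the hidden widths, initialization, and exact placement of the norms are illustrative assumptions, not the paper's published code.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mixer_block(X, W1_tok, W2_tok, W1_ch, W2_ch):
    """One simplified Mixer layer: a token-mixing MLP over positions, then a
    channel-mixing MLP over features, each with layer norm and a skip."""
    # Token mixing: the MLP acts across the n positions for every channel.
    Y = X + (W2_tok @ gelu(W1_tok @ layer_norm(X)))
    # Channel mixing: the MLP acts across the d channels for every position.
    return Y + (gelu(layer_norm(Y) @ W1_ch) @ W2_ch)

# Assumed toy sizes: n tokens (patches), d channels, hidden widths ht and hc.
n, d, ht, hc = 8, 16, 32, 64
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W1_tok, W2_tok = rng.normal(size=(ht, n)), rng.normal(size=(n, ht))
W1_ch, W2_ch = rng.normal(size=(d, hc)), rng.normal(size=(hc, d))
print(mixer_block(X, W1_tok, W2_tok, W1_ch, W2_ch).shape)  # (8, 16)
```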
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.