An Image Patch is a Wave: Phase-Aware Vision MLP
- URL: http://arxiv.org/abs/2111.12294v2
- Date: Thu, 25 Nov 2021 02:49:10 GMT
- Title: An Image Patch is a Wave: Phase-Aware Vision MLP
- Authors: Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Yanxi Li, Chao Xu, Yunhe Wang
- Abstract summary: The multilayer perceptron (MLP) is a new kind of vision model with an extremely simple architecture built only from fully-connected layers.
We propose to represent each token as a wave function with two parts, amplitude and phase.
Experiments demonstrate that the proposed Wave-MLP is superior to the state-of-the-art architectures on various vision tasks.
- Score: 54.104040163690364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Different from traditional convolutional neural networks (CNNs) and
vision transformers, the multilayer perceptron (MLP) is a new kind of vision
model with an extremely simple architecture built only from fully-connected layers. An
input image of vision MLP is usually split into multiple tokens (patches),
while the existing MLP models directly aggregate them with fixed weights,
neglecting the varying semantic information of tokens from different images. To
dynamically aggregate tokens, we propose to represent each token as a wave
function with two parts, amplitude and phase. Amplitude is the original feature
and the phase term is a complex value changing according to the semantic
contents of input images. Introducing the phase term can dynamically modulate
the relationship between tokens and fixed weights in MLP. Based on the
wave-like token representation, we establish a novel Wave-MLP architecture for
vision tasks. Extensive experiments demonstrate that the proposed Wave-MLP is
superior to the state-of-the-art MLP architectures on various vision tasks such
as image classification, object detection and semantic segmentation.
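The wave-like aggregation described in the abstract can be sketched in a few lines: each token keeps its original feature as the amplitude, a content-dependent projection estimates its phase, and fixed mixing weights combine the resulting real and imaginary parts. This is a minimal NumPy sketch of the idea, not the paper's exact phase-aware token-mixing module; the weight matrices and the phase-estimation projection here are illustrative placeholders.

```python
import numpy as np

def phase_aware_mixing(x, w_cos, w_sin, w_theta):
    """Aggregate tokens as waves: amplitude = x, phase estimated from x.

    x:            (n_tokens, dim) token features (amplitudes)
    w_cos, w_sin: (n_tokens, n_tokens) fixed mixing weights for the
                  real and imaginary parts of the wave
    w_theta:      (dim, dim) projection that estimates per-token phases
                  from the token content (illustrative placeholder)
    """
    theta = x @ w_theta                 # phase depends on semantic content
    real = x * np.cos(theta)            # real part of each token's wave
    imag = x * np.sin(theta)            # imaginary part
    return w_cos @ real + w_sin @ imag  # weighted superposition of waves

rng = np.random.default_rng(0)
n_tokens, dim = 4, 8
x = rng.standard_normal((n_tokens, dim))
out = phase_aware_mixing(
    x,
    rng.standard_normal((n_tokens, n_tokens)),
    rng.standard_normal((n_tokens, n_tokens)),
    rng.standard_normal((dim, dim)),
)
print(out.shape)  # (4, 8)
```

Because `theta` is computed from the tokens themselves, two images whose patches carry different semantics are aggregated with effectively different weights, which is the dynamic behavior the fixed-weight MLP mixers lack.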
Related papers
- MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing [123.43419144051703]
We present a novel MLP-like 3D architecture for video recognition.
The results are comparable to those of state-of-the-art, widely used 3D CNNs and video transformers.
arXiv Detail & Related papers (2022-06-13T16:21:33Z)
- ActiveMLP: An MLP-like Architecture with Active Token Mixer [54.95923719553343]
This paper presents ActiveMLP, a general MLP-like backbone for computer vision.
We propose an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate contextual information from other tokens in the global scope into the given one.
In this way, the spatial range of token-mixing is expanded and the way of token-mixing is reformed.
arXiv Detail & Related papers (2022-03-11T17:29:54Z)
- Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information [19.99135128298929]
Fine-grained image classification is a challenging computer vision task where various species share similar visual appearances.
It is helpful to leverage additional information, e.g., the locations and dates at which the images were captured, which is easily accessible but rarely exploited.
We propose a dynamic algorithm on top of the image representation, which interacts with multimodal features at a higher and broader dimension.
arXiv Detail & Related papers (2022-03-07T10:21:59Z)
- Multi-level Second-order Few-shot Learning [111.0648869396828]
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition.
We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction.
We demonstrate respectable results on standard datasets such as Omniglot, mini-ImageNet, tiered-ImageNet, Open MIC, fine-grained datasets such as CUB Birds, Stanford Dogs and Cars, and action recognition datasets such as HMDB51, UCF101, and mini-MIT.
arXiv Detail & Related papers (2022-01-15T19:49:00Z)
- MAXIM: Multi-Axis MLP for Image Processing [19.192826213493838]
We present a multi-axis based architecture, called MAXIM, that can serve as an efficient general-purpose vision backbone for image processing tasks.
MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs.
Results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks.
arXiv Detail & Related papers (2022-01-09T09:59:32Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
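The fixed-weight aggregation that Wave-MLP improves upon is exactly what MLP-Mixer does: one MLP mixes information across tokens (the same weights for every image) and another mixes across channels. A heavily simplified NumPy sketch of one Mixer block, with LayerNorm and the GELU MLPs omitted for brevity, might look like this; the weight matrices are illustrative placeholders, not the published model's parameters.

```python
import numpy as np

def mixer_block(x, w_token, w_channel):
    """One simplified MLP-Mixer block (norms and nonlinearities omitted).

    x:         (n_tokens, dim) patch embeddings
    w_token:   (n_tokens, n_tokens) token-mixing weights, shared across channels
    w_channel: (dim, dim) channel-mixing weights, shared across tokens
    """
    x = x + (w_token @ x)    # token mixing: fixed weights, same for every image
    x = x + (x @ w_channel)  # channel mixing: applied independently per token
    return x

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))
y = mixer_block(x, rng.standard_normal((4, 4)), rng.standard_normal((8, 8)))
print(y.shape)  # (4, 8)
```

Note that `w_token` here is input-independent; contrast this with the phase-aware aggregation above, where the effective mixing varies with the content of each image.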
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.