MAXIM: Multi-Axis MLP for Image Processing
- URL: http://arxiv.org/abs/2201.02973v1
- Date: Sun, 9 Jan 2022 09:59:32 GMT
- Title: MAXIM: Multi-Axis MLP for Image Processing
- Authors: Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar,
Alan Bovik, Yinxiao Li
- Abstract summary: We present a multi-axis based architecture, called MAXIM, that can serve as an efficient general-purpose vision backbone for image processing tasks.
MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs.
Results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks.
- Score: 19.192826213493838
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent progress on Transformers and multi-layer perceptron (MLP)
models provides new network architectural designs for computer vision tasks. Although
these models proved to be effective in many vision tasks such as image
recognition, there remain challenges in adapting them for low-level vision. The
inflexibility to support high-resolution images and limitations of local
attention are perhaps the main bottlenecks for using Transformers and MLPs in
image restoration. In this work, we present a multi-axis MLP-based architecture,
called MAXIM, that can serve as an efficient and flexible general-purpose
vision backbone for image processing tasks. MAXIM uses a UNet-shaped
hierarchical structure and supports long-range interactions enabled by
spatially-gated MLPs. Specifically, MAXIM contains two MLP-based building
blocks: a multi-axis gated MLP that allows for efficient and scalable spatial
mixing of local and global visual cues, and a cross-gating block, an
alternative to cross-attention, which accounts for cross-feature mutual
conditioning. Both of these modules are based exclusively on MLPs, yet they also
benefit from being both global and "fully convolutional", two properties that
are desirable for image processing. Our extensive experimental results show
that the proposed MAXIM model achieves state-of-the-art performance on more
than ten benchmarks across a range of image processing tasks, including
denoising, deblurring, deraining, dehazing, and enhancement, while requiring
fewer or comparable parameters and FLOPs compared with competing models.
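The spatially-gated MLP idea the abstract builds on can be sketched roughly as follows. This is a minimal, hypothetical NumPy sketch of a gMLP-style spatial gating unit, not MAXIM's actual multi-axis block; all names, shapes, and initializations are illustrative:

```python
import numpy as np

def spatial_gating(x, w, b):
    """Split channels in half, then gate one half by a learned
    spatial (token-axis) projection of the other half."""
    u, v = np.split(x, 2, axis=-1)  # each: (tokens, channels // 2)
    v = w @ v + b                   # mix information across the token axis
    return u * v                    # element-wise gating

rng = np.random.default_rng(0)
tokens, channels = 16, 8
x = rng.standard_normal((tokens, channels))
w = rng.standard_normal((tokens, tokens)) * 0.01  # near-zero init keeps gating mild
b = np.ones((tokens, 1))                          # bias near 1 ~ identity-like gate
y = spatial_gating(x, w, b)
print(y.shape)  # (16, 4)
```

Because the token-mixing weight `w` has a fixed size tied to the number of tokens, a plain gating unit like this cannot handle arbitrary resolutions; MAXIM's multi-axis design addresses exactly that scalability limitation.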
Related papers
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception.
First, we introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective.
Second, we introduce Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z) - GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization [21.846935203845728]
A local manipulation pipeline is designed, incorporating the powerful SAM, ChatGPT, and generative models.
The GIM dataset has the following advantages: 1) Large scale, including over one million pairs of AI-manipulated images and real images.
We propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, Frequency-Spatial Block (FSB), and a Multi-window Anomalous Modelling (MWAM) Module.
arXiv Detail & Related papers (2024-06-24T11:10:41Z) - X-MLP: A Patch Embedding-Free MLP Architecture for Vision [4.493200639605705]
Multi-layer perceptron (MLP) architectures for vision have recently become popular again.
We propose X-MLP, an architecture constructed absolutely upon fully connected layers and free from patch embedding.
X-MLP is tested on ten benchmark datasets, achieving better performance than other vision models on all of them.
arXiv Detail & Related papers (2023-07-02T15:20:25Z) - Caterpillar: A Pure-MLP Architecture with Shifted-Pillars-Concatenation [72.31517616233695]
The Shifted-Pillars-Concatenation (SPC) module offers superior local modeling power and performance gains.
We build a pure-MLP architecture called Caterpillar by replacing the convolutional layer with the SPC module in a hybrid model of sMLPNet.
arXiv Detail & Related papers (2023-05-28T06:19:36Z) - BiMLP: Compact Binary Architectures for Vision Multi-Layer Perceptrons [37.28828605119602]
This paper studies the problem of designing compact binary architectures for vision multi-layer perceptrons (MLPs).
We find that previous binarization methods perform poorly due to limited capacity of binary samplings.
We propose to improve the performance of binary mixing and channel mixing (BiMLP) model by enriching the representation ability of binary FC layers.
arXiv Detail & Related papers (2022-12-29T02:43:41Z) - Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems [8.486025595883117]
We analyze the expressive power of MLP-based architectures in modeling dependencies between multiple inputs simultaneously.
We show an exponential gap between attention-based and MLP-based mechanisms.
Our results suggest a theoretical explanation for the inability of MLPs to compete with attention-based mechanisms on NLP problems.
arXiv Detail & Related papers (2022-08-17T09:59:22Z) - MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z) - Multi-level Second-order Few-shot Learning [111.0648869396828]
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition.
We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction.
We demonstrate respectable results on standard datasets such as Omniglot, mini-ImageNet, tiered-ImageNet, Open MIC, fine-grained datasets such as CUB Birds, Stanford Dogs and Cars, and action recognition datasets such as HMDB51, UCF101, and mini-MIT.
arXiv Detail & Related papers (2022-01-15T19:49:00Z) - An Image Patch is a Wave: Phase-Aware Vision MLP [54.104040163690364]
The multilayer perceptron (MLP) is a new kind of vision model with an extremely simple architecture built only from stacked fully connected layers.
We propose to represent each token as a wave function with two parts, amplitude and phase.
Experiments demonstrate that the proposed Wave-MLP is superior to the state-of-the-art architectures on various vision tasks.
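The amplitude-and-phase token representation described above can be illustrated with a tiny, hypothetical NumPy sketch (not Wave-MLP's actual formulation; the values are made up): each token is treated as a complex wave z = |z| * e^(i*theta), so tokens with differing phases partially cancel when aggregated.

```python
import numpy as np

# Illustrative amplitudes and phases for three tokens (made-up values).
amplitude = np.array([1.0, 2.0, 0.5])
phase = np.array([0.0, np.pi / 2, np.pi])

wave = amplitude * np.exp(1j * phase)  # z = |z| * e^(i*theta)
aggregated = wave.sum()                # phase differences cause interference

# The aggregate magnitude is below the plain sum of amplitudes,
# because out-of-phase tokens partially cancel.
print(abs(aggregated), amplitude.sum())
```

With all phases equal the aggregation would reduce to a plain amplitude sum; the phase term is what lets the model modulate how tokens combine.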
arXiv Detail & Related papers (2021-11-24T06:25:49Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference costs comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.