SplitMixer: Fat Trimmed From MLP-like Models
- URL: http://arxiv.org/abs/2207.10255v2
- Date: Mon, 25 Jul 2022 17:04:19 GMT
- Title: SplitMixer: Fat Trimmed From MLP-like Models
- Authors: Ali Borji and Sikun Lin
- Abstract summary: We present SplitMixer, a simple and lightweight isotropic MLP-like architecture, for visual recognition.
It contains two types of interleaving convolutional operations to mix information across spatial locations (spatial mixing) and channels (channel mixing).
- Score: 53.12472550578278
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present SplitMixer, a simple and lightweight isotropic MLP-like
architecture, for visual recognition. It contains two types of interleaving
convolutional operations to mix information across spatial locations (spatial
mixing) and channels (channel mixing). The first one includes sequentially
applying two depthwise 1D kernels, instead of a 2D kernel, to mix spatial
information. The second one is splitting the channels into overlapping or
non-overlapping segments, with or without shared parameters, and applying our
proposed channel mixing approaches or 3D convolution to mix channel
information. Depending on design choices, a number of SplitMixer variants can
be constructed to balance accuracy, the number of parameters, and speed. We
show, both theoretically and experimentally, that SplitMixer performs on par
with the state-of-the-art MLP-like models while having a significantly lower
number of parameters and FLOPS. For example, without strong data augmentation
and optimization, SplitMixer achieves around 94% accuracy on CIFAR-10 with only
0.28M parameters, while ConvMixer achieves the same accuracy with about 0.6M
parameters. The well-known MLP-Mixer achieves 85.45% with 17.1M parameters. On
CIFAR-100 dataset, SplitMixer achieves around 73% accuracy, on par with
ConvMixer, but with about 52% fewer parameters and FLOPS. We hope that our
results spark further research towards finding more efficient vision
architectures and facilitate the development of MLP-like models. Code is
available at https://github.com/aliborji/splitmixer.
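The two savings described in the abstract can be made concrete with a rough parameter count: factorizing a depthwise k x k spatial kernel into a sequential (k, 1) / (1, k) pair keeps only 2/k of the spatial-mixing parameters, and splitting the channels into s non-overlapping, non-shared segments keeps 1/s of the dense channel-mixing parameters. The sketch below is our own back-of-the-envelope illustration, not the authors' code; the function names are hypothetical, and biases and the specific SplitMixer channel-mixing variants are ignored.

```python
# Back-of-the-envelope parameter counts for the two SplitMixer ideas.
# Illustrative only: names are ours, biases and variant details are omitted.

def depthwise_2d_params(channels: int, k: int) -> int:
    """Depthwise 2D spatial mixing: one k x k kernel per channel."""
    return channels * k * k

def depthwise_1d_pair_params(channels: int, k: int) -> int:
    """Factorized spatial mixing: sequential (k, 1) and (1, k) depthwise kernels."""
    return channels * 2 * k

def pointwise_full_params(channels: int) -> int:
    """Dense channel mixing (1x1 conv): every channel connects to every channel."""
    return channels * channels

def pointwise_split_params(channels: int, segments: int) -> int:
    """Non-overlapping channel segments without shared parameters:
    each segment mixes only its own channels."""
    seg = channels // segments
    return segments * seg * seg

C, K = 256, 7
print(depthwise_2d_params(C, K), depthwise_1d_pair_params(C, K))  # 12544 3584
print(pointwise_full_params(C), pointwise_split_params(C, 2))     # 65536 32768
```

With C = 256 and k = 7, the 1D pair needs 3584 spatial parameters versus 12544 for a full 2D kernel (a factor of 2/k = 2/7), and two channel segments halve the pointwise cost, which is the mechanism behind the roughly 52% parameter reduction the abstract reports against ConvMixer.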
Related papers
- MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding [64.65145700121442]
We introduce MM-Mixing, a multi-modal mixing alignment framework for 3D understanding.
Our proposed two-stage training pipeline combines feature-level and input-level mixing to optimize the 3D encoder.
We demonstrate that MM-Mixing significantly improves baseline performance across various learning scenarios.
arXiv Detail & Related papers (2024-05-28T18:44:15Z) - Mixer is more than just a model [23.309064032922507]
This study focuses on the domain of audio recognition, introducing a novel model named Audio Spectrogram Mixer with Roll-Time and Hermit FFT (ASM-RH)
Experimental results demonstrate that ASM-RH is particularly well-suited for audio data and yields promising outcomes across multiple classification tasks.
arXiv Detail & Related papers (2024-02-28T02:45:58Z) - SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z) - DynaMixer: A Vision MLP Architecture with Dynamic Mixing [38.23027495545522]
This paper presents an efficient MLP-like network architecture, dubbed DynaMixer, resorting to dynamic information fusion.
We propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed.
Our proposed DynaMixer model (97M parameters) achieves 84.3% top-1 accuracy on the ImageNet-1K, performing favorably against the state-of-the-art vision models.
arXiv Detail & Related papers (2022-01-28T12:43:14Z) - Patches Are All You Need? [96.88889685873106]
Although convolutional networks have long been dominant for vision tasks, Vision Transformer (ViT) models may exceed their performance in some settings.
ViTs require the use of patch embeddings, which group together small regions of the image into single input features.
This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
arXiv Detail & Related papers (2022-01-24T16:42:56Z) - PointMixer: MLP-Mixer for Point Cloud Understanding [74.694733918351]
The concept of channel-mixing and token-mixing MLPs achieves noticeable performance in visual recognition tasks.
Unlike images, point clouds are inherently sparse, unordered, and irregular, which limits the direct use of MLP-Mixer for point cloud understanding.
We propose PointMixer, a universal point set operator that facilitates information sharing among unstructured 3D points.
arXiv Detail & Related papers (2021-11-22T13:25:54Z) - Sparse-MLP: A Fully-MLP Architecture with Conditional Computation [7.901786481399378]
Mixture-of-Experts (MoE) with sparse conditional computation has been proved an effective architecture for scaling attention-based models to more parameters with comparable computation cost.
We propose Sparse-MLP, scaling the recent MLP-Mixer model with MoE, to achieve a more efficient architecture.
arXiv Detail & Related papers (2021-09-05T06:43:08Z) - S$^2$-MLP: Spatial-Shift MLP Architecture for Vision [34.47616917228978]
Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation.
In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S$^2$-MLP).
arXiv Detail & Related papers (2021-06-14T15:05:11Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z) - FMix: Enhancing Mixed Sample Data Augmentation [5.820517596386667]
Mixed Sample Data Augmentation (MSDA) has received increasing attention in recent years.
We show that MixUp distorts learned functions in a way that CutMix does not.
We propose FMix, an MSDA that uses random binary masks obtained by applying a threshold to low frequency images.
arXiv Detail & Related papers (2020-02-27T11:46:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.