CS-Mixer: A Cross-Scale Vision MLP Model with Spatial-Channel Mixing
- URL: http://arxiv.org/abs/2308.13363v2
- Date: Sun, 14 Jan 2024 18:58:08 GMT
- Title: CS-Mixer: A Cross-Scale Vision MLP Model with Spatial-Channel Mixing
- Authors: Jonathan Cui, David A. Araujo, Suman Saha, Md. Faisal Kabir
- Abstract summary: We propose CS-Mixer, a hierarchical Vision MLP that learns dynamic low-rank transformations for spatial-channel mixing through cross-scale local and global aggregation.
Our largest model, CS-Mixer-L, reaches 83.2% top-1 accuracy on ImageNet-1k with 13.7 GFLOPs and 94 M parameters.
- Score: 2.1016271540149636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their simpler information fusion designs compared with Vision
Transformers and Convolutional Neural Networks, Vision MLP architectures have
demonstrated strong performance and high data efficiency in recent research.
However, existing works such as CycleMLP and Vision Permutator typically model
spatial information in equal-size spatial regions and do not consider
cross-scale spatial interactions. Further, their token mixers only model 1- or
2-axis correlations, avoiding 3-axis spatial-channel mixing due to its
computational demands. We therefore propose CS-Mixer, a hierarchical Vision MLP
that learns dynamic low-rank transformations for spatial-channel mixing through
cross-scale local and global aggregation. The proposed methodology achieves
competitive results on popular image recognition benchmarks without incurring
substantially more compute. Our largest model, CS-Mixer-L, reaches 83.2% top-1
accuracy on ImageNet-1k with 13.7 GFLOPs and 94 M parameters.
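The core idea, a full spatial-channel token mixer made tractable by input-dependent low-rank factors, can be pictured with a minimal NumPy sketch. The projection weights `w_u`, `w_v`, `w_c`, the rank, and the scaling below are hypothetical stand-ins for learned parameters, not the paper's actual layers:

```python
import numpy as np

def dynamic_low_rank_mix(x, w_u, w_v, w_c):
    """Hedged sketch of dynamic low-rank spatial-channel mixing.

    x: (n, c) tokens (n spatial positions, c channels).
    A dense mixer over the joint spatial-channel axes would need an
    (n*c, n*c) matrix; instead, build an (n, n) spatial mixer as a
    low-rank product of content-dependent factors, then mix channels.
    """
    u = x @ w_u                            # (n, r) factor generated from token contents
    v = x @ w_v                            # (n, r) second content-dependent factor
    mix = (u @ v.T) / np.sqrt(x.shape[1])  # (n, n) dynamic low-rank spatial mixer
    return (mix @ x) @ w_c                 # spatial mixing, then channel mixing

# Toy usage with random stand-in weights (not trained parameters)
rng = np.random.default_rng(0)
n, c, r = 16, 8, 4
x = rng.standard_normal((n, c))
out = dynamic_low_rank_mix(
    x,
    rng.standard_normal((c, r)) / np.sqrt(c),
    rng.standard_normal((c, r)) / np.sqrt(c),
    rng.standard_normal((c, c)) / np.sqrt(c),
)
print(out.shape)  # (16, 8)
```

The point of the factorization is cost: the mixing matrix is generated and applied in O(n·c·r) rather than materializing a full spatial-channel transformation.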
Related papers
- STEAM: Squeeze and Transform Enhanced Attention Module [1.3370933421481221]
We propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers.
STEAM achieves a 2% increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs.
STEAM outperforms leading modules ECA and GCT in terms of accuracy while achieving a three-fold reduction in GFLOPs.
arXiv Detail & Related papers (2024-12-12T07:38:10Z)
- D2-MLP: Dynamic Decomposed MLP Mixer for Medical Image Segmentation [12.470164287197454]
Convolutional neural networks are widely used in various segmentation tasks in medical images.
Due to the inherent locality of convolutional operations, however, they struggle to adaptively learn global features.
We propose a novel Dynamic Decomposed Mixer module to tackle these limitations.
arXiv Detail & Related papers (2024-09-13T15:16:28Z)
- GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model [66.35608254724566]
State-space models (SSMs) have showcased effective performance in modeling long-range dependencies with subquadratic complexity.
However, pure SSM-based models still face challenges related to stability and achieving optimal performance on computer vision tasks.
Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes.
arXiv Detail & Related papers (2024-07-18T17:59:58Z)
- Superpixel Graph Contrastive Clustering with Semantic-Invariant Augmentations for Hyperspectral Images [64.72242126879503]
Hyperspectral images (HSI) clustering is an important but challenging task.
We first use 3-D and 2-D hybrid convolutional neural networks to extract the high-order spatial and spectral features of HSI.
We then design a superpixel graph contrastive clustering model to learn discriminative superpixel representations.
arXiv Detail & Related papers (2024-03-04T07:40:55Z)
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z)
- MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video [75.23812405203778]
Recent solutions have been introduced to estimate 3D human pose from 2D keypoint sequences by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to model inter-joint spatial correlation.
In addition, the network output is extended from the central frame to the entire frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z)
- Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs [84.3235981545673]
Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks.
We present Mix-Shift-MLP which makes the size of the local receptive field used for mixing increase with respect to the amount of spatial shifting.
MS-MLP achieves competitive performance in multiple vision benchmarks.
arXiv Detail & Related papers (2022-02-14T06:53:48Z)
- DynaMixer: A Vision MLP Architecture with Dynamic Mixing [38.23027495545522]
This paper presents an efficient MLP-like network architecture, dubbed DynaMixer, resorting to dynamic information fusion.
We propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed.
Our proposed DynaMixer model (97M parameters) achieves 84.3% top-1 accuracy on the ImageNet-1K, performing favorably against the state-of-the-art vision models.
arXiv Detail & Related papers (2022-01-28T12:43:14Z)
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP [121.35904748477421]
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z)
- Combining Spatial Clustering with LSTM Speech Models for Multichannel Speech Enhancement [3.730592618611028]
Recurrent neural networks using the LSTM architecture can achieve significant single-channel noise reduction.
It is not obvious, however, how to apply them to multi-channel inputs in a way that can generalize to new microphone configurations.
This paper combines the two approaches to attain both the spatial separation performance and generality of multichannel spatial clustering and the signal modeling strength of single-channel LSTM enhancement.
arXiv Detail & Related papers (2020-12-02T22:37:50Z)
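One simple way to picture the combination described above is time-frequency mask fusion: take a mask produced by multichannel spatial clustering and one produced by a single-channel LSTM enhancer, and fuse them. The geometric-mean rule below is an illustrative assumption, not necessarily the paper's actual combination scheme:

```python
import numpy as np

def fuse_masks(spatial_mask, lstm_mask, eps=1e-8):
    """Fuse two time-frequency masks (values in [0, 1]) by geometric
    mean -- an illustrative fusion rule chosen as an assumption here;
    the paper's actual combination may differ."""
    s = np.clip(spatial_mask, eps, 1.0)
    l = np.clip(lstm_mask, eps, 1.0)
    return np.sqrt(s * l)

# Toy masks over 4 frames x 3 frequency bins
spatial = np.full((4, 3), 0.9)   # spatial clustering is confident here
lstm = np.full((4, 3), 0.4)      # LSTM enhancer is less so
fused = fuse_masks(spatial, lstm)
print(fused[0, 0])  # 0.6 = sqrt(0.9 * 0.4)
```

The geometric mean suppresses a bin unless both systems agree it contains speech, one common way to make two mask estimators complement each other.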
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.