CS-Mixer: A Cross-Scale Vision MLP Model with Spatial-Channel Mixing
- URL: http://arxiv.org/abs/2308.13363v2
- Date: Sun, 14 Jan 2024 18:58:08 GMT
- Title: CS-Mixer: A Cross-Scale Vision MLP Model with Spatial-Channel Mixing
- Authors: Jonathan Cui, David A. Araujo, Suman Saha, Md. Faisal Kabir
- Abstract summary: We propose a hierarchical Vision MLP that learns dynamic low-rank transformations for spatial-channel mixing through cross-scale local and global aggregation.
Our largest model, CS-Mixer-L, reaches 83.2% top-1 accuracy on ImageNet-1k with 13.7 GFLOPs and 94 M parameters.
- Score: 2.1016271540149636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their simpler information fusion designs compared with Vision
Transformers and Convolutional Neural Networks, Vision MLP architectures have
demonstrated strong performance and high data efficiency in recent research.
However, existing works such as CycleMLP and Vision Permutator typically model
spatial information in equal-size spatial regions and do not consider
cross-scale spatial interactions. Further, their token mixers only model 1- or
2-axis correlations, avoiding 3-axis spatial-channel mixing due to its
computational demands. We therefore propose CS-Mixer, a hierarchical Vision MLP
that learns dynamic low-rank transformations for spatial-channel mixing through
cross-scale local and global aggregation. The proposed methodology achieves
competitive results on popular image recognition benchmarks without incurring
substantially more compute. Our largest model, CS-Mixer-L, reaches 83.2% top-1
accuracy on ImageNet-1k with 13.7 GFLOPs and 94 M parameters.
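The key computational idea — making full 3-axis spatial-channel mixing tractable via a low-rank map — can be illustrated with a minimal NumPy sketch. This is not the paper's actual parameterization (CS-Mixer generates its transformations dynamically from cross-scale aggregation); the function name and shapes here are illustrative assumptions.

```python
import numpy as np

def low_rank_spatial_channel_mix(x, U, V):
    """Mix the flattened spatial-channel axes with a rank-r linear map.

    x: (N, C) token grid; U: (N*C, r); V: (r, N*C).
    A full spatial-channel mixer would need an (N*C) x (N*C) matrix
    with O((N*C)^2) cost; the rank-r factorization costs O(N*C*r).
    """
    flat = x.reshape(-1)        # flatten spatial and channel axes together
    mixed = U @ (V @ flat)      # rank-r map, never forming the full matrix
    return mixed.reshape(x.shape)

# toy example: 4 tokens, 8 channels, rank 2
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
U = rng.standard_normal((32, 2))
V = rng.standard_normal((2, 32))
y = low_rank_spatial_channel_mix(x, U, V)
```

Applying `V` first keeps the intermediate at rank-r size, which is what makes joint mixing over both axes affordable.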
Related papers
- Empowering Snapshot Compressive Imaging: Spatial-Spectral State Space Model with Across-Scanning and Local Enhancement [51.557804095896174]
We introduce a State Space Model with Across-Scanning and Local Enhancement, named ASLE-SSM, that employs a Spatial-Spectral SSM for balanced global-local context encoding and promotes cross-channel interaction.
Experimental results illustrate ASLE-SSM's superiority over existing state-of-the-art methods, with an inference speed 2.4 times faster than the Transformer-based MST while saving 0.12M parameters.
arXiv Detail & Related papers (2024-08-01T15:14:10Z) - DeblurDiNAT: A Generalizable Transformer for Perceptual Image Deblurring [1.5124439914522694]
DeblurDiNAT is a generalizable and efficient encoder-decoder Transformer which restores clean images visually close to the ground truth.
We present a linear feed-forward network and a non-linear dual-stage feature fusion module for faster feature propagation across the network.
arXiv Detail & Related papers (2024-03-19T21:31:31Z) - Superpixel Graph Contrastive Clustering with Semantic-Invariant
Augmentations for Hyperspectral Images [64.72242126879503]
Hyperspectral images (HSI) clustering is an important but challenging task.
We first use 3-D and 2-D hybrid convolutional neural networks to extract the high-order spatial and spectral features of HSI.
We then design a superpixel graph contrastive clustering model to learn discriminative superpixel representations.
arXiv Detail & Related papers (2024-03-04T07:40:55Z) - SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
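The block-diagonal idea described above can be sketched in a few lines of NumPy: split the channels into groups and mix each group with its own weight block, cutting parameters by the number of groups and freeing budget for a larger expansion ratio. This is a hedged illustration of the general structure, not SCHEME's actual implementation; the function name and shapes are assumptions.

```python
import numpy as np

def block_diagonal_ffn(x, blocks):
    """Channel-mixing layer whose weight matrix is block-diagonal.

    x: (tokens, C); blocks: list of g weight blocks, each (C/g, E/g).
    A dense C x E layer has C*E parameters; g diagonal blocks have
    only C*E/g, so the expansion E can grow at the same budget.
    """
    g = len(blocks)
    chunks = np.split(x, g, axis=-1)    # g independent channel groups
    return np.concatenate([c @ W for c, W in zip(chunks, blocks)], axis=-1)

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 64))
blocks = [rng.standard_normal((16, 32)) for _ in range(4)]  # 4 groups, expand 64 -> 128
y = block_diagonal_ffn(x, blocks)
```

Here the dense equivalent would cost 64 × 128 = 8192 weights, while the four blocks cost 4 × 16 × 32 = 2048.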
arXiv Detail & Related papers (2023-12-01T08:22:34Z) - MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose
Estimation in Video [75.23812405203778]
Recent solutions have been introduced to estimate 3D human pose from 2D keypoint sequences by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation.
In addition, the network output is extended from the central frame to the entire frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z) - Mixing and Shifting: Exploiting Global and Local Dependencies in Vision
MLPs [84.3235981545673]
Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks.
We present Mix-Shift-MLP which makes the size of the local receptive field used for mixing increase with respect to the amount of spatial shifting.
MS-MLP achieves competitive performance in multiple vision benchmarks.
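The mixing-and-shifting mechanism can be illustrated with a minimal 1-D NumPy sketch: channel groups are shifted by growing offsets, so a plain channel MLP applied afterwards mixes an increasingly wide receptive field. This is a simplified illustration under assumed shapes, not the paper's 2-D implementation.

```python
import numpy as np

def mix_shift(x, shifts):
    """Shift channel groups of a token sequence by growing offsets.

    x: (N, C) tokens; shifts: one offset per channel group, e.g. [0, 1, 2, 3].
    After shifting, each position's channel vector contains features from
    tokens up to max(shifts) away, so a per-token channel MLP then mixes
    information over that widened local receptive field.
    """
    groups = np.split(x, len(shifts), axis=-1)
    shifted = [np.roll(g, s, axis=0) for g, s in zip(groups, shifts)]
    return np.concatenate(shifted, axis=-1)

rng = np.random.default_rng(3)
x = rng.standard_normal((6, 8))
y = mix_shift(x, [0, 1, 2, 3])   # receptive field grows with the shift amount
```

Because the shift is a pure memory re-indexing, this enlarges the mixing neighborhood without adding parameters.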
arXiv Detail & Related papers (2022-02-14T06:53:48Z) - DynaMixer: A Vision MLP Architecture with Dynamic Mixing [38.23027495545522]
This paper presents an efficient MLP-like network architecture, dubbed DynaMixer, resorting to dynamic information fusion.
We propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed.
Our proposed DynaMixer model (97M parameters) achieves 84.3% top-1 accuracy on the ImageNet-1K, performing favorably against the state-of-the-art vision models.
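The contrast with a static token mixer can be sketched as follows: instead of a fixed N × N matrix as in MLP-Mixer, the mixing weights are generated from the token contents themselves. This is a hedged, simplified illustration of content-dependent mixing; the projection shape and softmax normalization here are assumptions, not DynaMixer's exact procedure.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_token_mix(x, W_gen):
    """Generate an N x N token-mixing matrix from the token contents.

    x: (N, C) tokens; W_gen: (N*C, N*N) generation projection.
    Unlike a static mixing matrix, the weights M depend on the input,
    so different images mix their tokens differently.
    """
    N, _ = x.shape
    logits = (x.reshape(-1) @ W_gen).reshape(N, N)  # content-dependent logits
    M = softmax(logits)                              # row-normalized mixing weights
    return M @ x

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 4))
W_gen = rng.standard_normal((32, 64)) * 0.1
y = dynamic_token_mix(x, W_gen)
```

Each output token is thus a learned, input-conditioned convex combination of all input tokens.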
arXiv Detail & Related papers (2022-01-28T12:43:14Z) - A Battle of Network Structures: An Empirical Study of CNN, Transformer,
and MLP [121.35904748477421]
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z) - Combining Spatial Clustering with LSTM Speech Models for Multichannel
Speech Enhancement [3.730592618611028]
Recurrent neural networks using the LSTM architecture can achieve significant single-channel noise reduction.
It is not obvious, however, how to apply them to multi-channel inputs in a way that can generalize to new microphone configurations.
This paper combines the two approaches to attain both the spatial separation performance and generality of multichannel spatial clustering and the signal modeling performance of LSTM networks.
arXiv Detail & Related papers (2020-12-02T22:37:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.