Dynamic Spectrum Mixer for Visual Recognition
- URL: http://arxiv.org/abs/2309.06721v2
- Date: Fri, 15 Sep 2023 08:39:50 GMT
- Title: Dynamic Spectrum Mixer for Visual Recognition
- Authors: Zhiqiang Hu, Tao Yu
- Abstract summary: We propose a content-adaptive yet computationally efficient structure, dubbed Dynamic Spectrum Mixer (DSM).
DSM represents token interactions in the frequency domain by employing the Discrete Cosine Transform.
It can learn long-term spatial dependencies with log-linear complexity.
- Score: 17.180863898764194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, MLP-based vision backbones have achieved promising performance in
several visual recognition tasks. However, the existing MLP-based methods
directly aggregate tokens with static weights, leaving the adaptability to
different images untouched. Moreover, recent research demonstrates that
MLP-Transformer is great at creating long-range dependencies but ineffective at
catching high frequencies that primarily transmit local information, which
prevents it from applying to the downstream dense prediction tasks, such as
semantic segmentation. To address these challenges, we propose a
content-adaptive yet computationally efficient structure, dubbed Dynamic
Spectrum Mixer (DSM). The DSM represents token interactions in the frequency
domain by employing the Discrete Cosine Transform, which can learn long-term
spatial dependencies with log-linear complexity. Furthermore, a dynamic
spectrum weight generation layer is proposed as the spectrum bands selector,
which could emphasize the informative frequency bands while diminishing others.
As a result, the technique can efficiently learn detailed features from visual
input that contains both high- and low-frequency information. Extensive
experiments show that DSM is a powerful and adaptable backbone for a range of
visual recognition tasks. In particular, DSM outperforms previous
transformer-based and MLP-based models on image classification, object
detection, and semantic segmentation tasks, reaching, for example, 83.8% top-1
accuracy on ImageNet and 49.9% mIoU on ADE20K.
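The abstract describes two steps: tokens are moved into the frequency domain with a Discrete Cosine Transform, and a dynamic weight layer then emphasizes or suppresses frequency bands depending on the input. The sketch below illustrates that idea on a 1-D token sequence. It is not the authors' implementation. The function names, the mean-absolute-value content descriptor, and the per-band `gate_scale` and `gate_bias` parameters are all assumptions made for illustration. The transform is also written in its naive quadratic form for readability, so it does not show the log-linear cost that a fast DCT would give.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence (naive O(N^2) form, for clarity)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def dynamic_spectrum_mix(tokens, gate_scale, gate_bias):
    """Mix tokens in the DCT domain with input-dependent band weights.

    gate_scale and gate_bias are hypothetical per-band parameters that a
    real model would learn; here they are passed in by hand.
    """
    coeffs = dct(tokens)
    # Assumed content descriptor: mean absolute token value of this input.
    context = sum(abs(t) for t in tokens) / len(tokens)
    # One sigmoid gate per frequency band, so the weights depend on the input.
    weights = [
        1.0 / (1.0 + math.exp(-(a * context + b)))
        for a, b in zip(gate_scale, gate_bias)
    ]
    mixed = [w * c for w, c in zip(weights, coeffs)]
    return idct(mixed), weights


if __name__ == "__main__":
    tokens = [1.0, 2.0, 3.0, 4.0]
    # Hand-picked gates that favor low-frequency bands (illustrative only).
    out, weights = dynamic_spectrum_mix(
        tokens, gate_scale=[0.5, 0.5, 0.5, 0.5], gate_bias=[2.0, 0.0, -2.0, -4.0]
    )
    print("band weights:", weights)
    print("mixed tokens:", out)
```

Because every output token is a weighted sum over all DCT coefficients, each one depends on every input token, which is how a frequency-domain mixer obtains a global receptive field. Swapping the naive loops for a fast DCT routine would leave the gating logic unchanged.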
Related papers
- DynaSeg: A Deep Dynamic Fusion Method for Unsupervised Image Segmentation Incorporating Feature Similarity and Spatial Continuity [0.5755004576310334]
We introduce DynaSeg, an innovative unsupervised image segmentation approach.
Unlike traditional methods, DynaSeg employs a dynamic weighting scheme that adapts flexibly to image characteristics.
DynaSeg prevents undersegmentation failures where the number of predicted clusters might converge to one.
arXiv Detail & Related papers (2024-05-09T00:30:45Z)
- Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum [13.81570624162769]
We propose a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework GS-MCC.
First, GS-MCC uses a sliding window to construct a multimodal interaction graph to model conversational relationships.
Then, GS-MCC uses contrastive learning to construct self-supervised signals that reflect complementarity and consistent semantic collaboration.
arXiv Detail & Related papers (2024-04-27T10:47:07Z)
- SpectralMamba: Efficient Mamba for Hyperspectral Image Classification [39.18999103115206]
Recurrent neural networks and Transformers have dominated most applications in hyperspectral (HS) imaging.
We propose SpectralMamba, a novel and efficient deep learning framework for HS image classification that incorporates a state space model.
We show that SpectralMamba surprisingly creates promising win-wins from both performance and efficiency perspectives.
arXiv Detail & Related papers (2024-04-12T14:12:03Z)
- DiffSpectralNet: Unveiling the Potential of Diffusion Models for Hyperspectral Image Classification [6.521187080027966]
We propose a new network called DiffSpectralNet, which combines diffusion and transformer techniques.
First, we use an unsupervised learning framework based on the diffusion model to extract both high-level and low-level spectral-spatial features.
The diffusion method is capable of extracting diverse and meaningful spectral-spatial features, leading to improvement in HSI classification.
arXiv Detail & Related papers (2023-10-29T15:26:37Z)
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
- Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks [15.700048595212051]
We introduce a self-modulating convolutional neural network which utilizes correlated spectral and spatial information.
At the core of the model lies a novel block, which allows the network to transform the features in an adaptive manner based on the adjacent spectral data.
Experimental analysis on both synthetic and real data shows that the proposed SM-CNN outperforms other state-of-the-art HSI denoising methods.
arXiv Detail & Related papers (2023-09-15T06:57:43Z)
- Masked Frequency Modeling for Self-Supervised Visual Pre-Training [102.89756957704138]
We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models.
MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum.
For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
arXiv Detail & Related papers (2022-06-15T17:58:30Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) that extracts action visual tempo from low-level backbone features at a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers [91.09957836250209]
Hyperspectral (HS) images are characterized by approximately contiguous spectral information.
CNNs have been proven to be powerful feature extractors in HS image classification.
We propose a novel backbone network called SpectralFormer for HS image classification.
arXiv Detail & Related papers (2021-07-07T02:59:21Z)
- Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains [69.62456877209304]
We show that passing input points through a simple Fourier feature mapping enables a multilayer perceptron to learn high-frequency functions.
Results shed light on advances in computer vision and graphics that achieve state-of-the-art results.
arXiv Detail & Related papers (2020-06-18T17:59:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.