Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain
- URL: http://arxiv.org/abs/2405.18679v1
- Date: Wed, 29 May 2024 01:01:19 GMT
- Title: Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain
- Authors: Juntao Zhang, Kun Bian, Peng Cheng, Wenbo An, Jianning Liu, Jun Zhou
- Abstract summary: State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences.
We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains.
- Score: 9.458951424465605
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as Mamba deep learning models, have made significant progress in modeling long sequences for tasks such as language understanding. Building efficient and general-purpose visual backbones based on SSMs is therefore a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM) methods is not yet fully competitive. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies and thereby weakening the model's ability to interpret spatial relationships from a global perspective. We use the Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, enabling ViM to model a unified visual representation in both the frequency and spatial domains. The introduction of frequency-domain information gives ViM a global receptive field during scanning. We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains. Moreover, we question the necessity of position embedding in ViM and remove it in Vim-F, which helps to fully utilize ViM's efficient long-sequence modeling capability. Finally, we redesign the patch embedding for Vim-F, leveraging a convolutional stem to capture more local correlations and further improve performance. Code is available at: https://github.com/yws-wxs/Vim-F
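To make the abstract concrete, here is a minimal PyTorch sketch of its two main ideas: fusing an FFT-derived spectrum with the spatial feature map, and a convolutional-stem patch embedding used without position embeddings. The module names, tensor shapes, and layer choices are illustrative assumptions, not the authors' reference implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class FrequencySpatialFusion(nn.Module):
    """Sketch of adding an FFT-derived spectrum to a spatial feature map.

    Assumes features shaped (B, C, H, W); an illustration of the idea in the
    abstract, not the official Vim-F code.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 2D FFT over the spatial dimensions; take the magnitude spectrum
        # so the result is real-valued and can be summed with the features.
        spectrum = torch.fft.fft2(x, dim=(-2, -1))
        magnitude = torch.abs(torch.fft.fftshift(spectrum, dim=(-2, -1)))
        # Fuse frequency-domain information with the original features.
        return x + magnitude


class ConvStemPatchEmbed(nn.Module):
    """Hypothetical convolutional stem replacing a single large-stride patch
    projection, intended to capture more local correlations."""

    def __init__(self, in_chans: int = 3, embed_dim: int = 192):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.GELU(),
            # Final projection so the overall stride is 16, as in a 16x16 patch embed.
            nn.Conv2d(embed_dim, embed_dim, kernel_size=4, stride=4),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)                     # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, N, embed_dim); no position embedding added
```

In a full model the fused features would then be flattened into a 1D token sequence for the Mamba encoder; using the magnitude of the shifted spectrum is one plausible real-valued choice and is an assumption here.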
Related papers
- Vim4Path: Self-Supervised Vision Mamba for Histopathology Images [9.271739983963458]
This paper proposes leveraging the Vision Mamba (Vim) architecture, inspired by state space models, within the DINO framework for representation learning in computational pathology.
We evaluate the performance of Vim against Vision Transformers (ViT) on the Camelyon16 dataset for both patch-level and slide-level classification.
arXiv Detail & Related papers (2024-04-20T00:44:40Z)
- LocalMamba: Visual State Space Model with Windowed Selective Scan [45.00004931200446]
The key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling.
We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies.
Our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs.
arXiv Detail & Related papers (2024-03-14T12:32:40Z)
- VMamba: Visual State Space Model [92.83984290020891]
VMamba is a vision backbone that operates with linear time complexity.
At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module.
arXiv Detail & Related papers (2024-01-18T17:55:39Z)
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [48.233300343211205]
We propose a new generic vision backbone with bidirectional Mamba blocks (Vim).
Vim marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models.
The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images.
arXiv Detail & Related papers (2024-01-17T18:56:18Z)
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- MEW-UNet: Multi-axis representation learning in frequency domain for medical image segmentation [13.456935850832565]
We propose Multi-axis External Weights UNet (MEW-UNet) for medical image segmentation (MIS) based on the U-shape architecture.
Specifically, our block performs a Fourier transform on the three axes of the input feature and assigns the external weight in the frequency domain.
We evaluate our model on four datasets and achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-10-25T13:22:41Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity (a minimal sketch of this style of frequency-domain filtering appears after this list).
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
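Several of the papers above, Global Filter Networks in particular, share the frequency-domain modeling theme that Vim-F builds on. Below is a minimal, hedged sketch of a GFNet-style global filter layer: a 2D real FFT over the token grid, an element-wise learnable complex filter, and an inverse FFT. The class name, shapes, and initialization are illustrative assumptions rather than the GFNet reference code.

```python
import torch
import torch.nn as nn


class GlobalFilterLayer(nn.Module):
    """Sketch of frequency-domain token mixing in the spirit of GFNet.

    Expects tokens laid out on a 2D grid, shaped (B, H, W, C).
    """

    def __init__(self, height: int, width: int, channels: int):
        super().__init__()
        # rfft2 keeps only width // 2 + 1 frequency columns for real inputs;
        # the trailing dimension of size 2 stores (real, imaginary) parts.
        self.filter = nn.Parameter(
            torch.randn(height, width // 2 + 1, channels, 2) * 0.02
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, h, w, c = x.shape
        freq = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")
        # Element-wise multiplication in the frequency domain acts as a
        # global circular convolution over the whole token grid.
        freq = freq * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(freq, s=(h, w), dim=(1, 2), norm="ortho")
```

Because frequency-domain multiplication corresponds to a convolution spanning the entire grid, the layer mixes tokens globally at O(N log N) cost, which is the log-linear complexity the summary above refers to.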