EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
- URL: http://arxiv.org/abs/2411.15241v1
- Date: Fri, 22 Nov 2024 02:02:06 GMT
- Title: EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
- Authors: Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim,
- Abstract summary: We introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD)
HSM-SSD layer enables the channel mixing operation within hidden states.
As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k.
- Score: 16.576495786546612
- License:
- Abstract: For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model has emerged as an effective global token interaction with its favorable linear computational cost in the number of tokens. Yet, efficient vision backbones built with SSM have been explored less. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and provide the design alleviating the bottleneck caused by the memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model SHViT with faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works, when scaling images or employing distillation training. Code is available at https://github.com/mlvlab/EfficientViM.
Related papers
- Efficient Self-Supervised Video Hashing with Selective State Spaces [63.83300352372051]
Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval.
We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm.
arXiv Detail & Related papers (2024-12-19T04:33:22Z) - Highly Efficient and Unsupervised Framework for Moving Object Detection in Satellite Videos [0.2023650687546586]
We propose a highly efficient unsupervised framework for SVMOD.
We show that our method can not only process 9 frames per second on 1024x images but also achieve foreground-art performance.
arXiv Detail & Related papers (2024-11-24T16:06:42Z) - CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications.
Our model achieves 83.0%/84.1% top-1 with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - DVMSR: Distillated Vision Mamba for Efficient Super-Resolution [7.551130027327461]
We propose DVMSR, a novel lightweight Image SR network that incorporates Vision Mamba and a distillation strategy.
Our proposed DVMSR can outperform state-of-the-art efficient SR methods in terms of model parameters.
arXiv Detail & Related papers (2024-05-05T17:34:38Z) - MambaUIE&SR: Unraveling the Ocean's Secrets with Only 2.8 GFLOPs [1.7648680700685022]
Underwater Image Enhancement (UIE) techniques aim to address the problem of underwater image degradation due to light absorption and scattering.
Recent years, both Convolution Neural Network (CNN)-based and Transformer-based methods have been widely explored.
MambaUIE is able to efficiently synthesize global and local information and maintains a very small number of parameters with high accuracy.
arXiv Detail & Related papers (2024-04-22T05:12:11Z) - EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba [19.062950348441426]
This work proposes to explore the potential of visual state space models in light-weight model design and introduce a novel efficient model variant dubbed EfficientVMamba.
Our EfficientVMamba integrates a atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features.
Experimental results show that, EfficientVMamba scales down the computational complexity while yields competitive results across a variety of vision tasks.
arXiv Detail & Related papers (2024-03-15T02:48:47Z) - A-SDM: Accelerating Stable Diffusion through Redundancy Removal and
Performance Optimization [54.113083217869516]
In this work, we first explore the computational redundancy part of the network.
We then prune the redundancy blocks of the model and maintain the network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture
with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE)
MoE achieves better accuracy and over 80% reduction computation but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - HALSIE: Hybrid Approach to Learning Segmentation by Simultaneously
Exploiting Image and Event Modalities [6.543272301133159]
Event cameras detect changes in per-pixel intensity to generate asynchronous event streams.
They offer great potential for accurate semantic map retrieval in real-time autonomous systems.
Existing implementations for event segmentation suffer from sub-based performance.
We propose hybrid end-to-end learning framework HALSIE to reduce inference cost by up to $20times$ versus art.
arXiv Detail & Related papers (2022-11-19T17:09:50Z) - Fast and High-Quality Image Denoising via Malleable Convolutions [72.18723834537494]
We present Malleable Convolution (MalleConv), as an efficient variant of dynamic convolution.
Unlike previous works, MalleConv generates a much smaller set of spatially-varying kernels from input.
We also build an efficient denoising network using MalleConv, coined as MalleNet.
arXiv Detail & Related papers (2022-01-02T18:35:20Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.