Related papers: Distillation-free Scaling of Large SSMs for Images and Videos

Distillation-free Scaling of Large SSMs for Images and Videos

URL: http://arxiv.org/abs/2409.11867v1
Date: Wed, 18 Sep 2024 10:48:10 GMT
Title: Distillation-free Scaling of Large SSMs for Images and Videos
Authors: Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall,
Abstract summary: State-space models (SSMs) have introduced a novel context modeling method by integrating state-space techniques into deep learning. Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. We propose a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance.
Score: 27.604572990625144
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.

Related papers

Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling.<n>Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations.<n>We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
arXiv Detail & Related papers (2025-06-22T19:26:55Z)
DefMamba: Deformable Visual State Space Model [65.50381013020248]
We propose a novel visual foundation model called DefMamba. By combining a deformable scanning(DS) strategy, this model significantly improves its ability to learn image structures and detects changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance in various visual tasks.
arXiv Detail & Related papers (2025-04-08T08:22:54Z)
RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing [28.488986896516284]
RoMA is a framework that enables scalable self-supervised pretraining of RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy. experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency.
arXiv Detail & Related papers (2025-03-13T14:09:18Z)
Detail Matters: Mamba-Inspired Joint Unfolding Network for Snapshot Spectral Compressive Imaging [40.80197280147993]
We propose a Mamba-inspired Joint Unfolding Network (MiJUN) to overcome the inherent nonlinear and ill-posed characteristics of HSI reconstruction. We introduce an accelerated unfolding network scheme, which reduces the reliance on initial optimization stages. We refine the scanning strategy with Mamba by integrating the tensor mode-$k$ unfolding into the Mamba network.
arXiv Detail & Related papers (2025-01-02T13:56:23Z)
Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement [54.427965535613886]
Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision. In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for SE tasks.
arXiv Detail & Related papers (2024-12-21T13:43:51Z)
Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning [54.19222454702032]
Continual Learning aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge. State Space Models (SSMs) have achieved notable success in computer vision. We introduce Mamba-CL, a framework that continuously fine-tunes the core SSMs of the large-scale Mamba foundation model.
arXiv Detail & Related papers (2024-11-23T06:36:16Z)
V2M: Visual 2-Dimensional Mamba for Image Representation Learning [68.51380287151927]
Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences. Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence. We propose a Visual 2-Dimensional Mamba model as a complete solution, which directly processes image tokens in the 2D space.
arXiv Detail & Related papers (2024-10-14T11:11:06Z)
Hi-Mamba: Hierarchical Mamba for Efficient Image Super-Resolution [42.259283231048954]
State Space Models (SSM) have shown strong representation ability in modeling long-range dependency with linear complexity. We propose a novel Hierarchical Mamba network, namely, Hi-Mamba, for image super-resolution (SR)
arXiv Detail & Related papers (2024-10-14T04:15:04Z)
Scalable Autoregressive Image Generation with Mamba [23.027439743155192]
We introduce AiM, an autoregressive (AR) image generative model based on Mamba architecture. Mamba is a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B.
arXiv Detail & Related papers (2024-08-22T09:27:49Z)
Bidirectional Gated Mamba for Sequential Recommendation [56.85338055215429]
Mamba, a recent advancement, has exhibited exceptional performance in time series prediction. We introduce a new framework named Selective Gated Mamba ( SIGMA) for Sequential Recommendation. Our results indicate that SIGMA outperforms current models on five real-world datasets.
arXiv Detail & Related papers (2024-08-21T09:12:59Z)
Cross-Scan Mamba with Masked Training for Robust Spectral Imaging [51.557804095896174]
We propose the Cross-Scanning Mamba, named CS-Mamba, that employs a Spatial-Spectral SSM for global-local balanced context encoding. Experiment results show that our CS-Mamba achieves state-of-the-art performance and the masked training method can better reconstruct smooth features to improve the visual quality.
arXiv Detail & Related papers (2024-08-01T15:14:10Z)
GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model [66.35608254724566]
State-space models (SSMs) have showcased effective performance in modeling long-range dependencies with subquadratic complexity. However, pure SSM-based models still face challenges related to stability and achieving optimal performance on computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes.
arXiv Detail & Related papers (2024-07-18T17:59:58Z)
MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. We conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
arXiv Detail & Related papers (2024-07-10T23:02:45Z)
RSMamba: Remote Sensing Image Classification with State Space Model [25.32283897448209]
We introduce RSMamba, a novel architecture for remote sensing image classification. RSMamba is based on the State Space Model (SSM) and incorporates an efficient, hardware-aware design known as the Mamba. We propose a dynamic multi-path activation mechanism to augment Mamba's capacity to model non-temporal image data.
arXiv Detail & Related papers (2024-03-28T17:59:49Z)
PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition [21.761988930589727]
PlainMamba is a simple non-hierarchical state space model (SSM) designed for general visual recognition. We adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks.
arXiv Detail & Related papers (2024-03-26T13:35:10Z)
Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining [85.08169822181685]
This paper introduces a novel Mamba-based model, Swin-UMamba, designed specifically for medical image segmentation tasks. Swin-UMamba demonstrates superior performance with a large margin compared to CNNs, ViTs, and latest Mamba-based models.
arXiv Detail & Related papers (2024-02-05T18:58:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.