Related papers: ZigMa: A DiT-style Zigzag Mamba Diffusion Model

URL: http://arxiv.org/abs/2403.13802v3
Date: Sun, 24 Nov 2024 14:25:05 GMT
Title: ZigMa: A DiT-style Zigzag Mamba Diffusion Model
Authors: Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Schusterbauer, Björn Ommer
Abstract summary: We aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. We introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines. We integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets.
Score: 22.68317748373856
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ $1024\times 1024$ and UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$. Code will be released at https://taohu.me/zigma/
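
The spatial-continuity point in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`raster_order`, `zigzag_order`, `max_step`) are hypothetical, and the scan schemes actually used in the paper may differ. It only compares a plain row-major scan of a patch grid with a single zigzag path, and measures the largest spatial jump between consecutive tokens in each ordering.

```python
# Illustrative sketch only: all names here are hypothetical, and this is
# one possible continuous scan path, not the paper's actual scan schemes.

def raster_order(height, width):
    """Row-major scan: every row is read left to right."""
    return [(r, c) for r in range(height) for c in range(width)]


def zigzag_order(height, width):
    """Zigzag scan: even rows left to right, odd rows right to left."""
    order = []
    for r in range(height):
        cols = range(width) if r % 2 == 0 else range(width - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order


def max_step(order):
    """Largest Manhattan distance between consecutive positions in a scan."""
    return max(
        abs(r1 - r0) + abs(c1 - c0)
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


if __name__ == "__main__":
    print("raster max step:", max_step(raster_order(4, 4)))
    print("zigzag max step:", max_step(zigzag_order(4, 4)))
```

In this toy setting, the row-major scan jumps across the full grid width at the end of every row, while consecutive tokens in the zigzag path are always spatial neighbours.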

Related papers

DefMamba: Deformable Visual State Space Model [65.50381013020248]
We propose a novel visual foundation model called DefMamba. By combining a deformable scanning (DS) strategy, this model significantly improves its ability to learn image structures and detect changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance in various visual tasks.
arXiv Detail & Related papers (2025-04-08T08:22:54Z)
RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing [47.536214063122515]
RoMA is a framework that enables scalable self-supervised pretraining of RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy. Experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency.
arXiv Detail & Related papers (2025-03-13T14:09:18Z)
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. We propose the MobileMamba framework, which balances efficiency and performance. MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning [54.19222454702032]
Continual Learning aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge. State Space Models (SSMs) have achieved notable success in computer vision. We introduce Mamba-CL, a framework that continuously fine-tunes the core SSMs of the large-scale Mamba foundation model.
arXiv Detail & Related papers (2024-11-23T06:36:16Z)
KMM: Key Frame Mask Mamba for Extended Motion Generation [21.144913854895243]
KMM is a novel architecture featuring Key frame Masking Modeling, designed to enhance Mamba's focus on key actions in motion segments. We conduct extensive experiments on the go-to dataset, BABEL, achieving state-of-the-art performance with a reduction of more than 57% in FID and 70% in parameters compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2024-11-10T14:41:38Z)
MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation [63.73137438677585]
MaskMamba is a novel hybrid model that combines Mamba and Transformer architectures. It achieves a remarkable $54.44\%$ improvement in inference speed at a resolution of $2048\times 2048$ over Transformer.
arXiv Detail & Related papers (2024-09-30T04:28:55Z)
LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity. Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution. Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. We conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
arXiv Detail & Related papers (2024-07-10T23:02:45Z)
Mamba YOLO: A Simple Baseline for Object Detection with State Space Model [10.44725284994877]
The YOLO series has set a new benchmark for real-time object detectors. Transformer-based structures have emerged as the most powerful solution. However, the quadratic complexity of the self-attention mechanism increases the computational burden. We introduce a simple yet effective baseline approach called Mamba YOLO.
arXiv Detail & Related papers (2024-06-09T15:56:19Z)
Visual Mamba: A Survey and New Outlooks [33.90213491829634]
Mamba, a recent selective structured state space model, excels in long sequence modeling. Since January 2024, Mamba has been actively applied to diverse computer vision tasks. This paper reviews visual Mamba approaches, analyzing over 200 papers.
arXiv Detail & Related papers (2024-04-29T16:51:30Z)
PointMamba: A Simple State Space Model for Point Cloud Analysis [65.59944745840866]
We propose PointMamba, transferring the success of Mamba, a recent representative state space model (SSM), from NLP to point cloud analysis tasks. Unlike traditional Transformers, PointMamba employs a linear complexity algorithm, presenting global modeling capacity while significantly reducing computational costs.
arXiv Detail & Related papers (2024-02-16T14:56:13Z)
BlackMamba: Mixture of Experts for State-Space Models [10.209192169793772]
State-space models (SSMs) have recently demonstrated performance competitive with transformers on large-scale language modeling benchmarks. MoE models have shown remarkable performance while significantly reducing the compute and latency costs of inference. We present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both.
arXiv Detail & Related papers (2024-02-01T07:15:58Z)
SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation [16.476244833079182]
We introduce SegMamba, a novel 3D medical image Segmentation Mamba model. SegMamba excels in whole volume feature modeling from a state space model standpoint. Experiments on the BraTS2023 dataset demonstrate the effectiveness and efficiency of our SegMamba.
arXiv Detail & Related papers (2024-01-24T16:17:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.