DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation
- URL: http://arxiv.org/abs/2411.04168v1
- Date: Wed, 06 Nov 2024 18:59:17 GMT
- Title: DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation
- Authors: Hao Phung, Quan Dao, Trung Dao, Hoang Phan, Dimitris Metaxas, Anh Tran
- Abstract summary: We introduce a novel state-space architecture for diffusion models.
We harness spatial and frequency information to enhance the inductive bias towards local features in input images.
- Score: 4.391439322050918
- Abstract: We introduce a novel state-space architecture for diffusion models, effectively harnessing spatial and frequency information to enhance the inductive bias towards local features in input images for image generation tasks. While state-space networks, including Mamba, a revolutionary advancement in recurrent neural networks, typically scan input sequences from left to right, they face difficulties in designing effective scanning strategies, especially in the processing of image data. Our method demonstrates that integrating wavelet transformation into Mamba enhances the local structure awareness of visual inputs and better captures long-range relations of frequencies by disentangling them into wavelet subbands, representing both low- and high-frequency components. These wavelet-based outputs are then processed and seamlessly fused with the original Mamba outputs through a cross-attention fusion layer, combining both spatial and frequency information to optimize the order awareness of state-space models, which is essential for the details and overall quality of image generation. In addition, we introduce a globally-shared transformer to supercharge the performance of Mamba, harnessing its exceptional power to capture global relationships. Through extensive experiments on standard benchmarks, our method demonstrates superior results compared to DiT and DIFFUSSM, achieving faster training convergence and delivering high-quality outputs. The code and pretrained models are released at https://github.com/VinAIResearch/DiMSUM.git.
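The wavelet step described in the abstract, splitting an image into low- and high-frequency subbands, can be illustrated with a small sketch. The abstract does not say which wavelet, how many decomposition levels, or which library DiMSUM uses, so the single-level Haar transform below is an assumption chosen for simplicity, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical. It only shows what "disentangling into subbands" means, not the authors' implementation.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform (an illustrative assumption, not DiMSUM's code).

    `image` is a list of rows with even height and width. Returns four
    subbands (ll, lh, hl, hh), each half the input size in both dimensions.
    Note: the lh/hl naming convention differs between libraries.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, w, 2):
            a, b = image[i][j], image[i][j + 1]          # top-left, top-right
            c, d = image[i + 1][j], image[i + 1][j + 1]  # bottom-left, bottom-right
            r_ll.append((a + b + c + d) / 2)  # local average: low-frequency content
            r_lh.append((a + b - c - d) / 2)  # top minus bottom rows
            r_hl.append((a - b + c - d) / 2)  # left minus right columns
            r_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Invert haar_dwt2, rebuilding the full-resolution image from the subbands."""
    image = []
    for i in range(len(ll)):
        top, bottom = [], []
        for j in range(len(ll[0])):
            s, v, u, x = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            top += [(s + v + u + x) / 2, (s + v - u - x) / 2]
            bottom += [(s - v + u - x) / 2, (s - v - u + x) / 2]
        image.append(top)
        image.append(bottom)
    return image
```

Because each 2x2 block is mapped by an orthogonal matrix, the four subbands together keep all the information in the input, and `haar_idwt2` recovers it exactly. A frequency branch can therefore process the subbands as separate token groups without losing detail.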
Related papers
- MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
- Shuffle Mamba: State Space Models with Random Shuffle for Multi-Modal Image Fusion [28.543822934210404]
Multi-modal image fusion integrates complementary information from different modalities to produce enhanced and informative images.
We propose a novel Bayesian-inspired scanning strategy called Random Shuffle to eliminate biases associated with fixed sequence scanning.
We develop a testing methodology based on Monte-Carlo averaging to ensure the model's output aligns more closely with expected results.
arXiv Detail & Related papers (2024-09-03T09:12:18Z)
- A Hybrid Transformer-Mamba Network for Single Image Deraining [70.64069487982916]
Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions.
We introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies.
arXiv Detail & Related papers (2024-08-31T10:03:19Z)
- LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity.
Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution.
Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% in GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
- MxT: Mamba x Transformer for Image Inpainting [11.447968918063335]
Image inpainting aims to restore missing or damaged regions of images with semantically coherent content.
We introduce MxT composed of the proposed Hybrid Module (HM), which combines Mamba with the transformer in a synergistic manner.
Our HM facilitates dual-level interaction learning at both pixel and patch levels, greatly enhancing the model's ability to reconstruct images with high quality and contextual accuracy.
arXiv Detail & Related papers (2024-07-23T02:21:11Z)
- Mamba-based Light Field Super-Resolution with Efficient Subspace Scanning [48.99361249764921]
Transformer-based methods have demonstrated impressive performance in 4D light field (LF) super-resolution.
However, their quadratic complexity hinders the efficient processing of high resolution 4D inputs.
We propose a Mamba-based Light Field Super-Resolution method, named MLFSR, by designing an efficient subspace scanning strategy.
arXiv Detail & Related papers (2024-06-23T11:28:08Z)
- DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis [56.849285913695184]
Diffusion Mamba (DiM) is a sequence model for efficient high-resolution image synthesis.
The DiM architecture achieves inference-time efficiency for high-resolution images.
Experiments demonstrate the effectiveness and efficiency of our DiM.
arXiv Detail & Related papers (2024-05-23T06:53:18Z)
- Mamba-in-Mamba: Centralized Mamba-Cross-Scan in Tokenized Mamba Model for Hyperspectral Image Classification [4.389334324926174]
This study introduces the innovative Mamba-in-Mamba (MiM) architecture for HSI classification, the first attempt to deploy a State Space Model (SSM) for this task.
The MiM model includes 1) a novel centralized Mamba-Cross-Scan (MCS) mechanism for transforming images into sequence data, 2) a Tokenized Mamba (T-Mamba) encoder, and 3) a Weighted MCS Fusion (WMF) module.
Experimental results from three public HSI datasets demonstrate that our method outperforms existing baselines and state-of-the-art approaches.
arXiv Detail & Related papers (2024-05-20T13:19:02Z)
- FreqMamba: Viewing Mamba from a Frequency Perspective for Image Deraining [1.6793052475826054]
Images corrupted by rain streaks often lose vital frequency information for perception, and image deraining aims to solve this issue.
Recent studies have witnessed the effectiveness and efficiency of Mamba for perceiving global and local information.
We propose FreqMamba, an effective and efficient paradigm that leverages the complementarity between Mamba and frequency analysis for image deraining.
arXiv Detail & Related papers (2024-04-15T06:02:31Z)
- Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing [54.168567276280505]
We propose a novel Mutual Information-driven Triple interaction Network (MITNet) for image dehazing.
The first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal.
The second stage, named phase-guided structure refined, is devoted to learning the transformation and refinement of the phase spectrum.
arXiv Detail & Related papers (2023-08-14T08:23:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.