DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
- URL: http://arxiv.org/abs/2405.14224v2
- Date: Wed, 10 Jul 2024 09:02:11 GMT
- Title: DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
- Authors: Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu,
- Abstract summary: Diffusion Mamba (DiM) is a sequence model for efficient high-resolution image synthesis.
DiM architecture achieves inference-time efficiency for high-resolution images.
Experiments demonstrate the effectiveness and efficiency of our DiM.
- Score: 56.849285913695184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have achieved great success in image generation, with the backbone evolving from U-Net to Vision Transformers. However, the computational cost of Transformers is quadratic to the number of tokens, leading to significant challenges when dealing with high-resolution images. In this work, we propose Diffusion Mamba (DiM), which combines the efficiency of Mamba, a sequence model based on State Space Models (SSM), with the expressive power of diffusion models for efficient high-resolution image synthesis. To address the challenge that Mamba cannot generalize to 2D signals, we make several architecture designs including multi-directional scans, learnable padding tokens at the end of each row and column, and lightweight local feature enhancement. Our DiM architecture achieves inference-time efficiency for high-resolution images. In addition, to further improve training efficiency for high-resolution image generation with DiM, we investigate "weak-to-strong" training strategy that pretrains DiM on low-resolution images ($256\times 256$) and then finetune it on high-resolution images ($512 \times 512$). We further explore training-free upsampling strategies to enable the model to generate higher-resolution images (e.g., $1024\times 1024$ and $1536\times 1536$) without further fine-tuning. Experiments demonstrate the effectiveness and efficiency of our DiM. The code of our work is available here: {\url{https://github.com/tyshiwo1/DiM-DiffusionMamba/}}.
Related papers
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z) - Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which non-autoregressive masked image modeling (MIM) text-to-image elevates to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - Scalable Autoregressive Image Generation with Mamba [23.027439743155192]
We introduce AiM, an autoregressive (AR) image generative model based on Mamba architecture.
Mamba is a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time.
We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B.
arXiv Detail & Related papers (2024-08-22T09:27:49Z) - Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization.
Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512.
arXiv Detail & Related papers (2024-06-13T17:59:58Z) - Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation [41.54814517077309]
We introduce a novel diffusion architecture, Diffusion Mamba (DiM), which foregoes traditional attention mechanisms in favor of a scalable alternative.
DiM achieves rapid inference times and reduced computational load, maintaining linear complexity with respect to sequence length.
Results affirm the scalability and efficiency of DiM, establishing a new benchmark for image and video generation techniques.
arXiv Detail & Related papers (2024-05-24T18:50:27Z) - DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.
DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z) - Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z) - DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT)
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z) - Matryoshka Diffusion Models [38.26966802461602]
Diffusion models are the de facto approach for generating high-quality images and videos.
We introduce Matryoshka Diffusion Models, an end-to-end framework for high-resolution image and video synthesis.
We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications.
arXiv Detail & Related papers (2023-10-23T17:20:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.