Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation
- URL: http://arxiv.org/abs/2506.18999v1
- Date: Mon, 23 Jun 2025 18:01:19 GMT
- Title: Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation
- Authors: Yuan Yao, Yicong Hong, Difan Liu, Long Mai, Feng Liu, Jiebo Luo
- Abstract summary: This paper introduces diffusion transformer-to-mamba distillation (T2MD) to form an efficient training pipeline. We establish a diffusion self-attention and Mamba hybrid model that simultaneously achieves efficiency and global dependencies. Experiments demonstrate that our training path leads to low overhead but high-quality text-to-image generation.
- Score: 65.46359545280546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quadratic computational complexity of self-attention in diffusion transformers (DiT) introduces substantial computational costs in high-resolution image generation. While the linear-complexity Mamba model emerges as a potential alternative, direct Mamba training remains empirically challenging. To address this issue, this paper introduces diffusion transformer-to-mamba distillation (T2MD), forming an efficient training pipeline that facilitates the transition from the self-attention-based transformer to the linear complexity state-space model Mamba. We establish a diffusion self-attention and Mamba hybrid model that simultaneously achieves efficiency and global dependencies. With the proposed layer-level teacher forcing and feature-based knowledge distillation, T2MD alleviates the training difficulty and high cost of a state space model from scratch. Starting from the distilled 512$\times$512 resolution base model, we push the generation towards 2048$\times$2048 images via lightweight adaptation and high-resolution fine-tuning. Experiments demonstrate that our training path leads to low overhead but high-quality text-to-image generation. Importantly, our results also justify the feasibility of using sequential and causal Mamba models for generating non-causal visual output, suggesting the potential for future exploration.
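The scaling argument in the abstract can be made concrete with a little arithmetic. The sketch below compares how a quadratic-cost layer (self-attention) and a linear-cost layer (a state-space model such as Mamba) scale when moving from 512×512 to 2048×2048. The `downsample` factor is a hypothetical combined VAE/patch stride, not a value taken from the paper; the resulting ratios do not depend on it as long as it is the same at both resolutions. Constant factors and non-sequence-mixing layers are ignored, so this illustrates asymptotic scaling only, not measured speedups.

```python
def token_count(resolution: int, downsample: int = 16) -> int:
    """Number of tokens for a square image.

    `downsample` is a hypothetical combined VAE/patch stride (an assumption
    for illustration, not a value from the paper).
    """
    side = resolution // downsample
    return side * side


def relative_cost(base_res: int, target_res: int, downsample: int = 16) -> dict:
    """Cost growth from base_res to target_res, relative to base_res."""
    n_base = token_count(base_res, downsample)
    n_target = token_count(target_res, downsample)
    ratio = n_target / n_base
    return {
        "tokens": ratio,           # growth in sequence length
        "linear": ratio,           # O(N) sequence mixing, e.g. a state-space scan
        "quadratic": ratio ** 2,   # O(N^2) sequence mixing, e.g. self-attention
    }


costs = relative_cost(512, 2048)
print(costs)  # {'tokens': 16.0, 'linear': 16.0, 'quadratic': 256.0}
```

Under these assumptions, quadrupling the side length multiplies the token count by 16, so a quadratic layer costs 256 times as much while a linear layer costs 16 times as much.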
Related papers
- MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling [36.527618275553955]
We present a hybrid model MaTVLM by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. We employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM. Remarkably, the MaTVLM achieves up to 3.6x faster inference than the teacher model while reducing GPU memory consumption by 27.5%.
arXiv Detail & Related papers (2025-03-17T17:59:01Z)
- MatIR: A Hybrid Mamba-Transformer Image Restoration Model [95.17418386046054]
We propose a Mamba-Transformer hybrid image restoration model called MatIR. MatIR cross-cycles the blocks of the Transformer layer and the Mamba layer to extract features. In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths.
arXiv Detail & Related papers (2025-01-30T14:55:40Z)
- MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation [63.73137438677585]
MaskMamba is a novel hybrid model that combines Mamba and Transformer architectures.
It achieves a remarkable $54.44\%$ improvement in inference speed at a resolution of $2048\times 2048$ over Transformer.
arXiv Detail & Related papers (2024-09-30T04:28:55Z)
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show how to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model achieves performance comparable to the original Transformer in chat benchmarks. We also introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models.
arXiv Detail & Related papers (2024-08-27T17:56:11Z)
- DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis [56.849285913695184]
Diffusion Mamba (DiM) is a sequence model for efficient high-resolution image synthesis.
DiM architecture achieves inference-time efficiency for high-resolution images.
Experiments demonstrate the effectiveness and efficiency of our DiM.
arXiv Detail & Related papers (2024-05-23T06:53:18Z)
- Improved Transformer for High-Resolution GANs [69.42469272015481]
We introduce two key ingredients to Transformer to address this challenge.
We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128 \times 128$ and FFHQ $256 \times 256$, respectively.
arXiv Detail & Related papers (2021-06-14T17:39:49Z)