Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation
- URL: http://arxiv.org/abs/2510.09094v1
- Date: Fri, 10 Oct 2025 07:42:27 GMT
- Title: Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation
- Authors: Youwei Zheng, Yuxi Ren, Xin Xia, Xuefeng Xiao, Xiaohua Xie
- Abstract summary: We transform a dense Diffusion Transformer (DiT) into a Mixture of Experts (MoE) for structured sparsification. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5%. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
- Score: 41.16959587963631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
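The core mechanism in the abstract, replacing a dense FFN with a routed MoE layer so only a top-k subset of experts is active per token, can be sketched as below. This is an illustrative toy in numpy, not the paper's implementation; all class and parameter names are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoEFFN:
    """Toy MoE layer standing in for a dense FFN: a router picks the
    top-k experts per token, so only a fraction of the FFN parameters
    is activated for any given token."""
    def __init__(self, d_model, d_hidden, n_experts, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.w_router = rng.standard_normal((d_model, n_experts)) * 0.02
        self.w1 = rng.standard_normal((n_experts, d_model, d_hidden)) * 0.02
        self.w2 = rng.standard_normal((n_experts, d_hidden, d_model)) * 0.02
        self.top_k = top_k

    def __call__(self, x):  # x: (tokens, d_model)
        gate = softmax(x @ self.w_router)               # (tokens, n_experts)
        top = np.argsort(-gate, axis=-1)[:, :self.top_k]
        y = np.zeros_like(x)
        for t in range(x.shape[0]):
            weights = gate[t, top[t]]
            weights = weights / weights.sum()           # renormalize over chosen experts
            for e, w in zip(top[t], weights):
                h = np.maximum(x[t] @ self.w1[e], 0.0)  # expert FFN with ReLU
                y[t] += w * (h @ self.w2[e])
        return y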
Related papers
- Elastic Diffusion Transformer [32.62353162897611]
Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. We propose Elastic Diffusion Transformer (E-DiT), an adaptive acceleration framework for DiT.
arXiv Detail & Related papers (2026-02-15T05:19:17Z)
- Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers [10.251154683874033]
Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs. We propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures.
arXiv Detail & Related papers (2025-11-20T08:53:07Z)
- ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
Mixture-of-Experts (MoE) Transformer, the backbone of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. We introduce ResMoE, an innovative MoE approximation framework that utilizes the Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
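The ResMoE idea, one shared "barycenter" expert plus a compressed residual per expert, can be sketched as below. A plain element-wise mean stands in for the Wasserstein barycenter used in the paper, and a rank-truncated SVD stands in for its residual approximation; function names and shapes are illustrative.

```python
import numpy as np

# Hedged sketch: store one shared "barycenter" expert plus a low-rank
# residual per expert. The mean is a stand-in for the paper's Wasserstein
# barycenter; truncated SVD is a stand-in for its residual compression.
def compress_experts(experts, rank):
    barycenter = np.mean(experts, axis=0)
    residuals = []
    for e in experts:
        u, s, vt = np.linalg.svd(e - barycenter, full_matrices=False)
        residuals.append((u[:, :rank] * s[:rank]) @ vt[:rank])  # best rank-r fit
    return barycenter, residuals

def restore_expert(barycenter, residual):
    return barycenter + residual
```

Restoring an expert then costs one shared matrix plus a small residual, rather than a full weight matrix per expert.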
arXiv Detail & Related papers (2025-03-10T03:15:54Z)
- BEExformer: A Fast Inferencing Binarized Transformer with Early Exits [2.7651063843287718]
We introduce Binarized Early Exit Transformer (BEExformer), the first selective-learning-based transformer integrating Binarization-Aware Training (BAT) with Early Exit (EE). BAT employs a differentiable second-order approximation to the sign function, enabling gradients that capture both the sign and magnitude of the weights. The EE mechanism hinges on the fractional reduction in entropy among intermediate transformer blocks with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.08% and even improves accuracy by 2.89% by resolving the "overthinking" problem inherent in deep networks.
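The early-exit side of the summary above, stopping at an intermediate block once the prediction is confident enough, can be illustrated with a minimal entropy-threshold loop. BEExformer's actual criterion (fractional entropy reduction with soft-routing loss) is richer; this sketch and its threshold are assumptions for illustration only.

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def early_exit(logits_per_block, threshold=0.25):
    """Toy early exit: return at the first block whose predictive entropy
    falls below a threshold, skipping the remaining blocks entirely."""
    for depth, logits in enumerate(logits_per_block):
        p = np.exp(logits - logits.max())  # stable softmax
        p /= p.sum()
        if entropy(p) < threshold:
            return depth, int(p.argmax())
    return len(logits_per_block) - 1, int(p.argmax())
```

A sequence of progressively sharper logits exits as soon as one block is confident, which is where the FLOP savings come from.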
arXiv Detail & Related papers (2024-12-06T17:58:14Z)
- TinyFusion: Diffusion Transformers Learned Shallow [52.96232442322824]
Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization. We present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2× speedup with an FID score of 2.86.
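Depth pruning as described above can be pictured as selecting a subset of layers to keep. TinyFusion learns the retained subset end-to-end; the score-based selection below is only a simplified stand-in, with hypothetical names throughout.

```python
import numpy as np

def prune_depth(layers, importance, keep):
    """Keep the `keep` layers with the highest importance scores, preserving
    their original order. This greedy score-based selection is a stand-in for
    TinyFusion's learned (end-to-end) layer retention."""
    order = np.argsort(-np.asarray(importance))[:keep]
    return [layers[i] for i in sorted(order)]
```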
arXiv Detail & Related papers (2024-12-02T07:05:39Z)
- FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers [30.88764351013966]
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains. Recent works have observed redundancy within transformer blocks and developed compression methods by structured pruning of less important blocks. We propose FuseGPT, a novel methodology designed to recycle pruned transformer blocks, thereby recovering the model's performance.
arXiv Detail & Related papers (2024-11-21T09:49:28Z)
- An Analysis on Quantizing Diffusion Transformers [19.520194468481655]
Post-Training Quantization (PTQ) offers an immediate remedy for a smaller storage size and more memory-efficient computation during inference.
We propose a single-step sampling calibration on activations and adapt group-wise quantization on weights for low-bit quantization.
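The group-wise weight quantization mentioned above can be sketched as giving every small group of consecutive weights its own scale, so an outlier only inflates the scale of its own group. This is a generic symmetric-quantization illustration, not the paper's exact scheme; group size, bit width, and calibration details are assumptions.

```python
import numpy as np

def groupwise_quantize(w, group_size=4, bits=4):
    """Illustrative symmetric group-wise quantization: every `group_size`
    consecutive weights share one scale factor."""
    qmax = 2 ** (bits - 1) - 1
    flat = w.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # avoid divide-by-zero on all-zero groups
    q = np.round(flat / scale).astype(np.int8)
    return q, scale

def groupwise_dequantize(q, scale, shape):
    return (q * scale).reshape(shape)
```

Because each group's scale bounds its rounding step, the per-element reconstruction error is at most half that group's scale.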
arXiv Detail & Related papers (2024-06-16T23:18:35Z)
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models.
Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa-Large and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5-XXL.
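"Activation density" here is the fraction of nonzero activations; the density loss encourages finetuning to drive it down. A minimal sketch, assuming an L1 penalty as a differentiable surrogate (the paper's actual loss may be defined differently):

```python
import numpy as np

def activation_density(h):
    """Fraction of nonzero activations (e.g. post-ReLU): the quantity DEFT aims to reduce."""
    return float((h != 0).mean())

def density_surrogate_loss(h, weight=0.1):
    """Hypothetical differentiable surrogate: an L1 penalty on activations,
    which pushes post-ReLU values toward zero and hence lowers density."""
    return weight * float(np.abs(h).mean())
```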
arXiv Detail & Related papers (2024-02-02T21:25:46Z)
- Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.