Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection
- URL: http://arxiv.org/abs/2409.15557v1
- Date: Mon, 23 Sep 2024 21:27:26 GMT
- Title: Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection
- Authors: Alireza Ganjdanesh, Yan Kang, Yuchen Liu, Richard Zhang, Zhe Lin, Heng Huang
- Abstract summary: We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts.
We demonstrate the effectiveness of our method, DiffPruning, across several datasets.
- Score: 63.96018203905272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion probabilistic models can generate high-quality samples. Yet, their sampling process requires numerous denoising steps, making it slow and computationally intensive. We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts. First, we study the similarities between pairs of denoising timesteps, observing a natural clustering, even across different datasets. This suggests that rather than having a single model for all time steps, separate models can serve as "experts" for their respective time intervals. As such, we separately fine-tune the pretrained model on each interval, with elastic dimensions in depth and width, to obtain experts specialized in their corresponding denoising interval. To optimize the resource usage between experts, we introduce our Expert Routing Agent, which learns to select a set of proper network configurations. By doing so, our method can allocate the computing budget between the experts in an end-to-end manner without requiring manual heuristics. Finally, with a selected configuration, we fine-tune our pruned experts to obtain our mixture of efficient experts. We demonstrate the effectiveness of our method, DiffPruning, across several datasets (LSUN-Church, LSUN-Beds, FFHQ, and ImageNet) on the Latent Diffusion Model architecture.
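At sampling time, the mixture described above reduces to a simple dispatch: each denoising timestep is served by the pruned expert whose interval contains it. The sketch below illustrates only that dispatch; the class and call signatures are illustrative assumptions, not the released DiffPruning code, and the interval boundaries stand in for the clustering stage described in the abstract.

```python
# Illustrative sketch only (not the authors' implementation): dispatch each
# denoising timestep to the pruned expert responsible for its interval.
import torch
import torch.nn as nn


class IntervalMixtureDenoiser(nn.Module):
    """Mixture of interval experts for a diffusion sampler.

    experts:    one pruned denoiser per interval (assumed to take (x_t, t)).
    boundaries: interval edges from the timestep-clustering stage,
                e.g. [0, 250, 600, 1000] for three experts over 1000 steps.
    """

    def __init__(self, experts: list, boundaries: list):
        super().__init__()
        assert len(boundaries) == len(experts) + 1
        self.experts = nn.ModuleList(experts)
        self.boundaries = boundaries

    def expert_index(self, t: int) -> int:
        # Locate the interval [boundaries[i], boundaries[i+1]) containing t.
        for i in range(len(self.experts)):
            if self.boundaries[i] <= t < self.boundaries[i + 1]:
                return i
        return len(self.experts) - 1

    def forward(self, x_t: torch.Tensor, t: int) -> torch.Tensor:
        # Only one (smaller) expert runs per step, which is the source of savings.
        return self.experts[self.expert_index(t)](x_t, torch.tensor([t]))
```

In the paper's pipeline, the experts themselves are elastic sub-networks whose depth and width are chosen by the Expert Routing Agent before the final fine-tuning stage; here they are simply taken as given modules.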
Related papers
- Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging [36.0133566024214]
Upcycling Instruction Tuning (UpIT) is a data-efficient approach for tuning a dense pre-trained model into a MoE instruction model.
To ensure each specialized expert in the MoE model works as expected, we select a small amount of seed data at which each expert excels and use it to pre-optimize the router.
arXiv Detail & Related papers (2024-10-02T14:48:22Z)
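A rough way to picture the router pre-optimization step described in the UpIT entry is to fit the router so that each expert's seed examples are routed to that expert before joint MoE tuning. The snippet below is a hedged sketch under that reading; the data layout, loss, and loop are assumptions, not UpIT's actual procedure.

```python
# Hedged sketch: pre-optimize a linear router so that seed data associated
# with expert k is routed to expert k (not UpIT's actual training recipe).
import torch
import torch.nn as nn


def pretrain_router(router: nn.Linear,
                    seed_batches: list,
                    epochs: int = 3,
                    lr: float = 1e-3) -> None:
    """seed_batches: list of (hidden_states, expert_index) pairs, where
    hidden_states has shape (batch, hidden_dim)."""
    optimizer = torch.optim.Adam(router.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for hidden, expert_idx in seed_batches:
            logits = router(hidden)  # (batch, num_experts)
            target = torch.full((hidden.size(0),), expert_idx, dtype=torch.long)
            loss = loss_fn(logits, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```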
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive task-aware pruning technique, UNCURL, to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
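The UNCURL entry above describes dropping experts per MoE layer offline, after training, using task-specific information. A generic version of that kind of post-hoc selection (not the paper's exact criterion) is to rank experts by how often the router picks them on the task's data and keep only the top-k:

```python
# Hedged sketch of offline, task-aware expert pruning: retain the experts the
# router selects most often on task data. The usage-count criterion is an
# assumption for illustration, not UNCURL's actual rule.
import torch


def experts_to_keep(routing_logits: torch.Tensor, keep: int) -> list:
    """routing_logits: (num_tokens, num_experts) scores collected on task data."""
    top1 = routing_logits.argmax(dim=-1)                      # winning expert per token
    usage = torch.bincount(top1, minlength=routing_logits.size(-1))
    return usage.topk(keep).indices.tolist()                  # expert indices to retain
```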
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Kullback-Leibler Barycentre of Stochastic Processes [0.0]
We consider the problem where an agent aims to combine the views and insights of different experts' models.
We show the existence and uniqueness of the barycentre model and prove an explicit representation of the Radon-Nikodym derivative.
Two deep learning algorithms are proposed to find the optimal drift of the combined model.
arXiv Detail & Related papers (2024-07-05T20:45:27Z)
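For reference, one standard formulation of a Kullback-Leibler barycentre takes the model closest, in weighted KL divergence, to all of the experts' models; the paper's measure-theoretic setting for stochastic processes may differ in its details, so the display below is only the generic definition.

```latex
\[
  \mathbb{Q}^{\star} \;=\; \arg\min_{\mathbb{Q}} \sum_{i=1}^{n} w_i \,
  \mathrm{KL}\!\left(\mathbb{Q} \,\middle\|\, \mathbb{P}_i\right),
  \qquad w_i \ge 0, \quad \sum_{i=1}^{n} w_i = 1,
\]
% where $\mathbb{P}_1,\dots,\mathbb{P}_n$ are the experts' models and the
% minimizer $\mathbb{Q}^{\star}$ is the barycentre whose Radon-Nikodym
% derivative the paper characterizes explicitly.
```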
- Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation [15.556756363296543]
Predictive multiplicity refers to the phenomenon in which classification tasks admit multiple competing models that achieve almost-equally-optimal performance.
We propose a novel framework that utilizes dropout techniques for exploring models in the Rashomon set.
We show that our technique consistently outperforms baselines in terms of the effectiveness of predictive multiplicity metric estimation.
arXiv Detail & Related papers (2024-02-01T16:25:00Z)
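The dropout-based exploration described in the entry above can be pictured as keeping dropout stochastic at inference time, so every sampled mask acts as a near-optimal candidate model, and then measuring how much the sampled predictions disagree. The sketch below uses a plain pairwise-disagreement score, which is an illustrative stand-in rather than the paper's predictive multiplicity metrics.

```python
# Hedged sketch: sample candidate models from the Rashomon set by re-running
# the network with dropout active, then score disagreement between samples.
import torch
import torch.nn as nn


@torch.no_grad()
def dropout_disagreement(model: nn.Module, x: torch.Tensor, num_samples: int = 20) -> float:
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()          # keep only dropout layers stochastic
    preds = torch.stack([model(x).argmax(dim=-1) for _ in range(num_samples)])
    # Fraction of predictions that differ from the first sampled model's labels.
    return (preds[1:] != preds[0]).float().mean().item()
```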
- Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures [12.703947839247693]
Diffusion models, emerging as powerful deep generative tools, excel in various applications.
However, their remarkable generative performance is hindered by slow training and sampling.
This is due to the necessity of tracking extensive forward and reverse diffusion trajectories.
We present a multi-stage framework inspired by our empirical findings to tackle these challenges.
arXiv Detail & Related papers (2023-12-14T17:48:09Z)
- Layer Ensembles [95.42181254494287]
We introduce a method for uncertainty estimation that considers a set of independent categorical distributions for each layer of the network.
We show that the method can be further improved by ranking samples, resulting in models that require less memory and time to run.
arXiv Detail & Related papers (2022-10-10T17:52:47Z)
- On the Representation Collapse of Sparse Mixture of Experts [102.83396489230375]
Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead.
It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations.
However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
arXiv Detail & Related papers (2022-04-20T01:40:19Z)
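The routing mechanism discussed in the representation-collapse entry scores each token's hidden state against a learned per-expert embedding and dispatches the token to the best match; collapse is the tendency of hidden states to bunch around those expert embeddings. A bare-bones top-1 router of this shape is sketched below (dimensions and names are illustrative, not the paper's code).

```python
# Minimal top-1 token router: each expert owns a learned embedding, and every
# token is sent to the expert with the highest dot-product score.
import torch
import torch.nn as nn


class Top1Router(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.expert_embed = nn.Parameter(torch.randn(num_experts, hidden_dim))

    def forward(self, hidden: torch.Tensor):
        # hidden: (num_tokens, hidden_dim) -> scores: (num_tokens, num_experts)
        scores = hidden @ self.expert_embed.t()
        chosen = scores.argmax(dim=-1)                 # best-matched expert per token
        gate = scores.softmax(dim=-1).gather(-1, chosen.unsqueeze(-1)).squeeze(-1)
        return chosen, gate                            # expert index and gating weight
```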
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Lifelong Mixture of Variational Autoencoders [15.350366047108103]
We propose an end-to-end lifelong learning mixture of experts.
The experts in the mixture system are jointly trained by maximizing a mixture of individual component evidence lower bounds.
The model can learn new tasks quickly when they are similar to those previously learnt.
arXiv Detail & Related papers (2021-07-09T22:07:39Z)
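The lifelong mixture of VAEs above maximizes a mixture of per-component evidence lower bounds; written generically (the weighting scheme here is a standard form and may not match the paper's exact objective):

```latex
\[
  \mathcal{L}(x) \;=\; \sum_{k=1}^{K} \pi_k(x)
  \Big( \mathbb{E}_{q_{\phi_k}(z \mid x)}\big[\log p_{\theta_k}(x \mid z)\big]
        - \mathrm{KL}\big(q_{\phi_k}(z \mid x) \,\|\, p(z)\big) \Big),
\]
% where expert $k$ is a VAE with encoder $q_{\phi_k}$ and decoder $p_{\theta_k}$,
% and $\pi_k(x)$ are the mixture assignment weights.
```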