Efficient Deweather Mixture-of-Experts with Uncertainty-aware
Feature-wise Linear Modulation
- URL: http://arxiv.org/abs/2312.16610v1
- Date: Wed, 27 Dec 2023 15:23:37 GMT
- Title: Efficient Deweather Mixture-of-Experts with Uncertainty-aware
Feature-wise Linear Modulation
- Authors: Rongyu Zhang, Yulin Luo, Jiaming Liu, Huanrui Yang, Zhen Dong, Denis
Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Yuan Du, Shanghang
Zhang
- Abstract summary: We propose an efficient Mixture-of-Experts (MoE) architecture with weight sharing across experts.
MoFME implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block.
Experiments show that our MoFME outperforms the baselines in the image restoration quality by 0.1-0.2 dB.
- Score: 44.43376913419967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Mixture-of-Experts (MoE) approach has demonstrated outstanding
scalability in multi-task learning including low-level upstream tasks such as
concurrent removal of multiple adverse weather effects. However, the
conventional MoE architecture with parallel Feed Forward Network (FFN) experts
leads to significant parameter and computational overheads that hinder its
efficient deployment. In addition, the naive MoE linear router is suboptimal in
assigning task-specific features to multiple experts which limits its further
scalability. In this work, we propose an efficient MoE architecture with weight
sharing across the experts. Inspired by the idea of linear feature modulation
(FM), our architecture implicitly instantiates multiple experts via learnable
activation modulations on a single shared expert block. The proposed Feature
Modulated Expert (FME) serves as a building block for the novel
Mixture-of-Feature-Modulation-Experts (MoFME) architecture, which can scale up
the number of experts with low overhead. We further propose an
Uncertainty-aware Router (UaR) to assign task-specific features to different FM
modules with well-calibrated weights. This enables MoFME to effectively learn
diverse expert functions for multiple tasks. The conducted experiments on the
multi-deweather task show that our MoFME outperforms the baselines in the image
restoration quality by 0.1-0.2 dB and achieves SOTA-compatible performance
while saving more than 72% of parameters and 39% inference time over the
conventional MoE counterpart. Experiments on the downstream segmentation and
classification tasks further demonstrate the generalizability of MoFME to real
open-world applications.
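The abstract describes the mechanism concretely enough to sketch: a single shared expert block whose activations are modulated by per-expert FiLM-style (scale, shift) parameters, with an uncertainty-aware router mixing the resulting implicit experts. Below is a minimal PyTorch rendering of that idea; the Monte-Carlo-dropout uncertainty estimate, the GELU activation, and the residual connection are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoFMELayer(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int, mc_samples: int = 4):
        super().__init__()
        # Single shared expert block (one FFN) instead of one FFN per expert.
        self.shared_in = nn.Linear(dim, hidden)
        self.shared_out = nn.Linear(hidden, dim)
        # Each implicit "expert" is just a FiLM-style (scale, shift) pair.
        self.scales = nn.Parameter(torch.ones(num_experts, hidden))
        self.shifts = nn.Parameter(torch.zeros(num_experts, hidden))
        # Router with dropout so stochastic routing samples can be drawn.
        self.router = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Dropout(0.1), nn.Linear(dim, num_experts)
        )
        self.mc_samples = mc_samples

    def routing_weights(self, x: torch.Tensor) -> torch.Tensor:
        # Illustrative uncertainty-aware routing: average several Monte-Carlo
        # dropout samples of the router logits before the softmax.
        self.router.train()  # keep dropout active even at inference time
        logits = torch.stack([self.router(x) for _ in range(self.mc_samples)])
        return F.softmax(logits.mean(dim=0), dim=-1)          # [B, num_experts]

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: [B, dim]
        w = self.routing_weights(x)                            # [B, E]
        h = F.gelu(self.shared_in(x))                          # [B, hidden]
        # Implicitly instantiate E experts by modulating the shared activation,
        # then mix them with the router weights.
        modulated = h.unsqueeze(1) * self.scales + self.shifts # [B, E, hidden]
        mixed = (w.unsqueeze(-1) * modulated).sum(dim=1)       # [B, hidden]
        return x + self.shared_out(mixed)                      # residual output
```

Note that adding an expert here costs only one extra (scale, shift) pair rather than a full FFN, which is what keeps the parameter and compute overhead low relative to a conventional MoE.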
Related papers
- PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model [30.620582168350698]
Mixture-of-Experts (MoE) has emerged as a powerful approach for scaling transformers with improved resource utilization.
Inspired by recent works on Parameter-Efficient Fine-Tuning (PEFT), we present a unified framework for integrating PEFT modules directly into the MoE mechanism.
arXiv Detail & Related papers (2024-11-12T22:03:37Z)
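The PERFT summary above only states that PEFT modules are integrated directly into the MoE mechanism; one plausible reading, sketched below, attaches a LoRA-style adapter to each frozen expert so that only the adapters (and optionally the router) are trained. The adapter placement and the LoRA choice are assumptions for illustration, not PERFT's actual design.

```python
import torch
import torch.nn as nn


class LoRAExpert(nn.Module):
    """A frozen expert FFN plus a small trainable low-rank (LoRA-style) adapter."""

    def __init__(self, base_ffn: nn.Module, dim: int, rank: int = 8):
        super().__init__()
        self.base = base_ffn
        for p in self.base.parameters():
            p.requires_grad = False          # keep the original expert frozen
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))


# Hypothetical usage: wrap every expert of an existing MoE layer in place.
# moe_layer.experts = nn.ModuleList(LoRAExpert(e, dim=768) for e in moe_layer.experts)
```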
- Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z)
- Scalable Multi-Domain Adaptation of Language Models using Modular Experts [10.393155077703653]
MoDE is a mixture-of-experts architecture that augments a general PLM with modular, domain-specialized experts.
MoDE achieves target performance comparable to full-parameter fine-tuning while delivering 1.65% better retention performance.
arXiv Detail & Related papers (2024-10-14T06:02:56Z)
- Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models [24.915387910764082]
Expert-Specialized Fine-Tuning, or ESFT, tunes the experts most relevant to downstream tasks while freezing the other experts and modules.
MoE models with finer-grained experts are better at selecting the combination of experts most relevant to a downstream task.
arXiv Detail & Related papers (2024-07-02T03:11:13Z)
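A minimal sketch of the ESFT recipe summarized above: estimate which experts matter for the downstream task, then unfreeze only those while everything else stays frozen. Relevance is approximated here by top-1 routing frequency, and `router`, `experts`, and `task_batches` are hypothetical handles into an existing MoE model; the paper's actual relevance score may differ.

```python
import torch


@torch.no_grad()
def select_relevant_experts(router, experts, task_batches, keep: int):
    """Unfreeze only the experts most frequently routed to on the target task."""
    counts = torch.zeros(len(experts))
    for x in task_batches:                                   # x: [B, dim] task features
        top1 = router(x).argmax(dim=-1)                      # expert chosen per sample
        counts += torch.bincount(top1, minlength=len(experts)).float()
    keep_ids = set(counts.topk(keep).indices.tolist())
    for i, expert in enumerate(experts):
        for p in expert.parameters():
            p.requires_grad = i in keep_ids                  # tune relevant experts only
    return sorted(keep_ids)
```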
- Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution, with its sparse architecture enabling effective task decoupling.
Intuition-MoR1E achieves superior efficiency and 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z)
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks.
We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low-rank experts.
arXiv Detail & Related papers (2023-12-01T23:04:27Z)
- Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning [7.094820944028638]
We propose an extremely parameter-efficient MoE by combining the MoE architecture with lightweight experts.
Our method generalizes to unseen tasks as it does not depend on any prior task knowledge.
Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even when subjected to rigorous parameter constraints.
arXiv Detail & Related papers (2023-09-11T13:31:00Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, an MoE with an outrageous number of parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
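For context, the "gated routing network" that makes experts "conditionally activated" in the MoEC entry above is the standard top-k sparse-MoE pattern; a minimal version is sketched below. MoEC's specific expert-cluster construction and regularization are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)               # gated routing network
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: [B, dim]
        topv, topi = self.gate(x).topk(self.k, dim=-1)        # keep only top-k experts
        weights = F.softmax(topv, dim=-1)                      # [B, k]
        out = torch.zeros_like(x)
        for slot in range(self.k):                             # conditional activation:
            for e in topi[:, slot].unique().tolist():          # each expert only sees
                mask = topi[:, slot] == e                      # the tokens routed to it
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```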
- Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
However, MoE is hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z)
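The progressive expert-dropping idea in the last entry could be instantiated roughly as follows; the usage statistic, the pruning schedule, and the `finetune_fn`/`usage_fn` hooks are hypothetical, and a real implementation would also have to shrink the router's output dimension after each drop.

```python
import torch


def progressive_expert_pruning(moe_layer, finetune_fn, usage_fn,
                               target: int = 1, drop_per_round: int = 1):
    """Alternately fine-tune on the target task and drop the least-used experts."""
    while len(moe_layer.experts) > target:
        finetune_fn(moe_layer)                 # adapt on the downstream task
        usage = usage_fn(moe_layer)            # e.g. per-expert routing counts
        order = torch.argsort(usage)           # least-used ("non-professional") first
        for idx in sorted(order[:drop_per_round].tolist(), reverse=True):
            del moe_layer.experts[idx]         # drop the expert
        # A real implementation would also shrink the router output here.
    return moe_layer
```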
This list is automatically generated from the titles and abstracts of the papers in this site.