MoFE: Mixture of Frozen Experts Architecture
- URL: http://arxiv.org/abs/2503.06491v1
- Date: Sun, 09 Mar 2025 07:24:36 GMT
- Title: MoFE: Mixture of Frozen Experts Architecture
- Authors: Jean Seo, Jaeyoon Kim, Hyopil Shin
- Abstract summary: The MoFE architecture integrates Parameter-efficient Fine-tuning (PEFT) and the Mixture of Experts (MoE) architecture to enhance both training efficiency and model scalability. By freezing the Feed Forward Network layers within the MoE framework, MoFE significantly reduces the number of trainable parameters, improving training efficiency while still allowing for effective knowledge transfer from the expert models. We conduct experiments to evaluate the trade-offs between performance and efficiency, compare MoFE with other PEFT methodologies, assess the impact of domain expertise in the constituent models, and determine the optimal training strategy.
- Score: 0.3959905439285648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the Mixture of Frozen Experts (MoFE) architecture, which integrates Parameter-efficient Fine-tuning (PEFT) and the Mixture of Experts (MoE) architecture to enhance both training efficiency and model scalability. By freezing the Feed Forward Network (FFN) layers within the MoE framework, MoFE significantly reduces the number of trainable parameters, improving training efficiency while still allowing for effective knowledge transfer from the expert models. This facilitates the creation of models proficient in multiple domains. We conduct experiments to evaluate the trade-offs between performance and efficiency, compare MoFE with other PEFT methodologies, assess the impact of domain expertise in the constituent models, and determine the optimal training strategy. The results show that, although there may be some trade-offs in performance, the efficiency gains are substantial, making MoFE a reasonable solution for real-world, resource-constrained environments.
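The core mechanism described in the abstract, freezing the expert FFN layers while keeping the router and remaining weights trainable, can be illustrated with a short sketch. The code below is a hypothetical PyTorch illustration rather than the authors' implementation: the class names (FrozenExpertFFN, MoFELayer), the top-1 routing, and the layer dimensions are assumptions made for the example.

```python
# Hypothetical sketch of the frozen-expert idea: expert FFNs (e.g. taken from
# fine-tuned domain models) are frozen, and only the router stays trainable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenExpertFFN(nn.Module):
    """A standard transformer FFN block used as one expert."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))

class MoFELayer(nn.Module):
    """Top-1 routed MoE layer whose expert FFN parameters are frozen."""
    def __init__(self, experts: list[FrozenExpertFFN], d_model: int):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.router = nn.Linear(d_model, len(experts))  # trainable gate
        for p in self.experts.parameters():             # freeze expert FFNs
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); route each token to its top-1 expert
        gate = F.softmax(self.router(x), dim=-1)        # (tokens, n_experts)
        top_w, top_idx = gate.max(dim=-1)               # (tokens,)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: only the router's parameters are reported as trainable.
layer = MoFELayer([FrozenExpertFFN(64, 256) for _ in range(4)], d_model=64)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # equals the router's parameter count
```

Under this setup the trainable-parameter count is independent of the number and size of the frozen experts, which is the source of the efficiency gains the abstract reports.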
Related papers
- S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning [17.579948649237497]
We propose Structural Mixture of Residual Experts (S'MoRE), a novel framework that seamlessly integrates the efficiency of LoRA with the flexibility of MoE.
Specifically, S'MoRE employs hierarchical low-rank decomposition of expert weights, yielding residuals of varying orders interconnected in a multi-layer structure.
We prove that S'MoRE improves "structural flexibility" of traditional MoE (or Mixture-of-LoRA) by exponential order.
arXiv Detail & Related papers (2025-04-08T20:54:00Z) - OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning [3.8813502422318127]
Building a mixture-of-experts (MoE) architecture for Low-rank adaptation (LoRA) is emerging as a potential direction in parameter-efficient fine-tuning (PEFT). We first conduct a qualitative analysis to show that experts collapse to similar representations in vanilla MoE, limiting the capacity of modular design and computational efficiency. Motivated by these findings, we propose Orthogonal Mixture-of-Experts (OMoE). Our method is simple and alleviates memory bottlenecks, as it requires only a minimal number of experts compared to vanilla MoE models.
arXiv Detail & Related papers (2025-01-17T09:27:08Z) - PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model [30.620582168350698]
Mixture-of-Experts (MoE) has emerged as a powerful approach for scaling transformers with improved resource utilization.
Inspired by recent works on Parameter-Efficient Fine-Tuning (PEFT), we present a unified framework for integrating PEFT modules directly into the MoE mechanism.
arXiv Detail & Related papers (2024-11-12T22:03:37Z) - ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts [71.11994027685974]
We study the potential of applying MoE to vision through a comprehensive study on image classification and semantic segmentation.
We observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design.
We introduce a shared expert to learn and capture common knowledge, serving as an effective way to construct stable ViMoE.
arXiv Detail & Related papers (2024-10-21T07:51:17Z) - Scalable Multi-Domain Adaptation of Language Models using Modular Experts [10.393155077703653]
MoDE is a mixture-of-experts architecture that augments a general PLM with modular, domain-specialized experts.
MoDE matches the target performance of full parameter fine-tuning while achieving 1.65% better retention performance.
arXiv Detail & Related papers (2024-10-14T06:02:56Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of its performance while delivering over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z) - Enhancing Fast Feed Forward Networks with Load Balancing and a Master Leaf Node [49.08777822540483]
Fast feedforward networks (FFFs) exploit the observation that different regions of the input space activate distinct subsets of neurons in wide networks.
We propose the incorporation of load balancing and Master Leaf techniques into the FFF architecture to improve performance and simplify the training process.
arXiv Detail & Related papers (2024-05-27T05:06:24Z) - Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution, with its sparse architecture enabling effective task decoupling.
Intuition-MoR1E achieves superior efficiency and a 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z) - Efficient Deweather Mixture-of-Experts with Uncertainty-aware Feature-wise Linear Modulation [44.43376913419967]
We propose an efficient Mixture-of-Experts (MoE) architecture with weight sharing across experts.
MoFME implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block.
Experiments show that our MoFME outperforms the baselines in image restoration quality by 0.1-0.2 dB.
arXiv Detail & Related papers (2023-12-27T15:23:37Z) - Training Deep Energy-Based Models with f-Divergence Minimization [113.97274898282343]
Deep energy-based models (EBMs) are very flexible in distribution parametrization but computationally challenging.
We propose a general variational framework termed f-EBM to train EBMs using any desired f-divergence.
Experimental results demonstrate the superiority of f-EBM over contrastive divergence, as well as the benefits of training EBMs using f-divergences other than KL.
arXiv Detail & Related papers (2020-03-06T23:11:13Z)