Soft Merging of Experts with Adaptive Routing
- URL: http://arxiv.org/abs/2306.03745v2
- Date: Mon, 13 May 2024 16:20:29 GMT
- Title: Soft Merging of Experts with Adaptive Routing
- Authors: Mohammed Muqeeth, Haokun Liu, Colin Raffel
- Abstract summary: We introduce Soft Merging of Experts with Adaptive Routing (SMEAR).
SMEAR avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters.
We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation.
- Score: 38.962451264172856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparsely activated neural networks with conditional computation learn to route their inputs through different "expert" subnetworks, providing a form of modularity that densely activated models lack. Despite their possible benefits, models with learned routing often underperform their parameter-matched densely activated counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train sparsely activated models that use non-differentiable discrete routing decisions. To address this issue, we introduce Soft Merging of Experts with Adaptive Routing (SMEAR), which avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters. By routing activations through a single merged expert, SMEAR does not incur a significant increase in computational costs and enables standard gradient-based training. We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation. Furthermore, we provide qualitative analysis demonstrating that the experts learned via SMEAR exhibit a significant amount of specialization. All of the code used in our experiments is publicly available.
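As a concrete illustration of the mechanism the abstract describes, the sketch below shows one way a SMEAR-style layer can be written in PyTorch: a router produces per-example softmax weights over the experts, the experts' parameters are averaged with those weights, and the input is passed through the single merged expert, so routing stays fully differentiable and only one expert's worth of compute is used. The layer sizes, the two-layer feed-forward expert shape, and the mean-pooled routing input are illustrative assumptions, not the authors' released implementation.

```python
# Minimal SMEAR-style sketch (illustrative, not the paper's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SMEARLayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # Each expert is a two-layer feed-forward block with identical shapes,
        # stored as stacked tensors so the parameters can be averaged directly.
        self.w_in = nn.Parameter(torch.randn(num_experts, d_model, d_hidden) * 0.02)
        self.b_in = nn.Parameter(torch.zeros(num_experts, d_hidden))
        self.w_out = nn.Parameter(torch.randn(num_experts, d_hidden, d_model) * 0.02)
        self.b_out = nn.Parameter(torch.zeros(num_experts, d_model))
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Route once per example from mean-pooled
        # activations; the softmax keeps routing differentiable end to end.
        probs = F.softmax(self.router(x.mean(dim=1)), dim=-1)  # (batch, num_experts)
        # Build one merged expert per example as a weighted average of parameters.
        w_in = torch.einsum("be,eij->bij", probs, self.w_in)
        b_in = torch.einsum("be,ej->bj", probs, self.b_in)
        w_out = torch.einsum("be,eij->bij", probs, self.w_out)
        b_out = torch.einsum("be,ej->bj", probs, self.b_out)
        # Single forward pass through the merged expert (one expert's worth of
        # compute regardless of the number of experts).
        h = F.relu(torch.einsum("bsd,bdh->bsh", x, w_in) + b_in.unsqueeze(1))
        return torch.einsum("bsh,bhd->bsd", h, w_out) + b_out.unsqueeze(1)


if __name__ == "__main__":
    layer = SMEARLayer(d_model=64, d_hidden=128, num_experts=8)
    out = layer(torch.randn(4, 10, 64))
    print(out.shape)  # torch.Size([4, 10, 64])
```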
Related papers
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce UNCURL, an adaptive task-aware pruning technique that reduces the number of experts per MoE layer offline, after training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z) - Performance Characterization of Expert Router for Scalable LLM Inference [0.4726677580049183]
Large Language Models (LLMs) have experienced widespread adoption across scientific and industrial domains.
However, deploying and serving these models at scale with optimal throughput and latency remains a significant challenge.
This paper introduces Expert Router, a scalable routing architecture that directs requests to specialized expert models.
arXiv Detail & Related papers (2024-04-22T16:33:42Z) - Domain Generalization Guided by Gradient Signal to Noise Ratio of
Parameters [69.24377241408851]
Overfitting to the source domain is a common issue in gradient-based training of deep neural networks.
We propose to base the selection on the gradient signal-to-noise ratio (GSNR) of the network's parameters (a minimal sketch of GSNR follows this entry).
arXiv Detail & Related papers (2023-10-11T10:21:34Z) - Phantom Embeddings: Using Embedding Space for Model Regularization in
- Phantom Embeddings: Using Embedding Space for Model Regularization in Deep Neural Networks [12.293294756969477]
The strength of machine learning models stems from their ability to learn complex function approximations from data.
Complex models tend to memorize the training data, which results in poor generalization on test data.
We present a novel approach to regularize the models by leveraging the information-rich latent embeddings and their high intra-class correlation.
arXiv Detail & Related papers (2023-04-14T17:15:54Z) - Model-Based Deep Learning: On the Intersection of Deep Learning and
Optimization [101.32332941117271]
Decision-making algorithms are used in a multitude of different applications.
Deep learning approaches that use highly parametric architectures tuned from data without relying on mathematical models are becoming increasingly popular.
Model-based optimization and data-centric deep learning are often considered to be distinct disciplines.
arXiv Detail & Related papers (2022-05-05T13:40:08Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Interpretable AI-based Large-scale 3D Pathloss Prediction Model for
enabling Emerging Self-Driving Networks [3.710841042000923]
We propose a machine learning-based model that leverages novel key predictors for estimating pathloss.
By quantitatively evaluating various ML algorithms in terms of predictive, generalization, and computational performance, our results show that the Light Gradient Boosting Machine (LightGBM) algorithm overall outperforms the others.
arXiv Detail & Related papers (2022-01-30T19:50:16Z) - Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to outrageously large numbers of parameters without a significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference (a minimal sketch follows this entry).
arXiv Detail & Related papers (2021-10-08T17:15:47Z) - Last Layer Marginal Likelihood for Invariance Learning [12.00078928875924]
- Last Layer Marginal Likelihood for Invariance Learning [12.00078928875924]
We introduce a new lower bound to the marginal likelihood, which allows us to perform inference for a larger class of likelihood functions.
We work towards bringing this approach to neural networks by using an architecture with a Gaussian process in the last layer.
arXiv Detail & Related papers (2021-06-14T15:40:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.