Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
- URL: http://arxiv.org/abs/2503.02495v3
- Date: Tue, 23 Sep 2025 07:09:46 GMT
- Title: Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
- Authors: Yujiao Yang, Jing Lian, Linhui Li
- Abstract summary: We propose Union-of-Experts (UoE), which decomposes the transformer model into an equivalent group of experts. In language modeling tasks, UoE achieves an average reduction of 2.38 in perplexity compared to the best-performing MoE method. In image classification, it yields an average accuracy improvement of 1.75% over the best model.
- Score: 7.230514235208748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, conventional MoE architectures suffer from suboptimal coordination dynamics, where isolated expert operations expose the model to overfitting risks. Moreover, they have not been effectively extended to attention blocks, which limits further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes the transformer model into an equivalent group of experts and applies a hierarchical routing mechanism to allocate input subspaces to specialized experts. Our approach advances MoE design with four key innovations: (1) Constructing expert groups by partitioning non-MoE models into functionally equivalent specialists. (2) Developing a hierarchical routing paradigm that integrates patch-wise data selection and expert selection strategies. (3) Extending the MoE design to attention blocks. (4) Proposing a hardware-optimized parallelization scheme that exploits batched matrix multiplications for efficient expert computation. The experiments demonstrate that our UoE model surpasses Full Attention, state-of-the-art MoEs, and efficient transformers on several tasks across the image and natural language domains. In language modeling tasks, UoE achieves an average reduction of 2.38 in perplexity compared to the best-performing MoE method with only 76% of its FLOPs. On the Long Range Arena benchmark, it demonstrates an average score at least 0.68% higher than all comparison models, with only 50% of the FLOPs of the best MoE method. In image classification, it yields an average accuracy improvement of 1.75% over the best model while maintaining comparable FLOPs. The source code is available at https://github.com/YujiaoYang-work/UoE.
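To make the four ingredients above concrete, here is a minimal, hypothetical PyTorch sketch of a UoE-style FFN layer: tokens are first filtered by a patch-selection score, the selection is routed to a top-k subset of experts obtained by slicing one dense FFN, and all experts are evaluated with a single batched contraction. The norm-based patch score, every hyperparameter, and the einsum-as-batched-matmul shortcut are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

class UoEFFN(torch.nn.Module):
    """Hypothetical sketch of UoE-style hierarchical routing over an expert group."""
    def __init__(self, d_model=64, d_ff=256, n_experts=4, k_experts=2, k_patches=8):
        super().__init__()
        # Decompose one dense FFN into n_experts slices ("equivalent experts").
        self.w1 = torch.nn.Parameter(torch.randn(n_experts, d_model, d_ff // n_experts) * 0.02)
        self.w2 = torch.nn.Parameter(torch.randn(n_experts, d_ff // n_experts, d_model) * 0.02)
        self.router = torch.nn.Linear(d_model, n_experts)
        self.k_experts, self.k_patches = k_experts, k_patches

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, s, d = x.shape
        # Stage 1 (patch-wise data selection): keep the k tokens with largest norm.
        idx = x.norm(dim=-1).topk(self.k_patches, dim=1).indices         # (b, k)
        idx_e = idx.unsqueeze(-1).expand(-1, -1, d)
        sel = torch.gather(x, 1, idx_e)                                  # (b, k, d)
        # Stage 2 (expert selection): route the pooled selection to top experts.
        gate = F.softmax(self.router(sel.mean(1)), -1)                   # (b, E)
        top = gate.topk(self.k_experts, -1)
        mask = torch.zeros_like(gate).scatter(1, top.indices, top.values)
        # Batched contraction over all experts at once (einsum stands in for bmm);
        # inactive experts are zeroed by the mask.
        h = torch.einsum('bkd,edf->bekf', sel, self.w1).relu()
        y = torch.einsum('bekf,efd->bekd', h, self.w2)
        out = torch.einsum('bekd,be->bkd', y, mask)
        # Unselected tokens pass through unchanged in this sketch.
        return x + torch.zeros_like(x).scatter(1, idx_e, out)

x = torch.randn(2, 16, 64)
print(UoEFFN()(x).shape)   # torch.Size([2, 16, 64])
```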
Related papers
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping [52.02659589971978]
We propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. MoDES significantly enhances inference speed, improving prefilling time by 2.16$\times$ and decoding time by 1.26$\times$.
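The summary above does not spell out MoDES's skipping rule, so the following is only a hedged sketch of the general idea of training-free expert skipping: routed experts whose gate weight falls below a threshold are simply not executed. The threshold and function names are invented for illustration.

```python
import torch
import torch.nn.functional as F

def route_with_skipping(gate_logits, k=2, skip_threshold=0.1):
    """Return (indices, weights, keep-mask); skipped experts are never run."""
    probs = F.softmax(gate_logits, dim=-1)           # (tokens, n_experts)
    weights, indices = probs.topk(k, dim=-1)         # standard top-k routing
    keep = weights >= skip_threshold                 # drop low-confidence experts
    weights = torch.where(keep, weights, torch.zeros_like(weights))
    weights = weights / weights.sum(-1, keepdim=True).clamp_min(1e-9)
    return indices, weights, keep

logits = torch.randn(4, 8)                           # 4 tokens, 8 experts
idx, w, keep = route_with_skipping(logits)
print(keep.float().mean().item(), "fraction of expert calls kept")
```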
arXiv Detail & Related papers (2025-11-19T18:48:27Z) - Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models [1.7272658301768147]
MoE-MLA-RoPE is a novel architecture that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient language modeling. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations.
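Of the three components combined above, rotary position embeddings are the most self-contained to illustrate. The sketch below is the standard RoPE formulation applied to a query tensor, not code from this paper; the shapes and base frequency are the usual defaults.

```python
import torch

def apply_rope(x, base=10000.0):
    """Standard rotary position embedding; x: (batch, seq, heads, head_dim), even head_dim."""
    b, s, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)        # (half,)
    angles = torch.arange(s, dtype=x.dtype)[:, None] * freqs[None]     # (seq, half)
    cos = angles.cos()[None, :, None, :]                               # broadcast over batch/heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 16, 4, 32)
print(apply_rope(q).shape)   # torch.Size([1, 16, 4, 32])
```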
arXiv Detail & Related papers (2025-08-02T08:33:30Z) - ViLBench: A Suite for Vision-Language Process Reward Modeling [25.565912785217822]
This paper first benchmarks current vision large language models (VLLMs) as two types of reward models.
We introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals.
We provide a preliminary demonstration of a promising pathway towards bridging the gap between general VLLMs and reward models.
arXiv Detail & Related papers (2025-03-26T06:38:31Z) - VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs).
Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks.
For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose Read-ME, a novel framework that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
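A toy rendering of the activation-sparsity idea: record which FFN neurons fire on calibration data, cluster neurons by their firing patterns, and treat each cluster as an expert. Read-ME's actual extraction procedure is more involved; everything below (the k-means-style loop, the calibration set, the cluster counts) is assumed for illustration.

```python
import torch

torch.manual_seed(0)
d_model, d_ff, n_experts = 32, 128, 4
w1 = torch.randn(d_model, d_ff)                          # dense FFN first layer
calib = torch.randn(512, d_model)                        # calibration inputs

acts = (calib @ w1).relu()                               # (512, d_ff) neuron activations
pattern = (acts > 0).float().t()                         # (d_ff, 512) binary firing pattern

# K-means-style clustering: assign each neuron to the centroid it co-fires with most.
centroids = pattern[torch.randperm(d_ff)[:n_experts]]    # (n_experts, 512)
for _ in range(5):
    assign = (pattern @ centroids.t()).argmax(dim=1)     # (d_ff,)
    centroids = torch.stack([pattern[assign == e].mean(0) if (assign == e).any()
                             else centroids[e] for e in range(n_experts)])

# Each expert is the slice of w1 columns in its cluster.
experts = [w1[:, assign == e] for e in range(n_experts)]
print([e.shape[1] for e in experts], "neurons per expert")
```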
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
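A sketch of what "zero-computation experts" can look like: the router chooses among ordinary FFN experts plus an identity (copy) expert and a zero expert, neither of which costs a matmul. Hard top-1 routing and all dimensions here are illustrative simplifications, not the paper's exact design.

```python
import torch

class ZeroComputeMoE(torch.nn.Module):
    """Illustrative MoE layer with two free experts alongside FFN experts."""
    def __init__(self, d=32, d_ff=64, n_ffn=4):
        super().__init__()
        self.ffn = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(d, d_ff), torch.nn.ReLU(),
                                torch.nn.Linear(d_ff, d)) for _ in range(n_ffn))
        # Two extra "experts": index n_ffn = copy (identity), n_ffn+1 = zero.
        self.router = torch.nn.Linear(d, n_ffn + 2)
        self.n_ffn = n_ffn

    def forward(self, x):                        # x: (tokens, d)
        choice = self.router(x).argmax(dim=-1)   # hard top-1 routing for clarity
        out = torch.zeros_like(x)
        for e in range(self.n_ffn):              # FFN experts: real compute
            m = choice == e
            if m.any():
                out[m] = self.ffn[e](x[m])
        copy = choice == self.n_ffn
        out[copy] = x[copy]                      # copy expert: no matmul
        # zero expert (choice == n_ffn + 1) leaves zeros: also no matmul
        return out

x = torch.randn(10, 32)
print(ZeroComputeMoE()(x).shape)   # torch.Size([10, 32])
```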
arXiv Detail & Related papers (2024-10-09T18:01:27Z) - DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models [1.4255659581428335]
We propose DA-MoE, a novel approach that dynamically allocates a variable number of experts in Mixture-of-Experts models based on an effective token importance measure.
Our approach consistently outperforms the state-of-the-art Transformer-based MoE model on the popular GLUE benchmark.
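A hedged sketch of the allocation idea: derive a per-token importance score, map it to a variable expert count k, and mask a top-k_max routing down to each token's k. The summary above only says importance is token-based; the rank-based mapping below is an assumption.

```python
import torch
import torch.nn.functional as F

def variable_k_route(gate_logits, importance, k_min=1, k_max=4):
    """More important tokens are allocated more experts."""
    probs = F.softmax(gate_logits, dim=-1)                    # (tokens, E)
    # Rank tokens by importance into [0, 1], then map ranks to counts in [k_min, k_max].
    rank = importance.argsort().argsort().float() / (len(importance) - 1)
    k = (k_min + rank * (k_max - k_min)).round().long()       # (tokens,)
    weights, indices = probs.topk(k_max, dim=-1)
    keep = torch.arange(k_max)[None, :] < k[:, None]          # mask beyond each token's k
    weights = weights * keep
    return indices, weights / weights.sum(-1, keepdim=True)

logits, imp = torch.randn(6, 8), torch.rand(6)
idx, w = variable_k_route(logits, imp)
print((w > 0).sum(-1))   # experts allocated per token
```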
arXiv Detail & Related papers (2024-09-10T17:36:15Z) - BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts [41.83123857437985]
Training MoEs from scratch in a large-scale regime is prohibitively expensive.
We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming.
Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance.
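Upcycling itself is easy to show: every expert of the new MoE layer starts as a copy of the pretrained dense FFN, so the mixture initially reproduces the seed model and only then diverges under further training. BAM additionally upcycles attention experts; the sketch below covers only the FFN case, with illustrative dimensions.

```python
import copy
import torch

# "Pretrained" dense seed FFN standing in for a real checkpoint.
dense_ffn = torch.nn.Sequential(torch.nn.Linear(32, 128), torch.nn.ReLU(),
                                torch.nn.Linear(128, 32))
n_experts = 4
# Upcycling: seed every expert with the dense weights instead of random init.
experts = torch.nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(n_experts))
router = torch.nn.Linear(32, n_experts)        # trained after upcycling

x = torch.randn(5, 32)
# Before any further training, every expert reproduces the dense model exactly.
assert torch.allclose(experts[0](x), dense_ffn(x))
print("experts initialized from dense seed:", len(experts))
```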
arXiv Detail & Related papers (2024-08-15T17:19:12Z) - MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts [95.26323548734692]
MoMa is a modality-aware mixture-of-experts architecture for pre-training mixed-modal, early-fusion language models.
Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings.
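A minimal sketch of modality-aware expert groups: tokens are split by modality first, then routed only within that modality's pool of experts. Group sizes, the hard top-1 inner router, and the per-token loop are illustrative simplifications, not MoMa's implementation.

```python
import torch

class ModalityAwareMoE(torch.nn.Module):
    """Illustrative layer with separate expert groups for text and image tokens."""
    def __init__(self, d=32, n_text=4, n_image=4):
        super().__init__()
        self.text_experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_text))
        self.image_experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_image))
        self.text_router = torch.nn.Linear(d, n_text)
        self.image_router = torch.nn.Linear(d, n_image)

    def forward(self, x, is_image):              # x: (tokens, d); is_image: (tokens,) bool
        out = torch.empty_like(x)
        for mask, experts, router in [(~is_image, self.text_experts, self.text_router),
                                      (is_image, self.image_experts, self.image_router)]:
            toks = x[mask]
            if toks.shape[0] == 0:
                continue
            choice = router(toks).argmax(-1)     # hard top-1 within the modality group
            out[mask] = torch.stack([experts[c](t) for c, t in zip(choice.tolist(), toks)])
        return out

x = torch.randn(6, 32)
is_img = torch.tensor([0, 1, 0, 1, 1, 0], dtype=torch.bool)
print(ModalityAwareMoE()(x, is_img).shape)   # torch.Size([6, 32])
```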
arXiv Detail & Related papers (2024-07-31T17:46:51Z) - Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection [19.365009652356793]
Expert-Token Resonance (ETR) is a theoretically grounded bidirectional routing mechanism that reimagines expert-token interactions. ETR achieves 5.4%-46.6% improvements in end-to-end training efficiency compared to baseline MoE implementations.
arXiv Detail & Related papers (2024-05-24T02:50:44Z) - Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [33.834215393960605]
We introduce the Dynamic Mixture of Experts (DynMoE) technique to enhance the efficiency of training and inference for Transformer-based foundational models.
DynMoE incorporates a novel gating method that enables each token to automatically determine the number of experts to activate.
Our results demonstrate the effectiveness of our approach, which achieves performance competitive with GMoE on vision and language tasks and with MoE-LLaVA on vision-language tasks.
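A sketch of gating in this spirit: give every expert an independent sigmoid gate and activate all experts whose score clears a threshold, so the number of active experts varies per token. The threshold, the argmax fallback, and the renormalization are assumptions, not DynMoE's exact gate.

```python
import torch

def dynamic_gate(gate_logits, threshold=0.5):
    """Each token activates as many experts as clear the threshold."""
    scores = torch.sigmoid(gate_logits)          # (tokens, E): independent gates
    active = scores > threshold                  # variable count per token
    # Guarantee at least one expert per token by forcing the top-scoring one on.
    top1 = scores.argmax(-1, keepdim=True)
    force = torch.zeros_like(active).scatter(1, top1,
                                             torch.ones_like(top1, dtype=torch.bool))
    active = active | force
    weights = scores * active
    return active, weights / weights.sum(-1, keepdim=True)

logits = torch.randn(5, 8)
active, w = dynamic_gate(logits)
print(active.sum(-1))   # experts activated by each token
```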
arXiv Detail & Related papers (2024-05-23T08:18:30Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training [73.90260246781435]
We present Lory, the first approach that scales fully differentiable MoE architectures to autoregressive language model pre-training.
We show significant performance gains over parameter-matched dense models on both perplexity and a variety of downstream tasks.
Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing.
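The merging idea is compact enough to sketch: instead of dispatching tokens to experts, compute one soft routing vector per segment, average the expert weight matrices under it, and run a single dense FFN, keeping everything differentiable. The shapes, the mean-pooled router input, and the ReLU FFN are illustrative.

```python
import torch
import torch.nn.functional as F

def merged_ffn(segment, w1, w2, router):
    """segment: (seg_len, d); w1: (E, d, f); w2: (E, f, d)."""
    g = F.softmax(router(segment.mean(0)), dim=-1)          # (E,) soft, per segment
    m1 = torch.einsum('e,edf->df', g, w1)                    # merged first layer
    m2 = torch.einsum('e,efd->fd', g, w2)                    # merged second layer
    return (segment @ m1).relu() @ m2                        # one dense forward pass

E, d, f = 4, 32, 64
w1, w2 = torch.randn(E, d, f) * 0.02, torch.randn(E, f, d) * 0.02
router = torch.nn.Linear(d, E)
print(merged_ffn(torch.randn(16, d), w1, w2, router).shape)  # torch.Size([16, 32])
```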
arXiv Detail & Related papers (2024-05-06T03:06:33Z) - Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
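One common way to realize confidence-based dynamic routing, sketched under assumed details: add experts in order of router probability until the cumulative probability passes a threshold p, so tokens the router is unsure about receive more experts.

```python
import torch
import torch.nn.functional as F

def top_p_route(gate_logits, p=0.6, k_max=4):
    """Keep experts until cumulative router probability reaches p (capped at k_max)."""
    probs = F.softmax(gate_logits, dim=-1)
    sorted_p, order = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(-1)
    # Always keep the top expert, then all experts needed to reach p.
    keep = torch.cat([torch.ones_like(cum[:, :1], dtype=torch.bool),
                      cum[:, :-1] < p], dim=1)
    keep = keep & (torch.arange(probs.size(-1))[None, :] < k_max)
    weights = sorted_p * keep
    return order, weights / weights.sum(-1, keepdim=True)

logits = torch.randn(5, 8)
order, w = top_p_route(logits)
print((w > 0).sum(-1))   # expert count grows with router uncertainty
```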
arXiv Detail & Related papers (2024-03-12T13:41:15Z) - Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference [3.217776693788795]
We propose a lightweight optimization technique called ExFlow to substantially accelerate the inference of pre-trained MoE models.
By exploiting the inter-layer expert affinity, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
Our solution beats cutting-edge MoE implementations with expert counts from 8 to 64, achieving up to 2.2x improvement in inference throughput.
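A toy version of the affinity statistic: count how often a token routed to expert i at one layer lands on expert j at the next, then co-locate high-affinity pairs. The synthetic routing trace and the greedy pairing below stand in for ExFlow's actual placement algorithm.

```python
import torch

torch.manual_seed(0)
n_tokens, n_experts = 10000, 8
layer_a = torch.randint(n_experts, (n_tokens,))                    # routing at layer L
layer_b = (layer_a + torch.randint(2, (n_tokens,))) % n_experts    # correlated layer L+1

# Inter-layer affinity: co-occurrence counts of (expert at L, expert at L+1).
affinity = torch.zeros(n_experts, n_experts)
for i, j in zip(layer_a.tolist(), layer_b.tolist()):
    affinity[i, j] += 1

# Greedy placement hint: co-locate each layer-L expert with its most frequent successor.
placement = affinity.argmax(dim=1)
print(placement.tolist())   # layer-L expert -> preferred layer-(L+1) neighbor
```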
arXiv Detail & Related papers (2024-01-16T14:16:47Z) - An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study on a novel goal: merging vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
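The paper's two metrics are not reproduced here; as a stand-in, the sketch below computes one plausible weight-distance indicator, the cosine distance between flattened parameter vectors of two checkpoints, the intuition being that distant weights tend to merge poorly.

```python
import torch

def weight_cosine_distance(model_a, model_b):
    """Cosine distance between the flattened parameters of two same-shaped models."""
    va = torch.cat([p.flatten() for p in model_a.parameters()])
    vb = torch.cat([p.flatten() for p in model_b.parameters()])
    return 1 - torch.nn.functional.cosine_similarity(va, vb, dim=0)

a = torch.nn.Linear(16, 16)
b = torch.nn.Linear(16, 16)
print(float(weight_cosine_distance(a, b)))   # larger -> merging is likelier to degrade
```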
arXiv Detail & Related papers (2023-04-28T15:43:21Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
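The adapter half of the recipe is standard and easy to sketch: a small bottleneck MLP with a residual connection inserted into each frozen BERT layer. Dimensions are typical defaults; the layer-wise audio-visual fusion modules are not shown.

```python
import torch

class Adapter(torch.nn.Module):
    """Standard bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = torch.nn.Linear(d_model, bottleneck)
        self.up = torch.nn.Linear(bottleneck, d_model)

    def forward(self, hidden):
        # Only these small projections are trained; the BERT backbone stays frozen.
        return hidden + self.up(torch.relu(self.down(hidden)))

h = torch.randn(2, 10, 768)
print(Adapter()(h).shape)   # torch.Size([2, 10, 768])
```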
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks owing to its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z)