Related papers: Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts

URL: http://arxiv.org/abs/2601.03577v1
Date: Wed, 07 Jan 2026 04:45:07 GMT
Title: Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts
Authors: Ye Su, Yong Liu
Abstract summary: Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input. We build the first unified theoretical framework that derives these practices as optimal posterior approximation and prior regularization from a Bayesian perspective. Our work offers essential theoretical support and technical assurance for a deeper understanding and novel designs of MoE.
Score: 11.888882732753922
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input. Their core mechanisms, Top-k routing and auxiliary load balancing, remain heuristic, however, lacking a cohesive theoretical underpinning to support them. To this end, we build the first unified theoretical framework that rigorously derives these practices as optimal sparse posterior approximation and prior regularization from a Bayesian perspective, while simultaneously framing them as mechanisms to minimize routing ambiguity and maximize channel capacity from an information-theoretic perspective. We also pinpoint the inherent combinatorial hardness of routing, defining it as the NP-hard sparse subset selection problem. We rigorously prove the existence of a "Coherence Barrier"; when expert representations exhibit high mutual coherence, greedy routing strategies theoretically fail to recover the optimal expert subset. Importantly, we formally verify that imposing geometric orthogonality in the expert feature space is sufficient to narrow the divide between the NP-hard global optimum and polynomial-time greedy approximation. Our comparative analyses confirm orthogonality regularization as the optimal engineering relaxation for large-scale models. Our work offers essential theoretical support and technical assurance for a deeper understanding and novel designs of MoE.
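The quantities named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (top_k_route, mutual_coherence, orthogonality_penalty) are hypothetical, expert representations are assumed to be the rows of a matrix, and the penalty shown is one common form of orthogonality regularization (squared Frobenius distance between the normalized Gram matrix and the identity), chosen here only for illustration.

```python
import numpy as np

def top_k_route(logits, k):
    """Greedy Top-k routing: keep the k largest router logits per token
    and renormalize them with a softmax over the selected experts."""
    idx = np.argsort(logits, axis=-1)[:, -k:]              # indices of the k largest logits
    picked = np.take_along_axis(logits, idx, axis=-1)
    picked = picked - picked.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(picked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return idx, weights

def _normalized_gram(experts):
    """Gram matrix of the row-normalized expert vectors (assumed layout: one expert per row)."""
    unit = experts / np.linalg.norm(experts, axis=1, keepdims=True)
    return unit @ unit.T

def mutual_coherence(experts):
    """Largest absolute cosine similarity between two distinct experts."""
    gram = _normalized_gram(experts)
    np.fill_diagonal(gram, 0.0)                            # ignore self-similarity
    return float(np.abs(gram).max())

def orthogonality_penalty(experts):
    """Illustrative regularizer: squared Frobenius distance to the identity."""
    gram = _normalized_gram(experts)
    return float(np.sum((gram - np.eye(len(experts))) ** 2))
```

In this sketch, perfectly orthogonal experts give a coherence and a penalty of zero, while nearly parallel experts push the coherence toward one, which is the regime the abstract associates with the failure of greedy routing.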

Related papers

Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts [36.26786113564521]
Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one. Despite empirical successes, a unified theory for its effectiveness under heterogeneous fine-tuning hyperparameters remains missing. We use $L$-Stability theory to analyze the generalization of the merged model $\boldsymbol{x}_{avg}$.
arXiv Detail & Related papers (2026-01-29T13:22:06Z)
Doc2AHP: Inferring Structured Multi-Criteria Decision Models via Semantic Trees with LLMs [7.026862437055361]
We propose Doc2AHP, a novel structured inference framework guided by AHP principles. We introduce a multi-agent weighting mechanism coupled with an adaptive consistency optimization strategy to ensure the numerical consistency of weight allocation. Empirical results demonstrate that Doc2AHP not only empowers non-expert users to construct high-quality decision models from scratch but also significantly outperforms direct generative baselines in both logical completeness and downstream task accuracy.
arXiv Detail & Related papers (2026-01-23T06:20:23Z)
Token-Level LLM Collaboration via FusionRoute [60.72307345997823]
FusionRoute is a token-level multi-LLM collaboration framework. It selects the most suitable expert at each decoding step and contributes a complementary logit that refines or corrects the selected expert's next-token distribution. It outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning.
arXiv Detail & Related papers (2026-01-08T16:53:16Z)
How to Set the Learning Rate for Large-Scale Pre-training? [73.03133634525635]
We formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting Paradigm, we introduce a Scaling Law for the search factor, effectively reducing the search complexity from O(n^3) to O(n*C_D*C_) via predictive modeling. We extend the principles of $\mu$Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons.
arXiv Detail & Related papers (2026-01-08T15:55:13Z)
The Procrustean Bed of Time Series: The Optimization Bias of Point-wise Loss [53.542743390809356]
This paper aims to provide a first-principles analysis of the Expectation of Optimization Bias (EOB). Our analysis reveals a fundamental paradigm paradox: the more deterministic and structured the time series, the more severe the bias induced by a point-wise loss function. We present a concrete solution that simultaneously achieves both principles via DFT or DWT.
arXiv Detail & Related papers (2025-12-21T06:08:22Z)
CogDoc: Towards Unified thinking in Documents [53.41571589733423]
We propose a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization, followed by a high-resolution "Focused Thinking" phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning approach outperforms RL with Supervised Fine-Tuning (SFT). Specifically, we find that direct RL avoids the "policy conflict" observed in SFT.
arXiv Detail & Related papers (2025-12-14T12:14:17Z)
A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models [3.0247776995428945]
In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure.
arXiv Detail & Related papers (2025-12-03T16:00:02Z)
Deep Unfolding: Recent Developments, Theory, and Design Guidelines [99.63555420898554]
This article provides a tutorial-style overview of deep unfolding, a framework that transforms optimization algorithms into structured, trainable ML architectures. We review the foundations of optimization for inference and for learning, introduce four representative design paradigms for deep unfolding, and discuss the distinctive training schemes that arise from their iterative nature.
arXiv Detail & Related papers (2025-12-03T13:16:35Z)
CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning [14.337056020596465]
CoT-Space is a novel theoretical framework that recasts Large Language Model (LLM) reasoning from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. We show that the convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting.
arXiv Detail & Related papers (2025-09-04T09:02:16Z)
When More is Less: Understanding Chain-of-Thought Length in LLMs [51.631483479081645]
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. Longer CoTs are often presumed superior; this paper argues that longer is not always better.
arXiv Detail & Related papers (2025-02-11T05:28:59Z)
Generalized Schrödinger Bridge Matching [54.171931505066]
The Generalized Schrödinger Bridge (GSB) problem setup is prevalent in many scientific areas both within and without machine learning. We propose Generalized Schrödinger Bridge Matching (GSBM), a new matching algorithm inspired by recent advances. We show that such a generalization can be cast as solving conditional optimal control, for which variational approximations can be used.
arXiv Detail & Related papers (2023-10-03T17:42:11Z)
PAC-Chernoff Bounds: Understanding Generalization in the Interpolation Regime [6.645111950779666]
This paper introduces a distribution-dependent PAC-Chernoff bound that exhibits perfect tightness for interpolators. We present a unified theoretical framework revealing why certain interpolators show exceptional generalization, while others falter.
arXiv Detail & Related papers (2023-06-19T14:07:10Z)
Optimization on manifolds: A symplectic approach [127.54402681305629]
We propose a dissipative extension of Dirac's theory of constrained Hamiltonian systems as a general framework for solving optimization problems. Our class of (accelerated) algorithms is not only simple and efficient but also applicable to a broad range of contexts.
arXiv Detail & Related papers (2021-07-23T13:43:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.