On Linear Mode Connectivity of Mixture-of-Experts Architectures
- URL: http://arxiv.org/abs/2509.11348v2
- Date: Sat, 25 Oct 2025 03:12:28 GMT
- Title: On Linear Mode Connectivity of Mixture-of-Experts Architectures
- Authors: Viet-Hoang Tran, Van Hoan Trinh, Khanh Vinh Bui, Tan M. Nguyen
- Abstract summary: We investigate the phenomenon of Linear Mode Connectivity (LMC) in Mixture-of-Experts architectures. LMC is a notable phenomenon in the loss landscapes of neural networks, wherein independently trained models have been observed to be connected, up to permutation symmetries, by linear paths in parameter space along which the loss remains consistently low.
- Score: 1.6747713135100666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes of neural networks, wherein independently trained models have been observed to be connected--up to permutation symmetries--by linear paths in parameter space along which the loss remains consistently low. This observation challenges classical views of non-convex optimization and has implications for model ensembling, generalization, and our understanding of neural loss geometry. Inspired by recent studies on LMC in standard neural networks, we systematically investigate this phenomenon within Mixture-of-Experts (MoE) architectures--a class of models known for their scalability and computational efficiency, which combine traditional neural networks--referred to as experts--through a learnable gating mechanism. We begin by conducting a comprehensive analysis of both dense and sparse gating regimes, demonstrating that the symmetries inherent to MoE architectures are fully characterized by permutations acting on both the expert components and the gating function. Building on these foundational findings, we propose a matching algorithm that enables alignment between independently trained MoEs, thereby facilitating the discovery of LMC. Finally, we empirically validate the presence of LMC using our proposed algorithm across diverse MoE configurations--including dense, sparse, and shared-expert variants--under a wide range of model settings and datasets of varying scales and modalities. Our results confirm the existence of LMC in MoE architectures and offer fundamental insights into the functional landscape and optimization dynamics of deep learning models.
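To make the matching-and-interpolation idea from the abstract concrete, here is a minimal numpy sketch under assumptions not stated above: a dense softmax gate, one-hidden-layer ReLU experts, a dot-product cost between flattened expert parameters, and Hungarian matching via scipy's linear_sum_assignment. It illustrates the general recipe of permuting experts together with the corresponding gating rows before interpolating; it is not the authors' exact matching algorithm.

```python
# Minimal sketch of an LMC check for a dense-gated MoE (assumed structure,
# not the paper's exact algorithm): align experts of model B to model A by
# permutation, permute the gating rows consistently, then scan the linear
# path between the two parameter vectors and record the loss.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_experts(moe_a, moe_b):
    """Permute the experts of moe_b to best match moe_a.

    Each MoE is a dict with:
      'experts': list of E dicts {'W1': (h, d), 'b1': (h,), 'W2': (o, h)}
      'gate':    (E, d) linear gating weights, one row per expert.
    Expert similarity is a dot product of flattened parameters; hidden-unit
    permutations inside each expert are ignored in this illustration.
    """
    E = len(moe_a['experts'])
    cost = np.zeros((E, E))
    for i in range(E):
        for j in range(E):
            va = np.concatenate([moe_a['experts'][i][k].ravel() for k in ('W1', 'b1', 'W2')])
            vb = np.concatenate([moe_b['experts'][j][k].ravel() for k in ('W1', 'b1', 'W2')])
            cost[i, j] = -np.dot(va, vb)          # negative similarity: minimize cost
    _, perm = linear_sum_assignment(cost)          # perm[i] = expert of B matched to expert i of A
    return {
        'experts': [moe_b['experts'][j] for j in perm],
        'gate': moe_b['gate'][perm],               # permute gating rows consistently
    }

def interpolate(moe_a, moe_b, t):
    """Linear interpolation (1 - t) * A + t * B in parameter space."""
    mix = lambda a, b: (1.0 - t) * a + t * b
    return {
        'experts': [{k: mix(ea[k], eb[k]) for k in ea}
                    for ea, eb in zip(moe_a['experts'], moe_b['experts'])],
        'gate': mix(moe_a['gate'], moe_b['gate']),
    }

def moe_forward(moe, X):
    """Dense-gated MoE forward pass: softmax gate over ReLU experts."""
    logits = X @ moe['gate'].T                               # (n, E)
    g = np.exp(logits - logits.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)
    outs = np.stack([np.maximum(X @ e['W1'].T + e['b1'], 0) @ e['W2'].T
                     for e in moe['experts']], axis=1)       # (n, E, o)
    return (g[..., None] * outs).sum(axis=1)                 # gate-weighted sum of experts

# Usage sketch: evaluate the loss along the linear path, e.g.
#   moe_b_aligned = align_experts(moe_a, moe_b)
#   losses = [np.mean((moe_forward(interpolate(moe_a, moe_b_aligned, t), X) - Y) ** 2)
#             for t in np.linspace(0.0, 1.0, 11)]
# A (near-)flat loss curve over t is the signature of linear mode connectivity.
```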
Related papers
- Hierarchical Inference and Closure Learning via Adaptive Surrogates for ODEs and PDEs [15.38864225184245]
Inverse problems are the task of calibrating models to match data. We develop a principled methodology for leveraging data from collections of distinct yet related physical systems. We learn the shared unknown dynamics in the form of an ML-based closure model.
arXiv Detail & Related papers (2026-03-04T10:30:08Z) - ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns [68.61814799047956]
Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. We introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations.
arXiv Detail & Related papers (2026-02-17T11:50:58Z) - Symmetry and Generalisation in Neural Approximations of Renormalisation Transformations [11.337632710839166]
We evaluate the role of symmetry and network expressivity in the generalisation behaviour of neural networks. We consider simple multilayer perceptrons (MLPs) and graph neural networks (GNNs). Our results reveal a competition between symmetry constraints and expressivity, with overly complex models generalising poorly.
arXiv Detail & Related papers (2025-10-18T17:29:23Z) - Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms [55.1784306456972]
Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. We use an internal metric to investigate the mechanisms of the MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. We uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity.
arXiv Detail & Related papers (2025-09-28T15:13:38Z) - DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [86.76714527437383]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - Generalized Factor Neural Network Model for High-dimensional Regression [50.554377879576066]
We tackle the challenges of modeling high-dimensional data sets with latent low-dimensional structures hidden within complex, non-linear, and noisy relationships. Our approach enables a seamless integration of concepts from non-parametric regression, factor models, and neural networks for high-dimensional regression.
arXiv Detail & Related papers (2025-02-16T23:13:55Z) - Learning Mixtures of Experts with EM: A Mirror Descent Perspective [28.48469221248906]
Classical Mixtures of Experts (MoE) are machine learning models that involve partitioning the input space, with a separate "expert" model trained on each partition. We study theoretical guarantees of the Expectation Maximization (EM) algorithm for the training of MoE models.
arXiv Detail & Related papers (2024-11-09T03:44:09Z) - Symmetry-enforcing neural networks with applications to constitutive modeling [0.0]
We show how to combine state-of-the-art micromechanical modeling and advanced machine learning techniques to homogenize complex microstructures exhibiting non-linear and history dependent behaviors.
The resulting homogenized model, termed smart constitutive law (SCL), enables the adoption of micromechanically informed constitutive laws into finite element solvers at a fraction of the computational cost required by traditional concurrent multiscale approaches.
In this work, the capabilities of SCLs are expanded via the introduction of a novel methodology that enforces material symmetries at the neuron level.
arXiv Detail & Related papers (2023-12-21T01:12:44Z) - An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the group O(d).
This nested system of two flows provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem.
arXiv Detail & Related papers (2020-06-19T22:05:19Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.