Related papers: Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective

URL: http://arxiv.org/abs/2601.11616v1
Date: Fri, 09 Jan 2026 23:07:14 GMT
Title: Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective
Authors: Feilong Liu
Abstract summary: Mixture-of-Experts (MoE) architectures are commonly motivated by efficiency and conditional computation. We study MoEs through a geometric lens, interpreting routing as a form of soft partitioning of the representation space into overlapping local charts.
Score: 0.5414847001704249
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mixture-of-Experts (MoE) architectures are commonly motivated by efficiency and conditional computation, but their effect on the geometry of learned functions and representations remains poorly characterized. In this work, we study MoEs through a geometric lens, interpreting routing as a form of soft partitioning of the representation space into overlapping local charts. We introduce a Dual Jacobian-PCA Spectral Geometry probe. It analyzes local function geometry via Jacobian singular-value spectra and representation geometry via weighted PCA of routed hidden states. Using a controlled MLP-MoE setting that permits exact Jacobian computation, we compare dense, Top-k, and fully-soft routing architectures under matched capacity. Across random seeds, we observe that MoE routing consistently reduces local sensitivity, with expert-local Jacobians exhibiting smaller leading singular values and faster spectral decay than dense baselines. At the same time, weighted PCA reveals that expert-local representations distribute variance across a larger number of principal directions, indicating higher effective rank under identical input distributions. We further find that average expert Jacobians are nearly orthogonal, suggesting a decomposition of the transformation into low-overlap expert-specific subspaces rather than scaled variants of a shared map. We analyze how routing sharpness modulates these effects, showing that Top-k routing produces lower-rank, more concentrated expert-local structure, while fully-soft routing yields broader, higher-rank representations. Together, these results support a geometric interpretation of MoEs as soft partitionings of function space that flatten local curvature while redistributing representation variance.
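
The two probes named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the one-hidden-layer tanh experts, the softmax gate, the layer sizes, the helper names (`expert_jacobian`, `weighted_pca_effective_rank`), and the participation-ratio definition of effective rank are all assumptions made for illustration, and the expert-local Jacobian deliberately ignores the gradient of the gate.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector of gate logits.
    e = np.exp(z - z.max())
    return e / e.sum()

def expert_jacobian(W1, W2, x):
    # Exact Jacobian of a hypothetical expert f(x) = W2 @ tanh(W1 @ x):
    # df/dx = W2 @ diag(1 - tanh(W1 x)^2) @ W1.
    h = np.tanh(W1 @ x)
    return W2 @ (np.diag(1.0 - h ** 2) @ W1)

def weighted_pca_effective_rank(H, w):
    # Weighted PCA of the rows of H with non-negative weights w.
    # Effective rank is taken here as the participation ratio
    # (sum of eigenvalues)^2 / (sum of squared eigenvalues); this
    # choice is an assumption, not necessarily the paper's measure.
    w = w / w.sum()
    mu = w @ H
    Hc = H - mu
    cov = (Hc * w[:, None]).T @ Hc
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return eig.sum() ** 2 / (eig ** 2).sum()

rng = np.random.default_rng(0)
d, hidden, n_experts, n = 8, 16, 4, 256   # illustrative sizes only

Wg = rng.normal(size=(n_experts, d))      # gate weights (assumed linear gate)
experts = [
    (rng.normal(size=(hidden, d)) / np.sqrt(d),
     rng.normal(size=(d, hidden)) / np.sqrt(hidden))
    for _ in range(n_experts)
]
X = rng.normal(size=(n, d))

# Function-side probe: singular-value spectrum of each expert-local
# Jacobian at one input point.
x0 = X[0]
spectra = [
    np.linalg.svd(expert_jacobian(W1, W2, x0), compute_uv=False)
    for W1, W2 in experts
]

# Representation-side probe: PCA of expert 0's hidden states, weighted
# by the soft routing probability each input assigns to expert 0.
gates = np.array([softmax(Wg @ x) for x in X])   # shape (n, n_experts)
H0 = np.tanh(X @ experts[0][0].T)                # shape (n, hidden)
eff_rank = weighted_pca_effective_rank(H0, gates[:, 0])

print("leading singular values:", [float(s[0]) for s in spectra])
print("effective rank of expert 0 hidden states:", float(eff_rank))
```

In this sketch a Top-k variant would amount to zeroing all but the k largest entries of each row of `gates` and renormalizing before passing the column to `weighted_pca_effective_rank`.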

Related papers

From Directions to Regions: Decomposing Activations in Language Models via Local Geometry [37.50120706345745]
We leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space.
arXiv Detail & Related papers (2026-02-02T18:49:05Z)
Understanding and Improving UMAP with Geometric and Topological Priors: The JORC-UMAP Algorithm [1.7484982792736636]
Dimensionality reduction techniques, particularly UMAP, are widely used for visualizing high-dimensional data. We introduce Ollivier-Ricci curvature as a geometric prior, reinforcing edges at geometric bottlenecks and reducing redundant links. Experiments on synthetic and real-world datasets show that JORC-UMAP reduces tearing and collapse more effectively than standard UMAP and other DR methods.
arXiv Detail & Related papers (2026-01-23T08:42:56Z)
TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning [104.66714520975837]
We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
arXiv Detail & Related papers (2026-01-23T07:35:05Z)
MS-ISSM: Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity [65.85858856481131]
The unstructured and irregular nature of point clouds poses a significant challenge for objective quality assessment (PCQA). We propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM).
arXiv Detail & Related papers (2026-01-03T14:58:52Z)
GeoGNN: Quantifying and Mitigating Semantic Drift in Text-Attributed Graphs [59.61242815508687]
Graph neural networks (GNNs) on text-attributed graphs (TAGs) encode node texts using pretrained language models (PLMs) and propagate these embeddings through linear neighborhood aggregation. This work introduces a local PCA-based metric that measures the degree of semantic drift and provides the first quantitative framework to analyze how different aggregation mechanisms affect manifold structure.
arXiv Detail & Related papers (2025-11-12T06:48:43Z)
Learning Overspecified Gaussian Mixtures Exponentially Fast with the EM Algorithm [5.625796693054093]
We investigate the convergence properties of the EM algorithm when applied to overspecified Gaussian mixture models. We demonstrate that the population EM algorithm converges exponentially fast in terms of the Kullback-Leibler (KL) distance.
arXiv Detail & Related papers (2025-06-13T14:57:57Z)
Learning Mixtures of Experts with EM: A Mirror Descent Perspective [28.48469221248906]
Classical Mixtures of Experts (MoE) are Machine Learning models that involve partitioning the input space, with a separate "expert" model trained on each partition. We study theoretical guarantees of the Expectation Maximization (EM) algorithm for the training of MoE models.
arXiv Detail & Related papers (2024-11-09T03:44:09Z)
IsUMap: Manifold Learning and Data Visualization leveraging Vietoris-Rips filtrations [0.08796261172196743]
We present a systematic and detailed construction of a metric representation for locally distorted metric spaces. Our approach addresses limitations in existing methods by accommodating non-uniform data distributions and intricate local geometries.
arXiv Detail & Related papers (2024-07-25T07:46:30Z)
RGM: A Robust Generalizable Matching Model [49.60975442871967]
We propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching). To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth. We are able to mix up various dense and sparse matching datasets, significantly improving the training diversity.
arXiv Detail & Related papers (2023-10-18T07:30:08Z)
Adaptive Spot-Guided Transformer for Consistent Local Feature Matching [64.30749838423922]
We propose Adaptive Spot-Guided Transformer (ASTR) for local feature matching. ASTR models the local consistency and scale variations in a unified coarse-to-fine architecture.
arXiv Detail & Related papers (2023-03-29T12:28:01Z)
Local Similarity Pattern and Cost Self-Reassembling for Deep Stereo Matching Networks [3.7384509727711923]
We introduce a pairwise feature for deep stereo matching networks, named LSP (Local Similarity Pattern). By explicitly revealing the neighbor relationships, LSP contains rich structural information, which can be leveraged to aid more discriminative feature description. Secondly, we design a dynamic self-reassembling refinement strategy and apply it to the cost distribution and the disparity map respectively.
arXiv Detail & Related papers (2021-12-02T06:52:54Z)
Making Affine Correspondences Work in Camera Geometry Computation [62.7633180470428]
Local features provide region-to-region rather than point-to-point correspondences. We propose guidelines for effective use of region-to-region matches in the course of a full model estimation pipeline. Experiments show that affine solvers can achieve accuracy comparable to point-based solvers at faster run-times.
arXiv Detail & Related papers (2020-07-20T12:07:48Z)
Augmented Parallel-Pyramid Net for Attention Guided Pose-Estimation [90.28365183660438]
This paper proposes an augmented parallel-pyramid net with attention partial module and differentiable auto-data augmentation. We define a new pose search space where the sequences of data augmentations are formulated as a trainable and operational CNN component. Notably, our method achieves the top-1 accuracy on the challenging COCO keypoint benchmark and the state-of-the-art results on the MPII datasets.
arXiv Detail & Related papers (2020-03-17T03:52:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.