ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits
- URL: http://arxiv.org/abs/2601.13563v2
- Date: Wed, 21 Jan 2026 09:34:28 GMT
- Title: ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits
- Authors: Aryan Karmore
- Abstract summary: Current compression methods like quantization, pruning, and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified quantized substrate. Across language modeling benchmarks, ButterflyMoE achieves a 150$\times$ memory reduction at 256 experts with negligible accuracy loss.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts layers store $N$ independent expert weight matrices, requiring $\mathcal{O}(N \cdot d^2)$ memory that exceeds the budgets of edge devices. Current compression methods like quantization, pruning, and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing shared capacity from different angles, not from redundant storage. By applying learned rotations to a shared ternary prototype, ButterflyMoE stores all experts in $\mathcal{O}(d^2 + N \cdot d \log d)$ memory, sub-linear in the number of experts. The key insight is that training these rotations jointly with quantization reduces activation outliers and stabilizes extreme low-bit training, where static methods collapse. Across language modeling benchmarks, ButterflyMoE achieves a 150$\times$ memory reduction at 256 experts with negligible accuracy loss. ButterflyMoE allows multiple experts to fit on edge-constrained devices, showing that geometric parameterization breaks linear scaling.
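The abstract's parameter accounting can be illustrated with a minimal sketch. The shapes, the Givens-angle butterfly parameterization, and the `expert_forward` composition below are illustrative assumptions, not a reproduction of the authors' exact factorization:

```python
import numpy as np

def butterfly_apply(x, angles):
    """Apply an orthogonal butterfly transform (log2(d) stages of disjoint
    Givens rotations) to a length-d vector; d must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    d = x.shape[0]
    stages = int(np.log2(d))
    assert angles.shape == (stages, d // 2)
    for s in range(stages):
        stride = 1 << s
        idx = np.arange(d).reshape(-1, 2 * stride)
        top, bot = idx[:, :stride].ravel(), idx[:, stride:].ravel()
        c, t = np.cos(angles[s]), np.sin(angles[s])
        a, b = x[top].copy(), x[bot].copy()
        x[top] = c * a - t * b       # each pair gets a 2x2 rotation
        x[bot] = t * a + c * b
    return x

d, n_experts = 64, 256
rng = np.random.default_rng(0)
# one shared ternary substrate, reused by every expert
prototype = rng.integers(-1, 2, size=(d, d)).astype(float)
# per-expert butterfly angles: (d/2) * log2(d) scalars each
angles = rng.normal(size=(n_experts, int(np.log2(d)), d // 2))

def expert_forward(x, k):
    # rotate the input into expert k's orientation, then apply the prototype
    return prototype @ butterfly_apply(x, angles[k])

dense_params  = n_experts * d * d                               # O(N d^2)
shared_params = d * d + n_experts * (d // 2) * int(np.log2(d))  # O(d^2 + N d log d)
```

With these toy sizes the parameter-count ratio is already about 20x; the paper's 150x figure at 256 experts presumably also accounts for the ternary (~1.58-bit) storage of the prototype versus full-precision dense experts.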
Related papers
- Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook. Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z) - Quantum-inspired Benchmark for Estimating Intrinsic Dimension [2.0937431058291938]
Machine learning models can generalize well on real-world datasets. There exist many methods for ID estimation (IDE), but their estimates vary substantially. We propose a Quantum-Inspired Intrinsic-dimension Estimation (QuIIEst) benchmark.
arXiv Detail & Related papers (2025-10-01T18:03:02Z) - HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space [12.872890364287345]
HEAPr is a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts. It exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters. It outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks.
arXiv Detail & Related papers (2025-09-26T13:00:46Z) - ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms [21.010238822100135]
Large language models have massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers catastrophic performance loss due to outliers in activations. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles.
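The activation-outlier problem that both ButterflyQuant and ButterflyMoE target can be seen in a few lines. This is a toy illustration of the static Hadamard baseline that learnable butterflies replace, not either paper's method:

```python
import numpy as np

def hadamard(d):
    """Normalized Sylvester-Hadamard matrix; d must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)  # orthonormal: H @ H.T = I

x = np.zeros(16)
x[0] = 100.0                 # one extreme activation outlier
y = hadamard(16) @ x         # rotate before quantizing
# the outlier's energy spreads evenly: max |y| = 100 / sqrt(16) = 25
```

Because the rotation is orthogonal it preserves the vector's norm while shrinking its peak magnitude, so a low-bit quantizer's range is used far more efficiently.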
arXiv Detail & Related papers (2025-09-11T17:59:51Z) - Inertial Quadratic Majorization Minimization with Application to Kernel Regularized Learning [1.0282274843007797]
We introduce the Quadratic Majorization Minimization with Extrapolation (QMME) framework and establish its sequential convergence properties. To demonstrate practical advantages, we apply QMME to large-scale kernel regularized learning problems.
arXiv Detail & Related papers (2025-07-06T05:17:28Z) - Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification [35.6247241174615]
We propose a Squeeze-and-Recalibrate (SR) block, a drop-in replacement for linear layers in deep learning models. We provide theoretical guarantees that the SR block can approximate any linear mapping to arbitrary precision. Our SR-MIL models consistently outperform prior methods while requiring significantly fewer parameters and no architectural changes.
arXiv Detail & Related papers (2025-05-21T13:24:47Z) - BitStack: Any-Size Compression of Large Language Models in Variable Memory Environments [53.71158537264695]
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. We introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance.
arXiv Detail & Related papers (2024-10-31T13:26:11Z) - Mixture of Parrots: Experts improve memorization more than reasoning [72.445819694797]
We show that as we increase the number of experts, the memorization performance consistently increases while the reasoning capabilities saturate. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
arXiv Detail & Related papers (2024-10-24T17:54:41Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
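The tile-based idea can be sketched as an online log-sum-exp over column blocks, so the full similarity matrix is never materialized. The NumPy formulation and function name below are illustrative, not the paper's implementation:

```python
import numpy as np

def tiled_logsumexp_sim(q, k, tile=64):
    """Row-wise logsumexp of the similarity matrix q @ k.T, computed one
    (n x tile) block at a time with a running max for numerical stability."""
    n = q.shape[0]
    run_max = np.full(n, -np.inf)
    run_sum = np.zeros(n)
    for start in range(0, k.shape[0], tile):
        block = q @ k[start:start + tile].T          # only n*tile entries live
        new_max = np.maximum(run_max, block.max(axis=1))
        # rescale the running sum to the new max, then fold in this block
        run_sum = run_sum * np.exp(run_max - new_max) \
                  + np.exp(block - new_max[:, None]).sum(axis=1)
        run_max = new_max
    return run_max + np.log(run_sum)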
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - Towards large-scale quantum optimization solvers with few qubits [59.63282173947468]
We introduce a variational quantum solver for optimizations over $m=\mathcal{O}(n^k)$ binary variables using only $n$ qubits, with tunable $k>1$.
We analytically prove that the specific qubit-efficient encoding brings in a super-polynomial mitigation of barren plateaus as a built-in feature.
arXiv Detail & Related papers (2024-01-17T18:59:38Z) - Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions [22.09320263962004]
We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace.
We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner.
We show extensions of our work to Shampoo, resulting in a method competitive in quality with Shampoo and Adam, yet requiring only sub-linear memory for tracking second moments.
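The Frequent Directions primitive underlying Sketchy can be written directly. This is a textbook implementation of the streaming sketch, not the paper's Shampoo integration:

```python
import numpy as np

def frequent_directions(rows, ell):
    """Stream rows of an (n x d) matrix A into an (ell x d) sketch B with
    A^T A - B^T B positive semidefinite and spectral error <= 2*||A||_F^2/ell."""
    B = None
    for row in rows:
        if B is None:
            B = np.zeros((ell, len(row)))
        free = np.flatnonzero(~B.any(axis=1))
        if free.size == 0:
            # sketch full: shrink every direction by the smallest retained energy
            _, sig, Vt = np.linalg.svd(B, full_matrices=False)
            sig = np.sqrt(np.maximum(sig**2 - sig[-1]**2, 0.0))
            B = sig[:, None] * Vt    # last row becomes exactly zero
            free = np.flatnonzero(~B.any(axis=1))
        B[free[0]] = row
    return B
```

The shrink step is what makes the memory sub-linear: only `ell` directions are ever stored, yet the sketch provably tracks the dominant eigenspace of the covariance, which is exactly the concentration property the paper exploits for preconditioning.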
arXiv Detail & Related papers (2023-02-07T21:50:06Z) - Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z) - Spectral Learning on Matrices and Tensors [74.88243719463053]
We show that tensor decomposition can pick up latent effects that are missed by matrix methods.
We also outline computational techniques to design efficient tensor decomposition methods.
arXiv Detail & Related papers (2020-04-16T22:53:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.