ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits
- URL: http://arxiv.org/abs/2601.13563v2
- Date: Wed, 21 Jan 2026 09:34:28 GMT
- Title: ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits
- Authors: Aryan Karmore
- Abstract summary: Current compression methods like quantization, pruning, and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified quantized substrate. Across language modeling benchmarks, ButterflyMoE achieves a 150$\times$ memory reduction at 256 experts with negligible accuracy loss.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts layers store $N$ independent expert weight matrices, requiring $\mathcal{O}(N \cdot d^2)$ memory that exceeds the budgets of edge devices. Current compression methods like quantization, pruning, and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing shared capacity from different angles, not from redundant storage. By applying learned rotations to a shared ternary prototype, ButterflyMoE stores all experts in $\mathcal{O}(d^2 + N \cdot d \log d)$ memory, sub-linear in the number of experts. The key insight is that training these rotations jointly with quantization reduces activation outliers and stabilizes extreme low-bit training, where static methods collapse. Across language modeling benchmarks, ButterflyMoE achieves a 150$\times$ memory reduction at 256 experts with negligible accuracy loss. ButterflyMoE allows multiple experts to fit on edge-constrained devices, showing that geometric parameterization breaks linear scaling.
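The abstract's parameter accounting can be illustrated with a minimal sketch. The shapes, the Givens-angle butterfly parameterization, and the `expert_forward` composition below are illustrative assumptions, not a reproduction of the authors' exact factorization:

```python
import numpy as np

def butterfly_apply(x, angles):
    """Apply an orthogonal butterfly transform (log2(d) stages of disjoint
    Givens rotations) to a length-d vector; d must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    d = x.shape[0]
    stages = int(np.log2(d))
    assert angles.shape == (stages, d // 2)
    for s in range(stages):
        stride = 1 << s
        idx = np.arange(d).reshape(-1, 2 * stride)
        top, bot = idx[:, :stride].ravel(), idx[:, stride:].ravel()
        c, t = np.cos(angles[s]), np.sin(angles[s])
        a, b = x[top].copy(), x[bot].copy()
        x[top] = c * a - t * b       # each pair gets a 2x2 rotation
        x[bot] = t * a + c * b
    return x

d, n_experts = 64, 256
rng = np.random.default_rng(0)
# one shared ternary substrate, reused by every expert
prototype = rng.integers(-1, 2, size=(d, d)).astype(float)
# per-expert butterfly angles: (d/2) * log2(d) scalars each
angles = rng.normal(size=(n_experts, int(np.log2(d)), d // 2))

def expert_forward(x, k):
    # rotate the input into expert k's orientation, then apply the prototype
    return prototype @ butterfly_apply(x, angles[k])

dense_params  = n_experts * d * d                               # O(N d^2)
shared_params = d * d + n_experts * (d // 2) * int(np.log2(d))  # O(d^2 + N d log d)
```

With these toy sizes the parameter-count ratio is already about 20x; the paper's 150x figure at 256 experts presumably also accounts for the ternary (~1.58-bit) storage of the prototype versus full-precision dense experts.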
Related papers
- Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook. Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z) - Quantum-inspired Benchmark for Estimating Intrinsic Dimension [2.0937431058291938]
Machine learning models can generalize well on real-world datasets. There exist many methods for ID estimation (IDE), but their estimates vary substantially. We propose a Quantum-Inspired Intrinsic-dimension Estimation (QuIIEst) benchmark.
arXiv Detail & Related papers (2025-10-01T18:03:02Z) - HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space [12.872890364287345]
HEAPr is a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts. It exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters. It outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks.
arXiv Detail & Related papers (2025-09-26T13:00:46Z) - ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms [21.010238822100135]
Large language models have massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers catastrophic performance loss due to outliers in activations. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles.
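The activation-outlier problem that both ButterflyQuant and ButterflyMoE target can be seen in a few lines. This is a toy illustration of the static Hadamard baseline that learnable butterflies replace, not either paper's method:

```python
import numpy as np

def hadamard(d):
    """Normalized Sylvester-Hadamard matrix; d must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)  # orthonormal: H @ H.T = I

x = np.zeros(16)
x[0] = 100.0                 # one extreme activation outlier
y = hadamard(16) @ x         # rotate before quantizing
# the outlier's energy spreads evenly: max |y| = 100 / sqrt(16) = 25
```

Because the rotation is orthogonal it preserves the vector's norm while shrinking its peak magnitude, so a low-bit quantizer's range is used far more efficiently.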
arXiv Detail & Related papers (2025-09-11T17:59:51Z) - Inertial Quadratic Majorization Minimization with Application to Kernel Regularized Learning [1.0282274843007797]
We introduce the Quadratic Majorization Minimization with Extrapolation (QMME) framework and establish its sequential convergence properties. To demonstrate practical advantages, we apply QMME to large-scale kernel regularized learning problems.
arXiv Detail & Related papers (2025-07-06T05:17:28Z) - Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification [35.6247241174615]
We propose a Squeeze-and-Recalibrate (SR) block, a drop-in replacement for linear layers in deep learning models. We provide theoretical guarantees that the SR block can approximate any linear mapping to arbitrary precision. Our SR-MIL models consistently outperform prior methods while requiring significantly fewer parameters and no architectural changes.
arXiv Detail & Related papers (2025-05-21T13:24:47Z) - BitStack: Any-Size Compression of Large Language Models in Variable Memory Environments [53.71158537264695]
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. We introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance.
arXiv Detail & Related papers (2024-10-31T13:26:11Z) - Mixture of Parrots: Experts improve memorization more than reasoning [72.445819694797]
We show that as we increase the number of experts, the memorization performance consistently increases while the reasoning capabilities saturate. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
arXiv Detail & Related papers (2024-10-24T17:54:41Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
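The tile-based idea can be sketched as an online log-sum-exp over column blocks, so the full similarity matrix is never materialized. The NumPy formulation and function name below are illustrative, not the paper's implementation:

```python
import numpy as np

def tiled_logsumexp_sim(q, k, tile=64):
    """Row-wise logsumexp of the similarity matrix q @ k.T, computed one
    (n x tile) block at a time with a running max for numerical stability."""
    n = q.shape[0]
    run_max = np.full(n, -np.inf)
    run_sum = np.zeros(n)
    for start in range(0, k.shape[0], tile):
        block = q @ k[start:start + tile].T          # only n*tile entries live
        new_max = np.maximum(run_max, block.max(axis=1))
        # rescale the running sum to the new max, then fold in this block
        run_sum = run_sum * np.exp(run_max - new_max) \
                  + np.exp(block - new_max[:, None]).sum(axis=1)
        run_max = new_max
    return run_max + np.log(run_sum)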
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - Towards large-scale quantum optimization solvers with few qubits [59.63282173947468]
We introduce a variational quantum solver for optimizations over $m=\mathcal{O}(n^k)$ binary variables using only $n$ qubits, with tunable $k>1$.
We analytically prove that the specific qubit-efficient encoding brings in a super-polynomial mitigation of barren plateaus as a built-in feature.
arXiv Detail & Related papers (2024-01-17T18:59:38Z) - Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions [22.09320263962004]
We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace.
We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner.
We show extensions of our work to Shampoo, resulting in a method competitive in quality with Shampoo and Adam, yet requiring only sub-linear memory for tracking second moments.
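The Frequent Directions primitive underlying Sketchy can be written directly. This is a textbook implementation of the streaming sketch, not the paper's Shampoo integration:

```python
import numpy as np

def frequent_directions(rows, ell):
    """Stream rows of an (n x d) matrix A into an (ell x d) sketch B with
    A^T A - B^T B positive semidefinite and spectral error <= 2*||A||_F^2/ell."""
    B = None
    for row in rows:
        if B is None:
            B = np.zeros((ell, len(row)))
        free = np.flatnonzero(~B.any(axis=1))
        if free.size == 0:
            # sketch full: shrink every direction by the smallest retained energy
            _, sig, Vt = np.linalg.svd(B, full_matrices=False)
            sig = np.sqrt(np.maximum(sig**2 - sig[-1]**2, 0.0))
            B = sig[:, None] * Vt    # last row becomes exactly zero
            free = np.flatnonzero(~B.any(axis=1))
        B[free[0]] = row
    return B
```

The shrink step is what makes the memory sub-linear: only `ell` directions are ever stored, yet the sketch provably tracks the dominant eigenspace of the covariance, which is exactly the concentration property the paper exploits for preconditioning.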
arXiv Detail & Related papers (2023-02-07T21:50:06Z) - Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z) - Spectral Learning on Matrices and Tensors [74.88243719463053]
We show that tensor decomposition can pick up latent effects that are missed by matrix methods.
We also outline computational techniques to design efficient tensor decomposition methods.
arXiv Detail & Related papers (2020-04-16T22:53:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.