Improving MoE Compute Efficiency by Composing Weight and Data Sparsity
- URL: http://arxiv.org/abs/2601.15370v1
- Date: Wed, 21 Jan 2026 18:53:58 GMT
- Title: Improving MoE Compute Efficiency by Composing Weight and Data Sparsity
- Authors: Maciej Kilian, Oleg Mkrtchyan, Luke Zettlemoyer, Akshat Shrivastava, Armen Aghajanyan,
- Abstract summary: Mixture-of-Experts layers achieve compute efficiency through weight sparsity. Data sparsity, where each expert processes only a subset of tokens, offers a complementary axis.
- Score: 50.654297246411545
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts layers achieve compute efficiency through weight sparsity: each token activates only a subset of experts. Data sparsity, where each expert processes only a subset of tokens, offers a complementary axis. Expert-choice routing implements data sparsity directly but violates causality in autoregressive models, creating train-inference mismatch. We recover data sparsity within causal token-choice MoE by leveraging zero-compute (null) experts within the routing pool. When a token routes to null experts, those slots consume no compute. The standard load balancing objective trains the model to uniformly use all experts (real and null) therefore creating data sparsity in expectation without the causality violations. We evaluate on vision-language model training, where data heterogeneity is pronounced: vision encoders produce many low-information tokens while text tokens are denser. At matched expected FLOPs, composing weight and data sparsity yields a more compute-efficient frontier than weight sparsity alone, with gains in training loss and downstream performance. The model learns implicit modality-aware allocation, routing vision tokens to null experts more aggressively than text, without explicit modality routing.
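The null-expert mechanism described in the abstract can be sketched in a few lines. This is a minimal illustration under assumed details, not the authors' implementation: the function name `route_with_null_experts`, the pool sizes, and the NumPy-based top-k selection are all hypothetical.

```python
import numpy as np

def route_with_null_experts(logits, num_real, k):
    """Token-choice top-k routing over a pool of real and null experts.

    logits: (num_tokens, num_real + num_null) router scores.
    Returns the chosen expert ids per token and a mask marking the
    slots that hit a real expert; null-expert slots (ids >= num_real)
    would consume no compute in the forward pass.
    """
    topk = np.argsort(-logits, axis=-1)[:, :k]  # each token picks its own experts
    real_mask = topk < num_real                 # True where a slot does real compute
    return topk, real_mask

rng = np.random.default_rng(0)
num_tokens, num_real, num_null, k = 8, 4, 4, 2
logits = rng.normal(size=(num_tokens, num_real + num_null))
topk, real_mask = route_with_null_experts(logits, num_real, k)
# With a load-balanced router, the fraction of real slots approaches
# num_real / (num_real + num_null) = 0.5, i.e. data sparsity in expectation.
compute_fraction = real_mask.mean()
```

Because routing remains token-choice (each token selects its own top-k from its own scores), no information about other tokens is needed, which is how the paper avoids the causality violation of expert-choice routing.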
Related papers
- Train Once, Forget Precisely: Anchored Optimization for Efficient Post-Hoc Unlearning [0.0]
We introduce Forget-Aligned Model Reconstruction (FAMR), a theoretically grounded and computationally efficient framework for post-hoc unlearning in deep image classifiers. FAMR frames forgetting as a constrained optimization problem that minimizes a uniform-prediction loss on the forget set while anchoring model parameters to their original values. Empirical results on class forgetting tasks using CIFAR-10 and ImageNet-100 demonstrate FAMR's effectiveness, with strong performance retention and minimal computational overhead.
arXiv Detail & Related papers (2025-06-17T13:40:48Z)
- The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification. XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations. Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- DADAgger: Disagreement-Augmented Dataset Aggregation [0.0]
DAgger is an imitation learning algorithm that aggregates its training dataset by querying the expert on all samples encountered during training.
We propose a modification to DAgger, known as DADAgger, which only queries the expert for state-action pairs that are out of distribution.
arXiv Detail & Related papers (2023-01-03T20:44:14Z)
- How to Leverage Unlabeled Data in Offline Reinforcement Learning [125.72601809192365]
Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, requires reward annotations for every transition.
One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data.
We find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing.
arXiv Detail & Related papers (2022-02-03T18:04:54Z)
- BiFair: Training Fair Models with Bilevel Optimization [8.2509884277533]
We develop a new training algorithm, named BiFair, which jointly minimizes a utility loss and a fairness loss of interest.
Our algorithm consistently performs better, i.e., it reaches better values of a given fairness metric at the same or higher accuracy.
arXiv Detail & Related papers (2021-06-03T22:36:17Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To reduce the cost of training on the enlarged dataset, we propose a dataset distillation strategy that compresses it into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
- An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction [84.49035467829819]
We show that it is possible to better manage this trade-off by optimizing a bound on the Information Bottleneck (IB) objective.
Our fully unsupervised approach jointly learns an explainer that predicts sparse binary masks over sentences, and an end-task predictor that considers only the extracted rationale.
arXiv Detail & Related papers (2020-05-01T23:26:41Z)
- Regularization via Structural Label Smoothing [22.74769739125912]
Regularization is an effective way to promote the generalization performance of machine learning models.
In this paper, we focus on label smoothing, a form of output distribution regularization that prevents overfitting of a neural network.
We show that such label smoothing imposes a quantifiable bias in the Bayes error rate of the training data.
arXiv Detail & Related papers (2020-01-07T05:45:18Z)
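Standard label smoothing, the regularizer the last paper above analyzes, can be sketched in a few lines. This shows only the generic uniform scheme, not the paper's structural variant; the name `smooth_labels` and the value of `eps` are illustrative.

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Uniform label smoothing: mix one-hot targets with the uniform
    distribution. The true class keeps mass 1 - eps + eps/num_classes."""
    one_hot = np.eye(num_classes)[y]
    return (1.0 - eps) * one_hot + eps / num_classes

targets = smooth_labels(np.array([0, 2]), num_classes=3, eps=0.1)
# each row sums to 1; the true-class probability is 1 - eps + eps/num_classes
```

The structural variant studied in the paper departs from this single global `eps`; only the standard form is shown here.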
This list is automatically generated from the titles and abstracts of the papers in this site.