Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient
for Convolutional Neural Networks
- URL: http://arxiv.org/abs/2306.04073v1
- Date: Wed, 7 Jun 2023 00:16:10 GMT
- Title: Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient
for Convolutional Neural Networks
- Authors: Mohammed Nowaz Rabbani Chowdhury, Shuai Zhang, Meng Wang, Sijia Liu
and Pin-Yu Chen
- Abstract summary: In deep learning, mixture-of-experts (MoE) activates one or few experts (sub-networks) on a per-sample or per-token basis.
We show for the first time that pMoE provably reduces the required number of training samples to achieve desirable generalization.
- Score: 74.68583356645276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In deep learning, mixture-of-experts (MoE) activates one or few experts
(sub-networks) on a per-sample or per-token basis, resulting in significant
computation reduction. The recently proposed \underline{p}atch-level routing in
\underline{MoE} (pMoE) divides each input into $n$ patches (or tokens) and
sends $l$ patches ($l\ll n$) to each expert through prioritized routing. pMoE
has demonstrated great empirical success in reducing training and inference
costs while maintaining test accuracy. However, the theoretical explanation of
pMoE and the general MoE remains elusive. Focusing on a supervised
classification task using a mixture of two-layer convolutional neural networks
(CNNs), we show for the first time that pMoE provably reduces the required
number of training samples to achieve desirable generalization (referred to as
the sample complexity) by a factor in the polynomial order of $n/l$, and
outperforms its single-expert counterpart of the same or even larger capacity.
The advantage results from the discriminative routing property, which is
justified in both theory and practice that pMoE routers can filter
label-irrelevant patches and route similar class-discriminative patches to the
same expert. Our experimental results on MNIST, CIFAR-10, and CelebA support
our theoretical findings on pMoE's generalization and show that pMoE can avoid
learning spurious correlations.
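As a concrete illustration of the routing scheme described in the abstract, here is a minimal numpy sketch of patch-level prioritized routing: each expert scores all $n$ patches of an input and processes only its top-$l$. The linear router, the tiny two-layer experts, and all dimensions are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch of patch-level MoE (pMoE) routing, assuming a linear
# per-expert router and tiny two-layer experts. Shapes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

n_patches, patch_dim = 16, 48   # n patches per input, each flattened to patch_dim
n_experts, l_per_expert = 4, 4  # each expert receives l << n patches
hidden = 32

# Router: one scoring vector per expert; experts: two-layer networks.
W_router = rng.normal(size=(n_experts, patch_dim)) / np.sqrt(patch_dim)
W1 = rng.normal(size=(n_experts, hidden, patch_dim)) / np.sqrt(patch_dim)
W2 = rng.normal(size=(n_experts, hidden)) / np.sqrt(hidden)

def pmoe_forward(patches):
    """patches: (n_patches, patch_dim) for one input sample."""
    scores = W_router @ patches.T                  # (n_experts, n_patches)
    logits = 0.0
    for e in range(n_experts):
        # Prioritized routing: expert e keeps only its l top-scoring patches.
        top = np.argsort(scores[e])[-l_per_expert:]
        gate = scores[e, top]                      # routing scores gate the output
        h = np.maximum(W1[e] @ patches[top].T, 0)  # (hidden, l) ReLU features
        logits += np.sum(gate * (W2[e] @ h))       # gated sum over selected patches
    return logits                                  # scalar output (e.g. a binary-classification logit)

x = rng.normal(size=(n_patches, patch_dim))        # one input split into patches
print(pmoe_forward(x))
```

The point of the sketch is that each expert touches only $l \ll n$ patches, which is the source of the computational savings and, per the paper's analysis, the reduction in sample complexity.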
Related papers
- Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
- LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
We propose a novel routing strategy that combines load balancing and locality by converting part of the inter-node communication into intra-node communication.
The proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers.
arXiv Detail & Related papers (2024-01-25T03:36:39Z) - On the Convergence of Federated Averaging under Partial Participation for Over-parameterized Neural Networks [13.2844023993979]
- On the Convergence of Federated Averaging under Partial Participation for Over-parameterized Neural Networks [13.2844023993979]
Federated learning (FL) is a widely used distributed paradigm for collaboratively training machine learning models across multiple clients without sharing local data.
In this paper, we show that FedAvg converges to a global minimum for over-parameterized neural networks, even under partial client participation.
arXiv Detail & Related papers (2023-10-09T07:56:56Z) - Langevin Thompson Sampling with Logarithmic Communication: Bandits and
- Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning [34.4255062106615]
Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance.
We propose batched \textit{Langevin Thompson Sampling} algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches.
Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for MABs and $\mathcal{O}(\sqrt{T})$ for RL.
arXiv Detail & Related papers (2023-06-15T01:16:29Z) - SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
- SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
arXiv Detail & Related papers (2022-12-10T03:44:16Z) - Towards Understanding Why Mask-Reconstruction Pretraining Helps in
- Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks [129.1080795985234]
Mask-reconstruction pretraining (MRP) approaches randomly mask input patches and then reconstruct pixels or semantic features of these masked patches via an auto-encoder.
For a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional supervised learning (SL) trained from scratch.
arXiv Detail & Related papers (2022-06-08T11:49:26Z) - StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
- StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z) - Permutation Compressors for Provably Faster Distributed Nonconvex
- Permutation Compressors for Provably Faster Distributed Nonconvex Optimization [68.8204255655161]
We show that the MARINA method of Gorbunov et al. (2021) can be considered state-of-the-art in terms of theoretical communication complexity.
We extend the theory of MARINA to support potentially correlated compressors, moving beyond the classical setting of independent compressors.
arXiv Detail & Related papers (2021-10-07T09:38:15Z)