Sparse Backpropagation for MoE Training
- URL: http://arxiv.org/abs/2310.00811v1
- Date: Sun, 1 Oct 2023 22:43:57 GMT
- Title: Sparse Backpropagation for MoE Training
- Authors: Liyuan Liu and Jianfeng Gao and Weizhu Chen
- Abstract summary: We introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing.
Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations.
Applied to Switch Transformer on both pre-training and machine translation tasks, SparseMixer delivers considerable performance gains.
- Score: 118.31785160874024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One defining characteristic of Mixture-of-Expert (MoE) models is their
capacity for conducting sparse computation via expert routing, leading to
remarkable scalability. However, backpropagation, the cornerstone of deep
learning, requires dense computation, thereby posing challenges in MoE
gradient computations. Here, we introduce SparseMixer, a scalable gradient
estimator that bridges the gap between backpropagation and sparse expert
routing. Unlike typical MoE training which strategically neglects certain
gradient terms for the sake of sparse computation and scalability, SparseMixer
provides scalable gradient approximations for these terms, enabling reliable
gradient estimation in MoE training. Grounded in a numerical ODE framework,
SparseMixer harnesses the mid-point method, a second-order ODE solver, to
deliver precise gradient approximations with negligible computational overhead.
Applied to Switch Transformer on both pre-training and machine translation
tasks, SparseMixer shows considerable performance gains, accelerating training
convergence by up to 2 times.
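To give a flavour of the numerical-ODE framing: estimating the routing-gradient terms that sparse training usually neglects roughly amounts to comparing the losses the model would incur under different expert choices, and such a loss difference can be approximated the way an ODE solver approximates a step. The toy NumPy sketch below illustrates the numerical idea only, not the paper's SparseMixer estimator; the loss L and the stand-in outputs a and b are invented for the example.
```python
import numpy as np

# Toy comparison of a first-order (Euler-style) and a second-order (mid-point)
# approximation of a loss difference L(b) - L(a). L, a, and b are invented;
# this is not SparseMixer itself, only the numerical idea it builds on.

def L(z):                      # an arbitrary smooth scalar loss
    return np.log1p(z ** 2)

def dL(z):                     # its derivative
    return 2 * z / (1 + z ** 2)

a, b = 0.7, 1.3                # stand-ins for two alternative expert outputs

exact    = L(b) - L(a)
euler    = dL(a) * (b - a)             # first order: derivative at one endpoint
midpoint = dL((a + b) / 2) * (b - a)   # second order: derivative at the midpoint

print(f"exact      {exact:+.6f}")
print(f"euler      {euler:+.6f}   error {abs(euler - exact):.2e}")
print(f"mid-point  {midpoint:+.6f}   error {abs(midpoint - exact):.2e}")
```
In this toy the mid-point rule is roughly three times more accurate than the Euler rule at the same cost of a single derivative evaluation, which mirrors the abstract's claim of precise gradient approximations with negligible overhead.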
Related papers
- Learning Mixtures of Experts with EM [28.48469221248906]
Mixtures of Experts (MoE) are Machine Learning models that involve partitioning the input space, with a separate "expert" model trained on each partition.
We study the efficiency of the Expectation Maximization (EM) algorithm for the training of MoE models.
arXiv Detail & Related papers (2024-11-09T03:44:09Z)
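As a brief illustration of the EM recipe for expert models (a toy sketch, not the analysis from the paper above: the gate is simplified to a global mixing weight rather than an input-dependent router, and the data, shapes, and iteration count are invented), the loop below alternates an E-step that computes each expert's responsibility for each point with an M-step that refits each expert by weighted least squares.
```python
import numpy as np

# Minimal EM for a mixture of linear regression experts. The gate here is a
# global mixing weight rather than an input-dependent router, purely to keep
# the sketch short; all data and hyperparameters are synthetic.

rng = np.random.default_rng(0)

# Synthetic data generated by two linear "experts".
n, d, K = 400, 2, 2
X = np.c_[np.ones(n), rng.normal(size=(n, d - 1))]
true_w = np.array([[2.0, 1.0], [-1.0, 3.0]])
z = rng.integers(0, K, size=n)
y = np.einsum('nd,nd->n', X, true_w[z]) + 0.3 * rng.normal(size=n)

# Random initialisation of expert weights, noise variances, mixing weights.
w = rng.normal(size=(K, d))
sigma2 = np.ones(K)
pi = np.full(K, 1.0 / K)

for _ in range(50):
    # E-step: responsibility of each expert for each data point.
    resid = y[:, None] - X @ w.T                                   # (n, K)
    log_lik = -0.5 * (resid ** 2 / sigma2 + np.log(2 * np.pi * sigma2))
    log_post = np.log(pi) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)
    r = np.exp(log_post)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: weighted least squares per expert, then noise and mixing weights.
    for k in range(K):
        W = r[:, k]
        A = X.T @ (W[:, None] * X)
        b = X.T @ (W * y)
        w[k] = np.linalg.solve(A, b)
        sigma2[k] = (W * (y - X @ w[k]) ** 2).sum() / W.sum()
    pi = r.mean(axis=0)

print("estimated expert weights:\n", np.round(w, 2))
print("mixing weights:", np.round(pi, 2))
```
With an input-dependent gate, the M-step for the gating network additionally involves a weighted softmax-regression update, which the global mixing weight here sidesteps.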
- Stepping Forward on the Last Mile [8.756033984943178]
We propose a series of algorithm enhancements that further reduce the memory footprint and the accuracy gap compared to backpropagation.
Our results demonstrate that on the last mile of model customization on edge devices, training with fixed-point forward gradients is a feasible and practical approach.
arXiv Detail & Related papers (2024-11-06T16:33:21Z)
- Gradient-free variational learning with conditional mixture networks [39.827869318925494]
Conditional mixture networks (CMNs) are suitable for fast, gradient-free inference and can solve complex classification tasks.
We validate this approach by training two-layer CMNs on standard benchmarks from the UCI repository.
Our method, CAVI-CMN, achieves competitive and often superior predictive accuracy compared to maximum likelihood estimation (MLE) with backpropagation.
arXiv Detail & Related papers (2024-08-29T10:43:55Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
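To make the activity-perturbation idea concrete, the sketch below (a minimal illustration with invented shapes and data, not the local-losses method of the paper above; central finite differences stand in for the forward-mode JVP a real implementation would use) perturbs a small network's first-layer pre-activations along random directions, measures the directional derivative of the loss along each direction, and scales the directions by it to estimate the gradient.
```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer regression network: z = W1 @ x, h = relu(z), y = W2 @ h.
d_in, d_hid, d_out = 4, 8, 1
W1 = rng.normal(scale=0.5, size=(d_hid, d_in))
W2 = rng.normal(scale=0.5, size=(d_out, d_hid))
x = rng.normal(size=d_in)
t = np.array([1.0])                      # regression target

def loss_from_preact(z):
    """Run the rest of the network from the first pre-activation z."""
    h = np.maximum(z, 0.0)
    y = W2 @ h
    return 0.5 * np.sum((y - t) ** 2)

z = W1 @ x
eps, n_dirs = 1e-4, 256
g_z = np.zeros(d_hid)
for _ in range(n_dirs):
    u = rng.normal(size=d_hid)           # random perturbation of the activations
    # Directional derivative dL/dz . u via central differences; in practice a
    # forward-mode JVP gives this quantity in one extra forward pass.
    dir_deriv = (loss_from_preact(z + eps * u) - loss_from_preact(z - eps * u)) / (2 * eps)
    g_z += dir_deriv * u                 # unbiased estimate of dL/dz
g_z /= n_dirs

# Chain rule through z = W1 @ x turns the activation-gradient estimate into a
# weight-gradient estimate.
g_W1 = np.outer(g_z, x)

# Exact backprop gradient for comparison.
h = np.maximum(z, 0.0)
y = W2 @ h
dz_exact = (W2.T @ (y - t)) * (z > 0)
gW1_exact = np.outer(dz_exact, x)
cos = (g_W1 * gW1_exact).sum() / (np.linalg.norm(g_W1) * np.linalg.norm(gW1_exact) + 1e-12)
print("cosine similarity to the backprop gradient:", round(float(cos), 3))
```
Averaging over many directions is only for the demonstration; a practical forward-gradient step uses one or a few directions per update, and the resulting variance is exactly what perturbing activations instead of weights, together with local losses, is meant to reduce.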
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on connectionist temporal classification (CTC).
We show that a Transformer-CTC model can be pruned to various depths on demand, improving the real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate more stable and better-performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
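For context on what gradient estimation means in this setting, the PyTorch sketch below shows the standard straight-through baseline for a sign activation; this is a common reference point rather than the sample-analytic estimator proposed in the paper above, and the shapes and toy loss are invented.
```python
import torch

# Straight-through (ST) baseline for binary activations: the forward pass uses
# a hard sign, while the backward pass pretends the activation was a clipped
# identity so gradients can flow. Not the estimator from the paper above.

class SignST(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass the gradient through where |x| <= 1, zero it elsewhere.
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)

x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 3, requires_grad=True)
h = SignST.apply(x)            # binary activations in the forward pass
loss = (h @ w).pow(2).mean()   # arbitrary toy loss
loss.backward()                # ST gradients reach both x and w
print(x.grad.shape, w.grad.shape)
```
The ST backward pass is cheap but biased, and it is this estimation gap that more refined sample-analytic estimators target.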
- Training Gaussian Boson Sampling Distributions [0.0]
We derive analytical gradient formulas for the GBS distribution, which can be used to train devices using standard methods.
In the case of training using a Kullback-Leibler divergence or log-likelihood cost function, we show that gradients can be computed classically, leading to fast training.
arXiv Detail & Related papers (2020-04-09T18:53:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.