DSelect-k: Differentiable Selection in the Mixture of Experts with
Applications to Multi-Task Learning
- URL: http://arxiv.org/abs/2106.03760v2
- Date: Wed, 9 Jun 2021 15:25:04 GMT
- Title: DSelect-k: Differentiable Selection in the Mixture of Experts with
Applications to Multi-Task Learning
- Authors: Hussein Hazimeh, Zhe Zhao, Aakanksha Chowdhery, Maheswaran
Sathiamoorthy, Yihua Chen, Rahul Mazumder, Lichan Hong, Ed H. Chi
- Abstract summary: State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example.
We develop DSelect-k: the first, continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation.
Our experiments indicate MoE models based on DSelect-k can achieve statistically significant improvements in predictive and expert selection performance.
- Score: 17.012443240520625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Mixture-of-experts (MoE) architecture is showing promising results in
multi-task learning (MTL) and in scaling high-capacity neural networks.
State-of-the-art MoE models use a trainable sparse gate to select a subset of
the experts for each input example. While conceptually appealing, existing
sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to
convergence and statistical performance issues when training with
gradient-based methods. In this paper, we develop DSelect-k: the first,
continuously differentiable and sparse gate for MoE, based on a novel binary
encoding formulation. Our gate can be trained using first-order methods, such
as stochastic gradient descent, and offers explicit control over the number of
experts to select. We demonstrate the effectiveness of DSelect-k in the context
of MTL, on both synthetic and real datasets with up to 128 tasks. Our
experiments indicate that MoE models based on DSelect-k can achieve
statistically significant improvements in predictive and expert selection
performance. Notably, on a real-world large-scale recommender system, DSelect-k
achieves over 22% average improvement in predictive performance compared to the
Top-k gate. We provide an open-source TensorFlow implementation of our gate.
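The abstract describes the gate only at a high level. Below is a minimal NumPy sketch of the binary-encoding idea as the abstract describes it: a smooth step function maps a handful of real code variables to per-bit weights, whose product selects one expert differentiably, and k such selectors are mixed with a softmax. The function names, the cubic form of the step, and the exact way the k selectors are combined are illustrative assumptions, not the authors' open-source TensorFlow implementation.

```python
import numpy as np

def smooth_step(t, gamma=1.0):
    """C^1 'smooth step': 0 for t <= -gamma/2, 1 for t >= gamma/2,
    with a cubic interpolation in between (an assumed, illustrative form)."""
    t = np.asarray(t, dtype=float)
    cubic = (-2.0 / gamma**3) * t**3 + (3.0 / (2.0 * gamma)) * t + 0.5
    return np.where(t <= -gamma / 2, 0.0, np.where(t >= gamma / 2, 1.0, cubic))

def single_expert_selector(z, n_experts, gamma=1.0):
    """Turn m real code variables z into one weight per expert.

    Expert l is addressed by the binary code of its index: bit j contributes
    S(z_j) if the bit is 1 and 1 - S(z_j) if it is 0. Once every S(z_j)
    saturates at 0 or 1, exactly one expert receives weight 1 (one-hot selection).
    """
    s = smooth_step(np.asarray(z, dtype=float), gamma)
    n_bits = len(s)
    weights = np.ones(n_experts)
    for l in range(n_experts):
        for j in range(n_bits):
            bit = (l >> j) & 1
            weights[l] *= s[j] if bit else 1.0 - s[j]
    return weights

def dselect_k_style_gate(Z, alpha, n_experts, gamma=1.0):
    """Mix k single-expert selectors (rows of Z, shape (k, m)) with a softmax
    over the logits alpha (shape (k,)), giving a sparse yet differentiable
    distribution over the experts. Name and combination rule are hypothetical."""
    mix = np.exp(alpha - np.max(alpha))
    mix /= mix.sum()
    gate = np.zeros(n_experts)
    for i, z in enumerate(Z):
        gate += mix[i] * single_expert_selector(z, n_experts, gamma)
    return gate

# Example: select (at most) k = 2 of n = 8 experts; each selector uses
# m = log2(8) = 3 code variables. All parameters here would normally be
# trainable (or produced from the input for per-example gating).
rng = np.random.default_rng(0)
Z = rng.normal(size=(2, 3))
alpha = rng.normal(size=2)
gate = dselect_k_style_gate(Z, alpha, n_experts=8)
print(gate, gate.sum())  # weights sum to 1; entries vanish as S(z_j) saturates
```

In an actual MoE layer, these gate weights would multiply the experts' outputs; unlike a hard Top-k gate, every operation above is continuous, so the selection pattern itself receives gradients from first-order methods.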
Related papers
- Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs [18.242110417706]
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model.
We show the optimality of this approach for fine-tuning tasks under certain conditions.
Our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour.
arXiv Detail & Related papers (2024-05-05T00:08:00Z)
- Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts [104.9871176044644]
Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training.
We propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE).
MoCE trains each expert only with semantically relevant images by using cluster-conditional gates.
arXiv Detail & Related papers (2024-02-08T03:46:32Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality and instead explicitly models how the learning process uses training datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection aims to choose a subset of the training data, referred to as a coreset, that maximizes the performance of models trained on it.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over a graph representation of the dataset for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
- Cramer Type Distances for Learning Gaussian Mixture Models by Gradient Descent [0.0]
As of today, few known algorithms can fit or learn Gaussian mixture models (GMMs).
We propose a distance function called the Sliced Cramér 2-distance for learning general multivariate GMMs.
These features are especially useful for distributional reinforcement learning and Deep Q Networks.
arXiv Detail & Related papers (2023-07-13T13:43:02Z)
- COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search [10.003251119927222]
The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity in various domains.
Existing sparse gates are prone to convergence and performance issues when training with first-order optimization methods.
We propose a new sparse gate: COMET, which relies on a novel tree-based mechanism.
arXiv Detail & Related papers (2023-06-05T12:21:42Z)
- MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3\times$-$10\times$ faster and tune hyperparameters $20\times$-$75\times$ faster than full-dataset training or tuning, without degrading performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z)
- RoMA: Robust Model Adaptation for Offline Model-based Optimization [115.02677045518692]
We consider the problem of searching an input maximizing a black-box objective function given a static dataset of input-output queries.
A popular approach to solving this problem is maintaining a proxy model that approximates the true objective function.
Here, the main challenge is how to avoid adversarially optimized inputs during the search.
arXiv Detail & Related papers (2021-10-27T05:37:12Z)
- Training with Multi-Layer Embeddings for Model Reduction [0.9046327456472286]
We introduce a multi-layer embedding training architecture that trains embeddings via a sequence of linear layers.
We show that it allows reducing the embedding dimension d by 4-8x, with a corresponding improvement in memory footprint, at a given model accuracy.
arXiv Detail & Related papers (2020-06-10T02:47:40Z)
- Stepwise Model Selection for Sequence Prediction via Deep Kernel Learning [100.83444258562263]
We propose a novel Bayesian optimization (BO) algorithm to tackle the challenge of model selection in the sequence-prediction setting.
In order to solve the resulting multiple black-box function optimization problem jointly and efficiently, we exploit potential correlations among black-box functions.
We are the first to formulate the problem of stepwise model selection (SMS) for sequence prediction, and to design and demonstrate an efficient joint-learning algorithm for this purpose.
arXiv Detail & Related papers (2020-01-12T09:42:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.