COMET: Learning Cardinality Constrained Mixture of Experts with Trees
and Local Search
- URL: http://arxiv.org/abs/2306.02824v1
- Date: Mon, 5 Jun 2023 12:21:42 GMT
- Title: COMET: Learning Cardinality Constrained Mixture of Experts with Trees
and Local Search
- Authors: Shibal Ibrahim, Wenyu Chen, Hussein Hazimeh, Natalia Ponomareva, Zhe
Zhao, Rahul Mazumder
- Abstract summary: The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity in various domains.
Existing sparse gates are prone to convergence and performance issues when training with first-order optimization methods.
We propose a new sparse gate: COMET, which relies on a novel tree-based mechanism.
- Score: 10.003251119927222
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up
model capacity in various domains, such as natural language processing and
vision. Sparse-MoEs select a subset of the "experts" (thus, only a portion of
the overall network) for each input sample using a sparse, trainable gate.
Existing sparse gates are prone to convergence and performance issues when
training with first-order optimization methods. In this paper, we introduce two
improvements to current MoE approaches. First, we propose a new sparse gate:
COMET, which relies on a novel tree-based mechanism. COMET is differentiable,
can exploit sparsity to speed up computation, and outperforms state-of-the-art
gates. Second, due to the challenging combinatorial nature of sparse expert
selection, first-order methods are typically prone to low-quality solutions. To
deal with this challenge, we propose a novel, permutation-based local search
method that can complement first-order methods in training any sparse gate,
e.g., Hash routing, Top-k, DSelect-k, and COMET. We show that local search can
help networks escape bad initializations or solutions. We performed large-scale
experiments on various domains, including recommender systems, vision, and
natural language processing. On standard vision and recommender systems
benchmarks, COMET+ (COMET with local search) achieves up to 13% improvement in
ROC AUC over popular gates, e.g., Hash routing and Top-k, and up to 9% over
prior differentiable gates e.g., DSelect-k. When Top-k and Hash gates are
combined with local search, we see up to $100\times$ reduction in the budget
needed for hyperparameter tuning. Moreover, for language modeling, our approach
improves over the state-of-the-art MoEBERT model for distilling BERT on 5/7
GLUE benchmarks as well as the SQuAD dataset.
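To make the gating setup concrete, here is a minimal NumPy sketch of a standard Top-k sparse gate over a small mixture of linear experts, together with a toy swap move in the spirit of permutation-based local search (a candidate permutation of expert-to-gate-column assignments is kept only if it lowers the loss). This is an illustration, not the paper's code: the names `topk_gate` and `moe_forward` and the toy data are assumptions, and COMET's tree-based gate and the paper's actual local-search procedure are considerably more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def topk_gate(x, W_gate, k):
    """Top-k sparse gate: keep the k largest gate logits per sample, renormalize."""
    logits = x @ W_gate                                   # (batch, n_experts)
    idx = np.argsort(-logits, axis=-1)[:, :k]             # indices of the selected experts
    mask = np.zeros_like(logits)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    weights = softmax(np.where(mask > 0, logits, -np.inf))
    return weights, idx                                   # sparse mixture weights, chosen experts

def moe_forward(x, expert_weights, gate_weights):
    """Combine the (linear) expert outputs using the sparse gate weights."""
    outs = np.stack([x @ W for W in expert_weights], axis=-1)   # (batch, d_out, n_experts)
    return np.einsum('bde,be->bd', outs, gate_weights)

# Toy setup: 4 linear experts, the gate selects k = 2 of them per sample.
d_in, d_out, n_experts, k = 8, 3, 4, 2
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
W_gate = rng.normal(size=(d_in, n_experts))
x = rng.normal(size=(16, d_in))
y = rng.normal(size=(16, d_out))

def loss(W_gate):
    gate_w, _ = topk_gate(x, W_gate, k)
    return float(np.mean((moe_forward(x, experts, gate_w) - y) ** 2))

# Toy permutation-style local-search move: swap two experts' gate columns and keep
# the swap only if the loss drops. The paper's permutation-based local search is far
# more general; this only illustrates escaping a poor expert-to-gate assignment.
best = loss(W_gate)
for i in range(n_experts):
    for j in range(i + 1, n_experts):
        W_swap = W_gate.copy()
        W_swap[:, [i, j]] = W_swap[:, [j, i]]
        trial = loss(W_swap)
        if trial < best:
            W_gate, best = W_swap, trial
print("loss after local search:", best)
```

Even this toy version shows why the problem is combinatorial: the selected expert set changes discretely as the gate parameters move, which is the kind of loss surface on which purely first-order training can stall and where the abstract argues local search helps.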
Related papers
- An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook.
Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches.
We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts.
Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z)
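For the short text clustering entry above, the following is a small self-contained sketch of the collapsed Gibbs assignment step for the Dirichlet Multinomial Mixture, in the standard GSDMM formulation with symmetric priors alpha and beta. The function name `gsdmm_fit` and the toy corpus are assumptions for illustration; this is not the GSDMM+ refinement proposed in that paper.

```python
import math
import random
from collections import Counter

def gsdmm_fit(docs, K=8, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Minimal collapsed Gibbs sampler for the Dirichlet Multinomial Mixture
    (the classic GSDMM update with symmetric priors alpha and beta)."""
    rnd = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V, D = len(vocab), len(docs)
    z = [rnd.randrange(K) for _ in docs]        # cluster assignment of each document
    m = [0] * K                                 # number of documents per cluster
    n = [0] * K                                 # total word count per cluster
    nw = [Counter() for _ in range(K)]          # per-cluster word counts
    for d, k in zip(docs, z):
        m[k] += 1; n[k] += len(d); nw[k].update(d)

    for _ in range(n_iters):
        for i, d in enumerate(docs):
            old = z[i]                          # remove document i from its cluster
            m[old] -= 1; n[old] -= len(d); nw[old].subtract(d)
            counts = Counter(d)
            log_p = []
            for k in range(K):                  # conditional p(z_i = k | everything else)
                lp = math.log(m[k] + alpha) - math.log(D - 1 + K * alpha)
                for w, c in counts.items():
                    for t in range(c):
                        lp += math.log(nw[k][w] + beta + t)
                for t in range(len(d)):
                    lp -= math.log(n[k] + V * beta + t)
                log_p.append(lp)
            mx = max(log_p)
            probs = [math.exp(lp - mx) for lp in log_p]
            k = rnd.choices(range(K), weights=probs)[0]
            z[i] = k                            # reassign and restore the counts
            m[k] += 1; n[k] += len(d); nw[k].update(d)
    return z

docs = [["cat", "dog", "pet"], ["dog", "pet", "food"],
        ["stock", "market", "trade"], ["market", "trade", "price"]]
print(gsdmm_fit(docs, K=2))                     # one cluster id per document
```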
- Learning To Dive In Branch And Bound [95.13209326119153]
We propose L2Dive to learn specific diving heuristics with graph neural networks.
We train generative models to predict variable assignments and leverage the duality of linear programs to make diving decisions.
arXiv Detail & Related papers (2023-01-24T12:01:45Z)
- Learning to Compare Nodes in Branch and Bound with Graph Neural Networks [5.08128537391027]
Branch-and-bound approaches in integer programming require ordering portions of the space to explore next.
We propose a new siamese graph neural network model to tackle this problem, where the nodes are represented as bipartite graphs with attributes.
We evaluate our method by solving the instances in a plain framework where the nodes are explored according to their rank.
arXiv Detail & Related papers (2022-10-30T19:38:23Z)
- A Metaheuristic Algorithm for Large Maximum Weight Independent Set Problems [58.348679046591265]
Given a node-weighted graph, find a set of independent (mutually nonadjacent) nodes whose node-weight sum is maximum.
Some of the graphs arising in this application are large, having hundreds of thousands of nodes and hundreds of millions of edges.
We develop a new local search algorithm, which is a metaheuristic in the greedy randomized adaptive search framework.
arXiv Detail & Related papers (2022-03-28T21:34:16Z)
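As a concrete illustration of the maximum weight independent set problem and a greedy randomized adaptive search (GRASP) style heuristic of the kind named in the metaheuristic entry above, here is a generic sketch: randomized greedy construction from a restricted candidate list, followed by simple (1,1)-swap local improvement. The function `grasp_mwis` and the toy graph are assumptions; the paper's algorithm is substantially more involved.

```python
import random

def grasp_mwis(adj, w, iters=50, rcl_size=3, seed=0):
    """GRASP-style heuristic for maximum weight independent set.
    `adj` maps each node to the set of its neighbours, `w` maps node -> weight."""
    rnd = random.Random(seed)
    best, best_w = set(), 0.0

    for _ in range(iters):
        # Greedy randomized construction: repeatedly pick a node from a small
        # restricted candidate list (RCL) of the heaviest still-feasible nodes.
        cand = set(adj)
        sol = set()
        while cand:
            rcl = sorted(cand, key=lambda v: -w[v])[:rcl_size]
            v = rnd.choice(rcl)
            sol.add(v)
            cand -= adj[v] | {v}

        # Local improvement: (1,1)-swaps, replacing a chosen node by a heavier
        # neighbour whenever the set stays independent.
        improved = True
        while improved:
            improved = False
            for v in list(sol):
                rest = sol - {v}
                blocked = set().union(*(adj[u] for u in rest)) | rest
                for u in adj[v]:
                    if u not in blocked and w[u] > w[v]:
                        sol = rest | {u}
                        improved = True
                        break
                if improved:
                    break

        total = sum(w[v] for v in sol)
        if total > best_w:
            best, best_w = sol, total
    return best, best_w

# Tiny example graph (node: neighbours) with node weights.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
w = {1: 4.0, 2: 5.0, 3: 9.0, 4: 2.0, 5: 6.0}
print(grasp_mwis(adj, w))   # -> ({3, 5}, 15.0) for this toy instance
```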
- Flipping the switch on local exploration: Genetic Algorithms with Reversals [0.0]
The authors show that gradient-free search techniques are suitable for providing an optimal solution in the discrete domain.
They also show that using multiple local searches can improve performance over a single local search.
The proposed GA variants achieve the lowest average cost across all benchmarks, including the newly proposed problem, and the IC variant performs better than its constituents.
arXiv Detail & Related papers (2022-02-02T08:27:11Z)
- Enhanced Exploration in Neural Feature Selection for Deep Click-Through Rate Prediction Models via Ensemble of Gating Layers [7.381829794276824]
The goal of neural feature selection (NFS) is to choose a relatively small subset of features with the best explanatory power.
Gating approach inserts a set of differentiable binary gates to drop less informative features.
To improve the exploration capacity of gradient-based solutions, we propose a simple but effective ensemble learning approach.
arXiv Detail & Related papers (2021-12-07T04:37:05Z)
- ZARTS: On Zero-order Optimization for Neural Architecture Search [94.41017048659664]
Differentiable architecture search (DARTS) has been a popular one-shot paradigm for NAS due to its high efficiency.
This work turns to zero-order optimization and proposes a novel NAS scheme, called ZARTS, to search without enforcing the above approximation.
In particular, results on 12 benchmarks verify the outstanding robustness of ZARTS, where the performance of DARTS collapses due to its known instability issue.
arXiv Detail & Related papers (2021-10-10T09:35:15Z)
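For the zero-order optimization idea behind ZARTS, this is a minimal sketch of a two-point zeroth-order gradient estimate, which replaces backpropagation with function evaluations only. It is a generic textbook estimator on an assumed toy objective, not the ZARTS search scheme itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def zeroth_order_grad(f, x, mu=1e-3, n_samples=32):
    """Two-point zeroth-order gradient estimate:
    g is the average of (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over u ~ N(0, I).
    Only function evaluations are needed, with no backpropagation through f."""
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

# Toy objective standing in for a non-differentiable architecture score.
target = np.array([0.3, -0.7, 1.2])
f = lambda a: float(np.sum((a - target) ** 2))
a = np.zeros(3)
for _ in range(200):                 # plain descent driven by the estimate
    a -= 0.1 * zeroth_order_grad(f, a)
print(np.round(a, 3))                # close to [0.3, -0.7, 1.2]
```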
- DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning [17.012443240520625]
State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example.
We develop DSelect-k: the first continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation.
Our experiments indicate MoE models based on DSelect-k can achieve statistically significant improvements in predictive and expert selection performance.
arXiv Detail & Related papers (2021-06-07T16:25:27Z)
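To give a flavour of the binary-encoding idea behind DSelect-k, here is a simplified sketch (not the official implementation; the function names, `gamma` parameter, and examples are assumptions): each of 2^m experts is addressed by an m-bit code, and a cubic smooth-step turns m real parameters into soft bits whose products form a differentiable distribution over experts that becomes exactly one-hot once the bits saturate.

```python
import numpy as np

def smooth_step(t, gamma=1.0):
    """Cubic smooth-step: 0 below -gamma/2, 1 above gamma/2, smooth in between
    (the kind of relaxation of a binary bit used by DSelect-k-style gates)."""
    t = np.asarray(t, dtype=float)
    s = -2.0 / gamma**3 * t**3 + 1.5 / gamma * t + 0.5
    return np.where(t <= -gamma / 2, 0.0, np.where(t >= gamma / 2, 1.0, s))

def binary_encoding_selector(z, n_experts):
    """Differentiable single-expert selector over n_experts = 2**m experts.
    z holds m real parameters; soft bits are multiplied according to each
    expert's binary code, giving weights that sum to one and become exactly
    one-hot once every soft bit saturates at 0 or 1."""
    m = len(z)
    assert n_experts == 2 ** m
    bits = smooth_step(z)                       # m soft bits in [0, 1]
    weights = np.ones(n_experts)
    for i in range(n_experts):
        for j in range(m):
            b = (i >> j) & 1                    # j-th bit of expert index i
            weights[i] *= bits[j] if b else (1.0 - bits[j])
    return weights

# With saturated parameters the selector is exactly sparse (one-hot on expert 0b10 = 2):
print(binary_encoding_selector(np.array([-2.0, 2.0]), 4))   # -> [0. 0. 1. 0.]
# With unsaturated parameters it is a smooth distribution over the 4 experts:
print(binary_encoding_selector(np.array([0.1, -0.2]), 4))
# DSelect-k combines k such single-expert selectors to pick k experts.
```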
- ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding [86.40042104698792]
We formulate neural architecture search as a sparse coding problem.
In experiments, our two-stage method on CIFAR-10 requires only 0.05 GPU-day for search.
Our one-stage method produces state-of-the-art performances on both CIFAR-10 and ImageNet at the cost of only evaluation time.
arXiv Detail & Related papers (2020-10-13T04:34:24Z)
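Since ISTA-NAS casts architecture search as sparse coding, the following generic ISTA (iterative shrinkage-thresholding) sketch shows the kind of sparse-coding solver involved. It is a textbook LASSO solver on assumed toy data, not the ISTA-NAS search procedure itself.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam=0.1, n_iters=500):
    """ISTA for the LASSO problem  min_x 0.5*||A x - b||^2 + lam*||x||_1,
    the kind of sparse-coding subproblem referenced by ISTA-NAS."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Toy sparse recovery: only a few coefficients of the true code are nonzero.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 60))
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [1.5, -2.0, 0.8]
b = A @ x_true
x_hat = ista(A, b, lam=0.05)
print(np.flatnonzero(np.abs(x_hat) > 0.3))      # nonzero indices found (ideally 3, 17, 42)
```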
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones, and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
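To illustrate the decentralized consensus setting of the coded ADMM entry above, here is a generic consensus-ADMM sketch for distributed least squares across a few agents, using the standard scaled-form updates on assumed toy data; it is not the coded, mini-batch algorithm studied in that paper.

```python
import numpy as np

def consensus_admm(As, bs, rho=1.0, n_iters=100):
    """Consensus ADMM for  min_x  sum_i 0.5*||A_i x - b_i||^2,
    written as  min sum_i f_i(x_i)  subject to  x_i = z  for every agent i.
    Each agent updates its local x_i; only z and the duals are shared."""
    d = As[0].shape[1]
    n = len(As)
    z = np.zeros(d)
    xs = [np.zeros(d) for _ in range(n)]
    us = [np.zeros(d) for _ in range(n)]          # scaled dual variables
    for _ in range(n_iters):
        # Local (agent-side) updates: ridge-like least-squares solves.
        for i in range(n):
            lhs = As[i].T @ As[i] + rho * np.eye(d)
            rhs = As[i].T @ bs[i] + rho * (z - us[i])
            xs[i] = np.linalg.solve(lhs, rhs)
        # Global consensus update, then dual ascent.
        z = np.mean([x + u for x, u in zip(xs, us)], axis=0)
        for i in range(n):
            us[i] += xs[i] - z
    return z

# Toy decentralized setup: 3 agents each hold a slice of the data.
rng = np.random.default_rng(0)
x_true = rng.normal(size=5)
As = [rng.normal(size=(20, 5)) for _ in range(3)]
bs = [A @ x_true + 0.01 * rng.normal(size=20) for A in As]
print(np.round(consensus_admm(As, bs), 3))        # close to x_true
print(np.round(x_true, 3))
```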
- GOLD-NAS: Gradual, One-Level, Differentiable [100.12492801459105]
We propose a novel algorithm named Gradual One-Level Differentiable Neural Architecture Search (GOLD-NAS).
It introduces a variable resource constraint to one-level optimization so that the weak operators are gradually pruned out from the super-network.
arXiv Detail & Related papers (2020-07-07T10:37:49Z)