COMET: Learning Cardinality Constrained Mixture of Experts with Trees
and Local Search
- URL: http://arxiv.org/abs/2306.02824v1
- Date: Mon, 5 Jun 2023 12:21:42 GMT
- Title: COMET: Learning Cardinality Constrained Mixture of Experts with Trees
and Local Search
- Authors: Shibal Ibrahim, Wenyu Chen, Hussein Hazimeh, Natalia Ponomareva, Zhe
Zhao, Rahul Mazumder
- Abstract summary: The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity in various domains.
Existing sparse gates are prone to convergence and performance issues when training with first-order optimization methods.
We propose a new sparse gate: COMET, which relies on a novel tree-based mechanism.
- Score: 10.003251119927222
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up
model capacity in various domains, such as natural language processing and
vision. Sparse-MoEs select a subset of the "experts" (thus, only a portion of
the overall network) for each input sample using a sparse, trainable gate.
Existing sparse gates are prone to convergence and performance issues when
training with first-order optimization methods. In this paper, we introduce two
improvements to current MoE approaches. First, we propose a new sparse gate:
COMET, which relies on a novel tree-based mechanism. COMET is differentiable,
can exploit sparsity to speed up computation, and outperforms state-of-the-art
gates. Second, due to the challenging combinatorial nature of sparse expert
selection, first-order methods are typically prone to low-quality solutions. To
deal with this challenge, we propose a novel, permutation-based local search
method that can complement first-order methods in training any sparse gate,
e.g., Hash routing, Top-k, DSelect-k, and COMET. We show that local search can
help networks escape bad initializations or solutions. We performed large-scale
experiments on various domains, including recommender systems, vision, and
natural language processing. On standard vision and recommender systems
benchmarks, COMET+ (COMET with local search) achieves up to 13% improvement in
ROC AUC over popular gates, e.g., Hash routing and Top-k, and up to 9% over
prior differentiable gates e.g., DSelect-k. When Top-k and Hash gates are
combined with local search, we see up to $100\times$ reduction in the budget
needed for hyperparameter tuning. Moreover, for language modeling, our approach
improves over the state-of-the-art MoEBERT model for distilling BERT on 5/7
GLUE benchmarks as well as the SQuAD dataset.
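To make the gating setup concrete, here is a minimal NumPy sketch of a standard Top-k sparse gate over a small mixture of linear experts, together with a toy swap move in the spirit of permutation-based local search (a candidate permutation of expert-to-gate-column assignments is kept only if it lowers the loss). This is an illustration, not the paper's code: the names `topk_gate` and `moe_forward` and the toy data are assumptions, and COMET's tree-based gate and the paper's actual local-search procedure are considerably more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def topk_gate(x, W_gate, k):
    """Top-k sparse gate: keep the k largest gate logits per sample, renormalize."""
    logits = x @ W_gate                                   # (batch, n_experts)
    idx = np.argsort(-logits, axis=-1)[:, :k]             # indices of the selected experts
    mask = np.zeros_like(logits)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    weights = softmax(np.where(mask > 0, logits, -np.inf))
    return weights, idx                                   # sparse mixture weights, chosen experts

def moe_forward(x, expert_weights, gate_weights):
    """Combine the (linear) expert outputs using the sparse gate weights."""
    outs = np.stack([x @ W for W in expert_weights], axis=-1)   # (batch, d_out, n_experts)
    return np.einsum('bde,be->bd', outs, gate_weights)

# Toy setup: 4 linear experts, the gate selects k = 2 of them per sample.
d_in, d_out, n_experts, k = 8, 3, 4, 2
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
W_gate = rng.normal(size=(d_in, n_experts))
x = rng.normal(size=(16, d_in))
y = rng.normal(size=(16, d_out))

def loss(W_gate):
    gate_w, _ = topk_gate(x, W_gate, k)
    return float(np.mean((moe_forward(x, experts, gate_w) - y) ** 2))

# Toy permutation-style local-search move: swap two experts' gate columns and keep
# the swap only if the loss drops. The paper's permutation-based local search is far
# more general; this only illustrates escaping a poor expert-to-gate assignment.
best = loss(W_gate)
for i in range(n_experts):
    for j in range(i + 1, n_experts):
        W_swap = W_gate.copy()
        W_swap[:, [i, j]] = W_swap[:, [j, i]]
        trial = loss(W_swap)
        if trial < best:
            W_gate, best = W_swap, trial
print("loss after local search:", best)
```

Even this toy version shows why the problem is combinatorial: the selected expert set changes discretely as the gate parameters move, which is the kind of loss surface on which purely first-order training can stall and where the abstract argues local search helps.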
Related papers
- An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook.
Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches.
We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts.
Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z)
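For the short text clustering entry above, the following is a small self-contained sketch of the collapsed Gibbs assignment step for the Dirichlet Multinomial Mixture, in the standard GSDMM formulation with symmetric priors alpha and beta. The function name `gsdmm_fit` and the toy corpus are assumptions for illustration; this is not the GSDMM+ refinement proposed in that paper.

```python
import math
import random
from collections import Counter

def gsdmm_fit(docs, K=8, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Minimal collapsed Gibbs sampler for the Dirichlet Multinomial Mixture
    (the classic GSDMM update with symmetric priors alpha and beta)."""
    rnd = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V, D = len(vocab), len(docs)
    z = [rnd.randrange(K) for _ in docs]        # cluster assignment of each document
    m = [0] * K                                 # number of documents per cluster
    n = [0] * K                                 # total word count per cluster
    nw = [Counter() for _ in range(K)]          # per-cluster word counts
    for d, k in zip(docs, z):
        m[k] += 1; n[k] += len(d); nw[k].update(d)

    for _ in range(n_iters):
        for i, d in enumerate(docs):
            old = z[i]                          # remove document i from its cluster
            m[old] -= 1; n[old] -= len(d); nw[old].subtract(d)
            counts = Counter(d)
            log_p = []
            for k in range(K):                  # conditional p(z_i = k | everything else)
                lp = math.log(m[k] + alpha) - math.log(D - 1 + K * alpha)
                for w, c in counts.items():
                    for t in range(c):
                        lp += math.log(nw[k][w] + beta + t)
                for t in range(len(d)):
                    lp -= math.log(n[k] + V * beta + t)
                log_p.append(lp)
            mx = max(log_p)
            probs = [math.exp(lp - mx) for lp in log_p]
            k = rnd.choices(range(K), weights=probs)[0]
            z[i] = k                            # reassign and restore the counts
            m[k] += 1; n[k] += len(d); nw[k].update(d)
    return z

docs = [["cat", "dog", "pet"], ["dog", "pet", "food"],
        ["stock", "market", "trade"], ["market", "trade", "price"]]
print(gsdmm_fit(docs, K=2))                     # one cluster id per document
```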
- Learning To Dive In Branch And Bound [95.13209326119153]
We propose L2Dive to learn specific diving heuristics with graph neural networks.
We train generative models to predict variable assignments and leverage the duality of linear programs to make diving decisions.
arXiv Detail & Related papers (2023-01-24T12:01:45Z)
- Learning to Compare Nodes in Branch and Bound with Graph Neural Networks [5.08128537391027]
Branch-and-bound approaches in integer programming require ordering portions of the space to explore next.
We propose a new siamese graph neural network model to tackle this problem, where the nodes are represented as bipartite graphs with attributes.
We evaluate our method by solving the instances in a plain framework where the nodes are explored according to their rank.
arXiv Detail & Related papers (2022-10-30T19:38:23Z)
- A Metaheuristic Algorithm for Large Maximum Weight Independent Set Problems [58.348679046591265]
Given a node-weighted graph, find a set of independent (mutually nonadjacent) nodes whose node-weight sum is maximum.
Some of the graphs arising in this application are large, having hundreds of thousands of nodes and hundreds of millions of edges.
We develop a new local search algorithm, which is a metaheuristic in the greedy randomized adaptive search framework.
arXiv Detail & Related papers (2022-03-28T21:34:16Z)
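As a concrete illustration of the maximum weight independent set problem and a greedy randomized adaptive search (GRASP) style heuristic of the kind named in the metaheuristic entry above, here is a generic sketch: randomized greedy construction from a restricted candidate list, followed by simple (1,1)-swap local improvement. The function `grasp_mwis` and the toy graph are assumptions; the paper's algorithm is substantially more involved.

```python
import random

def grasp_mwis(adj, w, iters=50, rcl_size=3, seed=0):
    """GRASP-style heuristic for maximum weight independent set.
    `adj` maps each node to the set of its neighbours, `w` maps node -> weight."""
    rnd = random.Random(seed)
    best, best_w = set(), 0.0

    for _ in range(iters):
        # Greedy randomized construction: repeatedly pick a node from a small
        # restricted candidate list (RCL) of the heaviest still-feasible nodes.
        cand = set(adj)
        sol = set()
        while cand:
            rcl = sorted(cand, key=lambda v: -w[v])[:rcl_size]
            v = rnd.choice(rcl)
            sol.add(v)
            cand -= adj[v] | {v}

        # Local improvement: (1,1)-swaps, replacing a chosen node by a heavier
        # neighbour whenever the set stays independent.
        improved = True
        while improved:
            improved = False
            for v in list(sol):
                rest = sol - {v}
                blocked = set().union(*(adj[u] for u in rest)) | rest
                for u in adj[v]:
                    if u not in blocked and w[u] > w[v]:
                        sol = rest | {u}
                        improved = True
                        break
                if improved:
                    break

        total = sum(w[v] for v in sol)
        if total > best_w:
            best, best_w = sol, total
    return best, best_w

# Tiny example graph (node: neighbours) with node weights.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
w = {1: 4.0, 2: 5.0, 3: 9.0, 4: 2.0, 5: 6.0}
print(grasp_mwis(adj, w))   # -> ({3, 5}, 15.0) for this toy instance
```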
- Flipping the switch on local exploration: Genetic Algorithms with Reversals [0.0]
The authors show that gradient-free search techniques are suitable for providing an optimal solution in the discrete domain.
They also show that using multiple local searches can improve performance over a single local search.
The proposed GA variants achieve the lowest average cost across all benchmarks, including the newly proposed problem, and the IC variant performs better than its constituents.
arXiv Detail & Related papers (2022-02-02T08:27:11Z)
- Enhanced Exploration in Neural Feature Selection for Deep Click-Through Rate Prediction Models via Ensemble of Gating Layers [7.381829794276824]
The goal of neural feature selection (NFS) is to choose a relatively small subset of features with the best explanatory power.
Gating approach inserts a set of differentiable binary gates to drop less informative features.
To improve the exploration capacity of gradient-based solutions, we propose a simple but effective ensemble learning approach.
arXiv Detail & Related papers (2021-12-07T04:37:05Z)
- ZARTS: On Zero-order Optimization for Neural Architecture Search [94.41017048659664]
Differentiable architecture search (DARTS) has been a popular one-shot paradigm for NAS due to its high efficiency.
This work turns to zero-order optimization and proposes a novel NAS scheme, called ZARTS, to search without enforcing the above approximation.
In particular, results on 12 benchmarks verify the outstanding robustness of ZARTS, where the performance of DARTS collapses due to its known instability issue.
arXiv Detail & Related papers (2021-10-10T09:35:15Z)
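For the zero-order optimization idea behind ZARTS, this is a minimal sketch of a two-point zeroth-order gradient estimate, which replaces backpropagation with function evaluations only. It is a generic textbook estimator on an assumed toy objective, not the ZARTS search scheme itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def zeroth_order_grad(f, x, mu=1e-3, n_samples=32):
    """Two-point zeroth-order gradient estimate:
    g is the average of (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over u ~ N(0, I).
    Only function evaluations are needed, with no backpropagation through f."""
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

# Toy objective standing in for a non-differentiable architecture score.
target = np.array([0.3, -0.7, 1.2])
f = lambda a: float(np.sum((a - target) ** 2))
a = np.zeros(3)
for _ in range(200):                 # plain descent driven by the estimate
    a -= 0.1 * zeroth_order_grad(f, a)
print(np.round(a, 3))                # close to [0.3, -0.7, 1.2]
```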
- DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning [17.012443240520625]
State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example.
We develop DSelect-k: the first continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation.
Our experiments indicate MoE models based on DSelect-k can achieve statistically significant improvements in predictive and expert selection performance.
arXiv Detail & Related papers (2021-06-07T16:25:27Z)
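To give a flavour of the binary-encoding idea behind DSelect-k, here is a simplified sketch (not the official implementation; the function names, `gamma` parameter, and examples are assumptions): each of 2^m experts is addressed by an m-bit code, and a cubic smooth-step turns m real parameters into soft bits whose products form a differentiable distribution over experts that becomes exactly one-hot once the bits saturate.

```python
import numpy as np

def smooth_step(t, gamma=1.0):
    """Cubic smooth-step: 0 below -gamma/2, 1 above gamma/2, smooth in between
    (the kind of relaxation of a binary bit used by DSelect-k-style gates)."""
    t = np.asarray(t, dtype=float)
    s = -2.0 / gamma**3 * t**3 + 1.5 / gamma * t + 0.5
    return np.where(t <= -gamma / 2, 0.0, np.where(t >= gamma / 2, 1.0, s))

def binary_encoding_selector(z, n_experts):
    """Differentiable single-expert selector over n_experts = 2**m experts.
    z holds m real parameters; soft bits are multiplied according to each
    expert's binary code, giving weights that sum to one and become exactly
    one-hot once every soft bit saturates at 0 or 1."""
    m = len(z)
    assert n_experts == 2 ** m
    bits = smooth_step(z)                       # m soft bits in [0, 1]
    weights = np.ones(n_experts)
    for i in range(n_experts):
        for j in range(m):
            b = (i >> j) & 1                    # j-th bit of expert index i
            weights[i] *= bits[j] if b else (1.0 - bits[j])
    return weights

# With saturated parameters the selector is exactly sparse (one-hot on expert 0b10 = 2):
print(binary_encoding_selector(np.array([-2.0, 2.0]), 4))   # -> [0. 0. 1. 0.]
# With unsaturated parameters it is a smooth distribution over the 4 experts:
print(binary_encoding_selector(np.array([0.1, -0.2]), 4))
# DSelect-k combines k such single-expert selectors to pick k experts.
```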
- ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding [86.40042104698792]
We formulate neural architecture search as a sparse coding problem.
In experiments, our two-stage method on CIFAR-10 requires only 0.05 GPU-day for search.
Our one-stage method produces state-of-the-art performances on both CIFAR-10 and ImageNet at the cost of only evaluation time.
arXiv Detail & Related papers (2020-10-13T04:34:24Z)
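Since ISTA-NAS casts architecture search as sparse coding, the following generic ISTA (iterative shrinkage-thresholding) sketch shows the kind of sparse-coding solver involved. It is a textbook LASSO solver on assumed toy data, not the ISTA-NAS search procedure itself.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam=0.1, n_iters=500):
    """ISTA for the LASSO problem  min_x 0.5*||A x - b||^2 + lam*||x||_1,
    the kind of sparse-coding subproblem referenced by ISTA-NAS."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Toy sparse recovery: only a few coefficients of the true code are nonzero.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 60))
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [1.5, -2.0, 0.8]
b = A @ x_true
x_hat = ista(A, b, lam=0.05)
print(np.flatnonzero(np.abs(x_hat) > 0.3))      # nonzero indices found (ideally 3, 17, 42)
```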
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones, and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
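To illustrate the decentralized consensus setting of the coded ADMM entry above, here is a generic consensus-ADMM sketch for distributed least squares across a few agents, using the standard scaled-form updates on assumed toy data; it is not the coded, mini-batch algorithm studied in that paper.

```python
import numpy as np

def consensus_admm(As, bs, rho=1.0, n_iters=100):
    """Consensus ADMM for  min_x  sum_i 0.5*||A_i x - b_i||^2,
    written as  min sum_i f_i(x_i)  subject to  x_i = z  for every agent i.
    Each agent updates its local x_i; only z and the duals are shared."""
    d = As[0].shape[1]
    n = len(As)
    z = np.zeros(d)
    xs = [np.zeros(d) for _ in range(n)]
    us = [np.zeros(d) for _ in range(n)]          # scaled dual variables
    for _ in range(n_iters):
        # Local (agent-side) updates: ridge-like least-squares solves.
        for i in range(n):
            lhs = As[i].T @ As[i] + rho * np.eye(d)
            rhs = As[i].T @ bs[i] + rho * (z - us[i])
            xs[i] = np.linalg.solve(lhs, rhs)
        # Global consensus update, then dual ascent.
        z = np.mean([x + u for x, u in zip(xs, us)], axis=0)
        for i in range(n):
            us[i] += xs[i] - z
    return z

# Toy decentralized setup: 3 agents each hold a slice of the data.
rng = np.random.default_rng(0)
x_true = rng.normal(size=5)
As = [rng.normal(size=(20, 5)) for _ in range(3)]
bs = [A @ x_true + 0.01 * rng.normal(size=20) for A in As]
print(np.round(consensus_admm(As, bs), 3))        # close to x_true
print(np.round(x_true, 3))
```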
- GOLD-NAS: Gradual, One-Level, Differentiable [100.12492801459105]
We propose a novel algorithm named Gradual One-Level Differentiable Neural Architecture Search (GOLD-NAS).
It introduces a variable resource constraint to one-level optimization so that the weak operators are gradually pruned out from the super-network.
arXiv Detail & Related papers (2020-07-07T10:37:49Z)