Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
- URL: http://arxiv.org/abs/2310.15961v1
- Date: Tue, 24 Oct 2023 16:03:57 GMT
- Title: Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
- Authors: Szymon Antoniak, Sebastian Jaszczur, Michał Krutul, Maciej Pióro, Jakub Krajewski, Jan Ludziejewski, Tomasz Odrzygóźdź, Marek Cygan
- Abstract summary: Mixture of Experts (MoE) models increase parameter counts of Transformer models while maintaining training and inference costs.
MoE models are prone to issues like training instability and uneven expert utilization.
We propose a fully-differentiable model that retains the benefits of MoE architectures while avoiding the aforementioned difficulties.
- Score: 0.9618396291860722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the promise of Mixture of Experts (MoE) models in increasing
parameter counts of Transformer models while maintaining training and inference
costs, their application carries notable drawbacks. The key strategy of these
models is to activate, for each processed token, at most a few experts -
subsets of an extensive feed-forward layer. But this approach is not without
its challenges. The operation of matching experts and tokens is discrete, which
makes MoE models prone to issues like training instability and uneven expert
utilization. Existing techniques designed to address these concerns, such as
auxiliary losses or balance-aware matching, result either in lower model
performance or are more difficult to train. In response to these issues, we
propose Mixture of Tokens, a fully-differentiable model that retains the
benefits of MoE architectures while avoiding the aforementioned difficulties.
Rather than routing tokens to experts, this approach mixes tokens from
different examples prior to feeding them to experts, enabling the model to
learn from all token-expert combinations. Importantly, this mixing can be
disabled to avoid mixing of different sequences during inference. Crucially,
this method is fully compatible with both masked and causal Large Language
Model training and inference.
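To make the mixing mechanism concrete, here is a minimal sketch of the idea (our illustration, not the authors' code): tokens occupying the same position across a batch form a mixing group, a small controller assigns each token a weight per expert, every expert processes one weighted mixture per group, and the expert's output is redistributed to the original tokens with the same weights. Setting the group dimension to 1 recovers the no-mixing inference mode mentioned above. All names and sizes below are illustrative assumptions.

```python
# Hedged sketch of a Mixture of Tokens layer (illustrative, not the paper's code).
import torch
import torch.nn as nn

class MixtureOfTokens(nn.Module):
    def __init__(self, d_model: int, n_experts: int, d_ff: int):
        super().__init__()
        self.controller = nn.Linear(d_model, n_experts)   # per-token mixing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (group, seq, d_model); the group axis holds tokens from different examples.
        weights = torch.softmax(self.controller(x), dim=0)    # normalize over the group
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            w = weights[..., e].unsqueeze(-1)                 # (group, seq, 1)
            mixed = (w * x).sum(dim=0, keepdim=True)          # one mixed token per position
            out = out + w * expert(mixed)                     # unmix: redistribute output
        return out

layer = MixtureOfTokens(d_model=16, n_experts=4, d_ff=32)
print(layer(torch.randn(8, 5, 16)).shape)   # torch.Size([8, 5, 16])
```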
Related papers
- MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts [38.15244333975921]
Mixture-of-Experts models (MoEs) allow model capacity to scale without substantially increasing training or inference costs.
We propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the MoE model.
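The abstract leaves the masking details to the paper, but the mechanism it names can be sketched: a per-token binary mask hides some experts from the router before selection (for instance, pinning rare tokens to a single expert). The mask construction below is an assumption for illustration.

```python
# Hedged sketch of routing with a per-token expert mask (details are ours).
import torch

def masked_route(router_logits: torch.Tensor, expert_mask: torch.Tensor) -> torch.Tensor:
    # router_logits, expert_mask: (n_tokens, n_experts); mask 1 = expert visible.
    masked = router_logits.masked_fill(expert_mask == 0, float("-inf"))
    return masked.argmax(dim=-1)          # chosen expert per token

logits = torch.randn(4, 8)
mask = torch.ones(4, 8)
mask[0, 1:] = 0                           # e.g. a rare token pinned to expert 0
print(masked_route(logits, mask))
```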
arXiv Detail & Related papers (2024-07-13T09:22:33Z)
- Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model [10.682263930467196]
The Mixture-of-Experts (MoE) has gained increasing attention in the study of Large Vision-Language Models (LVLMs).
Existing MoE methods in LVLMs encourage different experts to handle different tokens, and thus employ a router to predict the routing of each token.
This paper proposes a novel method based on token-level gradient analysis.
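The abstract does not spell the analysis out, but "token gradient conflict" can be illustrated in a few lines: compare the gradients that two tokens routed to the same expert induce on that expert's weights; a negative cosine similarity signals that the tokens pull the expert in opposing directions. Everything below is our toy illustration.

```python
# Toy illustration of token-level gradient conflict on a shared expert.
import torch

expert = torch.nn.Linear(4, 4)
tokens, targets = torch.randn(2, 4), torch.randn(2, 4)

grads = []
for i in range(2):
    loss = torch.nn.functional.mse_loss(expert(tokens[i]), targets[i])
    g = torch.autograd.grad(loss, expert.weight)[0].flatten()
    grads.append(g)

cos = torch.nn.functional.cosine_similarity(grads[0], grads[1], dim=0)
print(f"gradient cosine similarity: {cos.item():.3f} (negative => conflict)")
```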
arXiv Detail & Related papers (2024-06-28T13:20:17Z)
- GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory [49.536752342048075]
Mixture-of-Experts (MoE) has been demonstrated to be an efficient method for scaling up models.
We propose a new fine-tuning method, GW-MoE, to address uncertainty in the MoE router.
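One way to picture "resolving uncertainty in the router" (our reading; the paper's Global Workspace mechanism differs in detail) is to treat high-entropy routing decisions specially, for example by letting uncertain tokens consult all experts instead of only the top one:

```python
# Hedged sketch: confident tokens go to their top expert; uncertain (high-entropy)
# tokens are broadcast to all experts. Threshold and policy are assumptions.
import torch

def route_with_uncertainty(logits: torch.Tensor, threshold: float) -> list[str]:
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    return [
        "broadcast to all experts" if h > threshold else f"expert {int(p.argmax())}"
        for p, h in zip(probs, entropy)
    ]

print(route_with_uncertainty(torch.randn(4, 8), threshold=1.5))
```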
arXiv Detail & Related papers (2024-06-18T08:03:51Z)
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
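The self-contrast idea can be sketched at the logits level: obtain next-token logits once under the model's usual routing and once under a weaker routing that uses unchosen experts, then amplify their difference before picking a token. The combination rule and beta below are illustrative assumptions, not the paper's exact formula.

```python
# Hedged sketch of self-contrast decoding over two routing modes.
import torch

def self_contrast(strong: torch.Tensor, weak: torch.Tensor, beta: float = 0.5):
    # strong: logits under normal top-k routing; weak: logits via unchosen experts.
    return (1 + beta) * strong - beta * weak

strong, weak = torch.randn(32000), torch.randn(32000)
print(int(self_contrast(strong, weak).argmax()))   # next-token id
```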
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
- FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion [29.130355774088205]
FuseMoE is a mixture-of-experts framework incorporated with an innovative gating function.
Designed to integrate a diverse number of modalities, FuseMoE is effective in managing scenarios with missing modalities and irregularly sampled data trajectories.
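The gating behavior for missing modalities can be sketched simply: compute gate weights only over the modalities that are actually present, so absent inputs receive exactly zero weight. The module below is our stand-in, not FuseMoE's gating function.

```python
# Hedged sketch of modality-aware gating with missing-modality masking.
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    def __init__(self, d_model: int, n_modalities: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_modalities)

    def forward(self, fused: torch.Tensor, present: torch.Tensor) -> torch.Tensor:
        # fused: (batch, d_model); present: (batch, n_modalities), 1 = available.
        logits = self.gate(fused).masked_fill(present == 0, float("-inf"))
        return torch.softmax(logits, dim=-1)      # zero weight for missing inputs

gate = ModalityGate(16, 3)
print(gate(torch.randn(2, 16), torch.tensor([[1, 1, 0], [1, 0, 1]])))
```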
arXiv Detail & Related papers (2024-02-05T17:37:46Z)
- Merging Multi-Task Models via Weight-Ensembling Mixture of Experts [64.94129594112557]
Merging Transformer-based models trained on different tasks yields a single unified model that can execute all the tasks concurrently.
Previous methods, exemplified by task arithmetic, have been proven to be both effective and scalable.
We propose to merge most of the parameters while upscaling the Transformer layers to a weight-ensembling mixture of experts (MoE) module.
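The core move can be sketched as follows: keep one set of shared parameters, but replace each merged feed-forward layer with an input-conditioned combination of the task-specific weights (here expressed as task vectors, i.e. differences from the pretrained weight). The routing and sizes below are toy assumptions.

```python
# Hedged sketch of a weight-ensembling MoE layer built from task vectors.
import torch
import torch.nn as nn

d = 8
w_base = torch.randn(d, d)                             # pretrained FFN weight
task_vectors = [torch.randn(d, d) for _ in range(3)]   # W_task - W_base per task
router = nn.Linear(d, 3)                               # input -> per-task coefficients

def weight_ensemble_ffn(x: torch.Tensor) -> torch.Tensor:
    coeffs = torch.softmax(router(x.mean(dim=0)), dim=-1)
    w = w_base + sum(c * tv for c, tv in zip(coeffs, task_vectors))
    return x @ w.T                                     # apply the merged weight

print(weight_ensemble_ffn(torch.randn(5, d)).shape)    # torch.Size([5, 8])
```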
arXiv Detail & Related papers (2024-02-01T08:58:57Z)
- Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [62.41501243027603]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose a straightforward yet highly effective solution: OMoE, which trains the experts with an orthogonal optimizer.
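Whatever the paper's exact optimizer does, the goal it names, keeping expert representations from collapsing onto one another, can be expressed as an orthogonality penalty: push the Gram matrix of normalized expert representations toward the identity. The penalty below is our illustration of that goal, not the OMoE update rule.

```python
# Hedged sketch of an orthogonality penalty on expert representations.
import torch

def orthogonality_penalty(expert_reprs: torch.Tensor) -> torch.Tensor:
    # expert_reprs: (n_experts, d), e.g. a mean representation per expert.
    z = torch.nn.functional.normalize(expert_reprs, dim=-1)
    gram = z @ z.T
    return ((gram - torch.eye(z.size(0))) ** 2).sum()  # 0 when experts are orthogonal

print(orthogonality_penalty(torch.randn(4, 16)))
```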
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
- Revisiting Single-gated Mixtures of Experts [13.591354795556972]
We propose to revisit the simple single-gate MoE, which allows for more practical training.
Key to our work is a base model branch that acts both as an early exit and as an ensembling regularization scheme.
We show experimentally that the proposed model obtains efficiency-to-accuracy trade-offs comparable with other, more complex MoEs.
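The base-branch idea sketches neatly: a shared branch always runs and can either be returned alone (early exit) or added to the single-gate expert output (an ensemble of the two). Layer shapes and the top-1 gate below are illustrative.

```python
# Hedged sketch of a single-gate MoE with an always-on base branch.
import torch
import torch.nn as nn

class SingleGateMoE(nn.Module):
    def __init__(self, d: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d, n_experts)
        self.base = nn.Linear(d, d)                         # shared base branch
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))

    def forward(self, x: torch.Tensor, early_exit: bool = False) -> torch.Tensor:
        base_out = self.base(x)
        if early_exit:                                      # skip the experts entirely
            return base_out
        idx = self.gate(x).argmax(dim=-1)                   # single top-1 gate
        expert_out = torch.stack([self.experts[int(i)](xi) for i, xi in zip(idx, x)])
        return base_out + expert_out                        # ensemble both branches

m = SingleGateMoE(8, 4)
print(m(torch.randn(3, 8)).shape)   # torch.Size([3, 8])
```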
arXiv Detail & Related papers (2023-04-11T21:07:59Z)
- On the Representation Collapse of Sparse Mixture of Experts [102.83396489230375]
Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead.
It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations.
However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
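The routing mechanism in question is compact enough to show directly: each expert owns a learned embedding, and a token's routing scores are dot products between its hidden state and those embeddings. Since training raises the chosen expert's score, hidden states drift toward the expert embeddings, which is the clustering tendency described above. The tensors below are random stand-ins.

```python
# The dot-product routing underlying the analysis, in miniature.
import torch

hidden = torch.randn(6, 16)        # token hidden states
centroids = torch.randn(4, 16)     # one learned embedding per expert

scores = hidden @ centroids.T      # (tokens, experts) routing scores
probs = torch.softmax(scores, dim=-1)
print(probs.argmax(dim=-1))        # best-matched expert per token
```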
arXiv Detail & Related papers (2022-04-20T01:40:19Z)
- Identifying and Mitigating Spurious Correlations for Improving Robustness in NLP Models [19.21465581259624]
Many robustness problems in NLP models can be attributed to models exploiting spurious correlations, or shortcuts, between the training data and the task labels.
In this paper, we aim to automatically identify such spurious correlations in NLP models at scale.
We show that our proposed method can effectively and efficiently identify a scalable set of "shortcuts", and mitigating these leads to more robust models in multiple applications.
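At its crudest, shortcut hunting of this kind can be illustrated by flagging tokens whose presence almost perfectly predicts a label across the training set; the paper's actual pipeline is far more careful, so treat this only as a sketch of the idea.

```python
# Toy illustration of flagging label-predictive tokens as shortcut candidates.
from collections import Counter

data = [("the movie was great fun", 1), ("great acting overall", 1),
        ("boring and slow", 0), ("slow plot, great visuals", 0)]

positives, totals = Counter(), Counter()
for text, label in data:
    for tok in set(text.split()):
        totals[tok] += 1
        positives[tok] += label

for tok, n in totals.items():
    if n >= 2:                               # seen often enough to matter
        rate = positives[tok] / n            # P(label = 1 | token present)
        if rate in (0.0, 1.0):
            print(f"possible shortcut: {tok!r} (label rate {rate})")
```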
arXiv Detail & Related papers (2021-10-14T21:40:03Z)
- Learning to Generate Noise for Multi-Attack Robustness [126.23656251512762]
Adversarial learning has emerged as a successful technique for reducing models' susceptibility to adversarial perturbations.
However, most defenses target a single type of attack; in safety-critical applications this falls short, as an attacker can adopt diverse adversaries to deceive the system.
We propose a novel meta-learning framework that explicitly learns to generate noise to improve the model's robustness against multiple types of attacks.
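The "learning to generate noise" step can be sketched as a generator taking one adversarial ascent step against the classifier; the paper's meta-learning loop over multiple attack types sits on top of updates like this one and is not reproduced here. Sizes, epsilon, and the optimizer are assumptions.

```python
# Hedged sketch: one step of training a noise generator against a classifier.
import torch
import torch.nn as nn

classifier = nn.Linear(10, 2)
generator = nn.Sequential(nn.Linear(10, 10), nn.Tanh())    # bounded noise
opt = torch.optim.SGD(generator.parameters(), lr=0.1)

x = torch.randn(16, 10)
y = torch.randint(0, 2, (16,))

noise = 0.1 * generator(x)                                 # epsilon = 0.1 (assumed)
loss = nn.functional.cross_entropy(classifier(x + noise), y)
opt.zero_grad()
(-loss).backward()                                         # ascend the classifier loss
opt.step()                                                 # only the generator updates
print(f"adversarial loss after one generator step: {loss.item():.3f}")
```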
arXiv Detail & Related papers (2020-06-22T10:44:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.