Turn Waste into Worth: Rectifying Top-$k$ Router of MoE
- URL: http://arxiv.org/abs/2402.12399v2
- Date: Wed, 21 Feb 2024 13:33:12 GMT
- Title: Turn Waste into Worth: Rectifying Top-$k$ Router of MoE
- Authors: Zhiyuan Zeng, Qipeng Guo, Zhaoye Fei, Zhangyue Yin, Yunhua Zhou,
Linyang Li, Tianxiang Sun, Hang Yan, Dahua Lin, Xipeng Qiu
- Abstract summary: MoE models are popular for training large language models due to their computational efficiency.
The commonly used top-$k$ routing mechanism suffers from redundant computation and memory costs due to unbalanced routing.
To address the resulting dropped tokens and padding, we propose the Rectify-Router, comprising the Intra-GPU Rectification and the Fill-in Rectification.
Their combination achieves superior performance, surpassing the accuracy of the vanilla top-1 router by 4.7%.
- Score: 111.12838294273033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse Mixture of Experts (MoE) models are popular for training large
language models due to their computational efficiency. However, the commonly
used top-$k$ routing mechanism suffers from redundant computation and memory
costs due to unbalanced routing. Some experts overflow, so the excess tokens
are dropped, while other experts are left vacant and padded with zeros,
negatively impacting model performance. To address the dropped
tokens and padding, we propose the Rectify-Router, comprising the Intra-GPU
Rectification and the Fill-in Rectification. The Intra-GPU Rectification
handles dropped tokens, efficiently routing them to experts within the GPU
where they are located to avoid inter-GPU communication. The Fill-in
Rectification addresses padding by replacing padding tokens with the tokens
that have high routing scores. Our experimental results demonstrate that the
Intra-GPU Rectification and the Fill-in Rectification effectively handle
dropped tokens and padding, respectively. Furthermore, the combination of them
achieves superior performance, surpassing the accuracy of the vanilla top-1
router by 4.7%.
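
The routing behavior described in the abstract can be made concrete with a short, self-contained sketch. This is not the authors' implementation: the random gating weights, the token-to-GPU sharding scheme, and names such as `rectified_top1_router` and `capacity_factor` are illustrative assumptions. The sketch only shows where vanilla top-1 routing with expert capacity drops tokens or leaves padded slots, and how the Intra-GPU Rectification and Fill-in Rectification described above would fill them.

```python
import torch

def rectified_top1_router(hidden, num_experts=8, num_gpus=2, capacity_factor=1.0):
    """hidden: [num_tokens, d_model]. Returns per-expert slot assignments (token ids)."""
    num_tokens, d_model = hidden.shape
    experts_per_gpu = num_experts // num_gpus
    tokens_per_gpu = num_tokens // num_gpus
    capacity = int(capacity_factor * num_tokens / num_experts)

    # 1) Vanilla top-1 routing: each token picks its highest-scoring expert.
    gate = torch.randn(d_model, num_experts)        # stand-in for a learned gate
    scores = torch.softmax(hidden @ gate, dim=-1)   # [num_tokens, num_experts]
    top_score, top_expert = scores.max(dim=-1)

    slots = torch.full((num_experts, capacity), -1) # -1 marks an empty (padded) slot
    fill = [0] * num_experts
    dropped = []
    for t in torch.argsort(top_score, descending=True).tolist():  # confident tokens first
        e = top_expert[t].item()
        if fill[e] < capacity:
            slots[e, fill[e]] = t
            fill[e] += 1
        else:
            dropped.append(t)  # expert overflow: the vanilla router drops this token

    # 2) Intra-GPU Rectification: re-route each dropped token to the best-scoring
    #    expert hosted on the token's own GPU (tokens assumed sharded by index),
    #    so no extra inter-GPU communication is needed.
    for t in dropped:
        gpu = min(t // tokens_per_gpu, num_gpus - 1)
        local = range(gpu * experts_per_gpu, (gpu + 1) * experts_per_gpu)
        free = [e for e in local if fill[e] < capacity]
        if free:
            e = max(free, key=lambda e: scores[t, e].item())
            slots[e, fill[e]] = t
            fill[e] += 1

    # 3) Fill-in Rectification: fill remaining vacant slots with the tokens that
    #    score highest for that expert, instead of padding with zeros.
    for e in range(num_experts):
        best = torch.argsort(scores[:, e], descending=True).tolist()
        i = 0
        while fill[e] < capacity:
            slots[e, fill[e]] = best[i]
            fill[e] += 1
            i += 1
    return slots

slots = rectified_top1_router(torch.randn(64, 32))  # 64 tokens, hidden size 32
```

The key design point illustrated here is that both rectifications work purely on the slot buffers left over by the capacity-limited top-1 router, so neither changes the expert computation itself.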
Related papers
- Accelerating MoE Model Inference with Expert Sharding [1.4733737463429546]
Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead. We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts.
arXiv Detail & Related papers (2025-03-11T14:15:01Z)
- MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing [0.6445605125467574]
Mixture-of-Experts (MoE) model architecture has emerged as a promising solution for scaling transformer models efficiently.
MoE models need to be distributed across GPU devices and thus face critical performance bottlenecks.
We propose an optimal expert-to-GPU assignment that minimizes token routing costs and balances token processing across devices.
arXiv Detail & Related papers (2025-02-10T16:34:36Z)
- Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model [20.979790612689992]
Mixture-of-Experts (MoE) has gained increasing attention in studying Large Vision-Language Models (LVLMs).
Existing MoE methods in LVLMs encourage different experts to handle different tokens, and they usually employ a router to predict the routing of each token.
This paper proposes a novel method based on token-level gradient analysis, i.e., Solving Token Gradient Conflict (STGC).
arXiv Detail & Related papers (2024-06-28T13:20:17Z)
- Finding Transformer Circuits with Edge Pruning [71.12127707678961]
We propose Edge Pruning as an effective and scalable solution to automated circuit discovery.
Our method finds circuits in GPT-2 that use less than half the number of edges compared to circuits found by previous methods.
Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale that prior methods operate on.
arXiv Detail & Related papers (2024-06-24T16:40:54Z)
- Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification [6.660834045805309]
Pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism.
We propose integrating two strategies: token pruning and token combining.
Experiments with various datasets demonstrate superior performance compared to baseline models.
arXiv Detail & Related papers (2024-06-03T12:51:52Z)
- GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation [30.343504537684755]
Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployments on resource-constrained devices remain challenging.
To expedite ViTs, token pruning and token merging approaches have been developed, which aim at reducing the number of tokens involved in computation.
We introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs.
arXiv Detail & Related papers (2023-11-06T11:14:19Z)
- PPT: Token Pruning and Pooling for Efficient Vision Transformers [7.792045532428676]
We propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT).
PPT integrates both token pruning and token pooling techniques in ViTs without additional trainable parameters.
It reduces over 37% FLOPs and improves the throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset.
arXiv Detail & Related papers (2023-10-03T05:55:11Z)
- Multi-Scale And Token Mergence: Make Your ViT More Efficient [3.087140219508349]
Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain.
We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens.
Our method achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
arXiv Detail & Related papers (2023-06-08T02:58:15Z)
- RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer [95.71132572688143]
This paper studies how to keep a vision backbone effective while removing token mixers in its basic building blocks.
Token mixers, as self-attention for vision transformers (ViTs), are intended to perform information communication between different spatial tokens but suffer from considerable computational cost and latency.
arXiv Detail & Related papers (2023-04-12T07:34:13Z)
- WR-ONE2SET: Towards Well-Calibrated Keyphrase Generation [57.11538133231843]
Keyphrase generation aims to automatically generate short phrases summarizing an input document.
The recently emerged ONE2SET paradigm generates keyphrases as a set and has achieved competitive performance.
We propose WR-ONE2SET which extends ONE2SET with an adaptive instance-level cost Weighting strategy and a target Re-assignment mechanism.
arXiv Detail & Related papers (2022-11-13T09:56:24Z)
- Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers [78.77361169167149]
We propose Gating Dropout, which allows tokens to ignore the gating network and stay at their local machines.
Similar to traditional dropout, we also show that Gating Dropout has a regularization effect during training, resulting in improved generalization performance.
arXiv Detail & Related papers (2022-05-28T05:12:43Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)