EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models
- URL: http://arxiv.org/abs/2308.14352v1
- Date: Mon, 28 Aug 2023 06:56:08 GMT
- Title: EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models
- Authors: Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, Mengwei Xu
- Abstract summary: EdgeMoE is an on-device inference engine tailored for mixture-of-expert (MoE) LLMs.
It achieves both memory and computational efficiency by strategically partitioning the model across the storage hierarchy.
It demonstrates substantial memory savings and performance improvements when compared to competitive baseline solutions.
- Score: 3.597163516372061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) such as GPTs and LLaMa have ushered in a
revolution in machine intelligence, owing to their exceptional capabilities in
a wide range of machine learning tasks. However, the transition of LLMs from
data centers to edge devices presents a set of challenges and opportunities.
While this shift can enhance privacy and availability, it is hampered by the
enormous parameter sizes of these models, leading to impractical runtime costs.
In light of these considerations, we introduce EdgeMoE, the first on-device
inference engine tailored for mixture-of-expert (MoE) LLMs, a popular variant
of sparse LLMs that exhibit nearly constant computational complexity as their
parameter size scales. EdgeMoE achieves both memory and computational
efficiency by strategically partitioning the model across the storage
hierarchy. Specifically, non-expert weights are stored in the device's memory,
while expert weights are kept in external storage and are fetched into memory
only when they are activated. This design is underpinned by a crucial insight
that expert weights, though voluminous, are infrequently accessed due to sparse
activation patterns. To further mitigate the overhead associated with expert
I/O swapping, EdgeMoE incorporates two innovative techniques: (1) Expert-wise
bitwidth adaptation: This method reduces the size of expert weights with an
acceptable level of accuracy loss. (2) Expert management: it predicts in advance
which experts will be activated and preloads them into the compute-I/O
pipeline, overlapping expert loading with computation. In empirical
evaluations conducted on well-established MoE LLMs and various edge devices,
EdgeMoE demonstrates substantial memory savings and performance improvements
when compared to competitive baseline solutions.
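The overall design can be summarized in a minimal Python sketch. This is an illustrative sketch, not EdgeMoE's implementation: the `ExpertStore` class, the on-flash `.npz` format, the single int8 scale, and the thread-based prefetch are assumptions standing in for the engine's per-expert bitwidth adaptation and its compute-I/O pipeline.

```python
import threading
from collections import OrderedDict

import numpy as np


class ExpertStore:
    """Keep non-expert weights resident in RAM; page expert weights in from flash on demand."""

    def __init__(self, expert_paths, cache_size=4):
        self.expert_paths = expert_paths   # {expert_id: path to a quantized .npz on flash}
        self.cache = OrderedDict()         # small in-RAM LRU cache of dequantized experts
        self.cache_size = cache_size
        self.lock = threading.Lock()

    def _load_from_flash(self, expert_id):
        # Expert weights sit quantized in external storage and are dequantized after the
        # read; EdgeMoE picks a per-expert bitwidth, a single int8 scale is assumed here.
        packed = np.load(self.expert_paths[expert_id])
        return packed["weight"].astype(np.float32) * packed["scale"]

    def get(self, expert_id):
        with self.lock:
            if expert_id in self.cache:            # hit: the expert is already resident
                self.cache.move_to_end(expert_id)
                return self.cache[expert_id]
        weight = self._load_from_flash(expert_id)  # miss: pay the storage read
        with self.lock:
            self.cache[expert_id] = weight
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)     # evict the least recently used expert
        return weight

    def prefetch(self, predicted_ids):
        # Warm the cache in the background so expert I/O overlaps with computation.
        for eid in predicted_ids:
            threading.Thread(target=self.get, args=(eid,), daemon=True).start()
```

At inference time, the router would call `get(expert_id)` for each activated expert and `prefetch(...)` with the experts predicted for upcoming tokens or layers, so storage reads overlap with computation.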
Related papers
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
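As a rough illustration of that cache-miss policy (not HOBBIT's API), the sketch below assumes caller-supplied loader callbacks, a per-expert criticality score such as the router probability, and an arbitrary 0.5 threshold.

```python
def select_expert_weights(expert_id, hot_cache, load_fp16, load_int4, criticality):
    """Serve a low-precision copy when a less critical expert misses the cache.

    `criticality` maps expert ids to an importance score (e.g. router probability);
    the 0.5 threshold and the int4/fp16 pairing are illustrative assumptions.
    """
    if expert_id in hot_cache:                 # hit: full-precision weights already resident
        return hot_cache[expert_id]
    if criticality.get(expert_id, 0.0) < 0.5:  # miss on a less critical expert:
        return load_int4(expert_id)            # serve the cheap-to-load low-precision copy
    weights = load_fp16(expert_id)             # miss on a critical expert: pay the full load
    hot_cache[expert_id] = weights
    return weights
```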
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose Read-ME, a novel framework that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- Mixture of A Million Experts [1.240096657086732]
This paper introduces PEER, a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of experts.
Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off.
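For context, product-key retrieval scores a query against two small sub-key sets and combines the shortlists, so only about 2*sqrt(N) keys are ever compared for a pool of N experts. The sketch below is a generic, unbatched rendering of that technique, not PEER's actual layer.

```python
import torch


def product_key_topk(query, sub_keys_1, sub_keys_2, k):
    """Shortlist k experts out of len(sub_keys_1) * len(sub_keys_2) candidates."""
    half = query.shape[-1] // 2
    q1, q2 = query[:half], query[half:]        # split the query into two halves
    s1 = sub_keys_1 @ q1                       # scores against the first sub-key set
    s2 = sub_keys_2 @ q2                       # scores against the second sub-key set
    top1, top2 = s1.topk(k), s2.topk(k)        # only 2k sub-key scores are kept ...
    # ... and the k*k shortlisted pairs are rescored by summing their two halves.
    pair_scores = top1.values[:, None] + top2.values[None, :]
    best = pair_scores.flatten().topk(k)
    rows, cols = best.indices // k, best.indices % k
    expert_ids = top1.indices[rows] * sub_keys_2.shape[0] + top2.indices[cols]
    return expert_ids, best.values
```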
arXiv Detail & Related papers (2024-07-04T20:59:20Z)
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models.
We theoretically prove that prioritizing the pruning of experts with a smaller change in the router's l2 norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
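The selection criterion itself is simple to state in code. Below is a minimal sketch under the assumption that each expert corresponds to one row of the router's weight matrix; the tensor names are illustrative.

```python
import torch


def prune_by_router_shift(router_w_pretrained, router_w_finetuned, num_keep):
    """Rank experts by how much their router row moved during fine-tuning.

    Per the criterion summarized above, experts whose router row changed the least
    (smallest l2 norm of the change) are pruned first.
    """
    shift = (router_w_finetuned - router_w_pretrained).norm(dim=1)  # per-expert l2 change
    order = shift.argsort(descending=True)                          # biggest movers first
    return order[:num_keep], order[num_keep:]                       # (kept, pruned) expert ids
```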
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage prunes the total number of experts using heavy-hitters counting as guidance, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
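A heavy-hitters count of this kind can be gathered from a short calibration run. The sketch below is a generic illustration of the counting step, with an assumed data format and a toy keep-6-of-8 example rather than SEER-MoE's actual pipeline.

```python
from collections import Counter


def heavy_hitter_counts(routing_records):
    """Tally how often each expert is selected over a calibration run.

    `routing_records` is an iterable of per-token lists of routed expert ids;
    the counts then guide which experts to drop (least-selected first).
    """
    counts = Counter()
    for token_experts in routing_records:
        counts.update(token_experts)
    return counts


routes = [[0, 3], [3, 5], [0, 1], [3, 7], [0, 5], [2, 3]]   # toy top-2 routing trace
kept = [eid for eid, _ in heavy_hitter_counts(routes).most_common(6)]
```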
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
- MELTing point: Mobile Evaluation of Language Transformers [8.238355633015068]
We explore the current state of mobile execution of Large Language Models (LLMs).
We have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device.
We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance.
arXiv Detail & Related papers (2024-03-19T15:51:21Z)
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
- Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts (µMoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z)
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements.
Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation.
We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality.
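The scheduling idea can be sketched as follows; every name here (`pre_gates`, `fetch_experts`, the top-2 choice) is an illustrative stand-in rather than the paper's module, and `fetch_experts` is assumed to kick off an asynchronous host-to-GPU copy and return a handle the block can consume.

```python
def pregated_forward(blocks, pre_gates, fetch_experts, x):
    """Run MoE blocks while prefetching the next block's experts.

    `pre_gates[i]` predicts, from block i's input, which experts block i+1 will
    need, so their weights can be copied in while block i computes.
    """
    experts = fetch_experts(layer=0, expert_ids=None)             # first block: eager load
    for i, block in enumerate(blocks):
        next_experts = None
        if i + 1 < len(blocks):
            next_ids = pre_gates[i](x).topk(2, dim=-1).indices    # pre-gate one layer ahead
            next_experts = fetch_experts(layer=i + 1, expert_ids=next_ids)  # async copy starts
        x = block(x, experts)                                     # compute overlaps the copy
        experts = next_experts
    return x
```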
arXiv Detail & Related papers (2023-08-23T11:25:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.