MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
- URL: http://arxiv.org/abs/2405.18832v1
- Date: Wed, 29 May 2024 07:23:29 GMT
- Title: MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
- Authors: Taehyun Kim, Kwanseok Choi, Youngmock Cho, Jaehoon Cho, Hyuk-Jae Lee, Jaewoong Sim
- Abstract summary: MoNDE reduces the volume of MoE parameter movement by transferring only the $\textit{hot}$ experts to the GPU.
MoNDE enables far more communication-efficient MoE inference, thereby resulting in substantial speedups.
- Score: 15.346491299728463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) large language models (LLM) have memory requirements that often exceed the GPU memory capacity, requiring costly parameter movement from secondary memories to the GPU for expert computation. In this work, we present Mixture of Near-Data Experts (MoNDE), a near-data computing solution that efficiently enables MoE LLM inference. MoNDE reduces the volume of MoE parameter movement by transferring only the $\textit{hot}$ experts to the GPU, while computing the remaining $\textit{cold}$ experts inside the host memory device. By replacing the transfers of massive expert parameters with the ones of small activations, MoNDE enables far more communication-efficient MoE inference, thereby resulting in substantial speedups over the existing parameter offloading frameworks for both encoder and decoder operations.
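The trade-off described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that experts are ranked by the number of tokens routed to them and that the top `num_hot` are treated as hot; the abstract does not state the actual selection rule. The names `Expert`, `split_hot_cold`, `transfer_bytes`, and all byte sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    expert_id: int
    num_tokens: int  # tokens routed to this expert in the current batch


def split_hot_cold(experts, num_hot):
    """Return (hot, cold): the num_hot most-loaded experts and the rest.

    Ranking by token count is an assumption made for this sketch,
    not a rule taken from the paper.
    """
    ranked = sorted(experts, key=lambda e: e.num_tokens, reverse=True)
    return ranked[:num_hot], ranked[num_hot:]


def transfer_bytes(hot, cold, param_bytes, act_bytes_per_token):
    """Estimate the bytes crossing the host-GPU link.

    Hot experts ship their parameters to the GPU. Cold experts stay near
    the host memory, so only their input and output activations move.
    """
    hot_traffic = len(hot) * param_bytes
    cold_traffic = sum(2 * e.num_tokens * act_bytes_per_token for e in cold)
    return hot_traffic + cold_traffic


# Hypothetical workload: one heavily used expert and three lightly used ones.
experts = [Expert(0, 120), Expert(1, 5), Expert(2, 3), Expert(3, 0)]
PARAM_BYTES = 1_000_000      # illustrative size of one expert's weights
ACT_BYTES_PER_TOKEN = 2_048  # illustrative size of one token's activation

hot, cold = split_hot_cold(experts, num_hot=1)
mixed = transfer_bytes(hot, cold, PARAM_BYTES, ACT_BYTES_PER_TOKEN)

all_hot, none_cold = split_hot_cold(experts, num_hot=len(experts))
gpu_only = transfer_bytes(all_hot, none_cold, PARAM_BYTES, ACT_BYTES_PER_TOKEN)

print(f"hot experts: {[e.expert_id for e in hot]}")
print(f"hot/cold split traffic: {mixed} bytes")
print(f"all experts on GPU:     {gpu_only} bytes")
```

With these made-up numbers, moving activations for the three lightly used experts costs far fewer bytes than moving their parameters, which is the effect the abstract attributes to computing cold experts near the data.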
Related papers
- fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving [9.956997242640728]
fMoE is a fine-grained expert offloading system for MoE serving.
We show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z) - Ultra-Sparse Memory Network [8.927205198458994]
This work introduces UltraMem, incorporating a large-scale, ultra-sparse memory layer to address these limitations.
Our approach significantly reduces inference latency while maintaining model performance.
In experiments, the largest UltraMem we train has 20 million memory slots.
arXiv Detail & Related papers (2024-11-19T09:24:34Z) - MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter [40.616849959987555]
We introduce a novel mechanism that fine-tunes Large Language Models (LLMs) with adapters that are larger yet memory-efficient.
This is achieved by leveraging the inherent activation sparsity in the Feed-Forward Networks (FFNs) of LLMs.
We employ a Mixture of Experts (MoE)-like architecture to mitigate unnecessary CPU computations and reduce the communication volume between the GPU and CPU.
arXiv Detail & Related papers (2024-06-07T14:49:22Z) - Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z) - QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models [64.34635279436054]
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing.
We present a solution to this memory problem, in the form of a new compression and execution framework called QMoE.
arXiv Detail & Related papers (2023-10-25T17:24:53Z) - EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models [3.597163516372061]
EdgeMoE is an on-device inference engine tailored for mixture-of-expert (MoE) LLMs.
It achieves both memory and computational efficiency by strategically partitioning the model across the storage hierarchy.
It demonstrates substantial memory savings and performance improvements when compared to competitive baseline solutions.
arXiv Detail & Related papers (2023-08-28T06:56:08Z) - Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements.
Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation.
We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality.
arXiv Detail & Related papers (2023-08-23T11:25:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.