Fast Inference of Mixture-of-Experts Language Models with Offloading
- URL: http://arxiv.org/abs/2312.17238v1
- Date: Thu, 28 Dec 2023 18:58:13 GMT
- Title: Fast Inference of Mixture-of-Experts Language Models with Offloading
- Authors: Artyom Eliseev, Denis Mazur
- Abstract summary: We study the problem of running large MoE language models on consumer hardware with limited accelerator memory.
Using this strategy, we can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances.
- Score: 0.7998559449733824
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the widespread adoption of Large Language Models (LLMs), many deep
learning practitioners are looking for strategies for running these models more
efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE), a
type of model architecture where only a fraction of model layers are active
for any given input. This property allows MoE-based language models to generate
tokens faster than their dense counterparts, but it also increases model size
due to having multiple experts. Unfortunately, this makes state-of-the-art MoE
language models difficult to run without high-end GPUs. In this work, we study
the problem of running large MoE language models on consumer hardware with
limited accelerator memory. We build upon parameter offloading algorithms and
propose a novel strategy that accelerates offloading by taking advantage of
innate properties of MoE LLMs. Using this strategy, we can run
Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google
Colab instances.
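
The sparse activation described in the abstract can be illustrated with a small sketch. It shows top-k routing, where a router scores every expert but only the k best are evaluated for a token, together with a toy cache that counts how often an expert would have to be copied into limited accelerator memory. The names `top_k_experts` and `ExpertCache`, the least-recently-used eviction policy, and the example logits are all illustrative assumptions; none of them is taken from the abstract, and this is not presented as the authors' implementation.

```python
import math
from collections import OrderedDict


def top_k_experts(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


class ExpertCache:
    """Hypothetical LRU cache standing in for limited accelerator memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert ids currently "on the GPU"
        self.loads = 0                 # simulated host-to-device transfers

    def fetch(self, expert_id):
        if expert_id in self.resident:
            # Cache hit: mark the expert as most recently used.
            self.resident.move_to_end(expert_id)
        else:
            # Cache miss: simulate loading the expert, evicting the oldest.
            self.loads += 1
            self.resident[expert_id] = True
            if len(self.resident) > self.capacity:
                self.resident.popitem(last=False)
        return expert_id


if __name__ == "__main__":
    cache = ExpertCache(capacity=2)
    # Made-up router logits for three consecutive tokens and four experts.
    tokens = ([0.1, 2.0, 0.3, 1.5], [0.2, 1.9, 0.1, 1.7], [3.0, 0.1, 0.2, 1.0])
    for logits in tokens:
        for expert_id, weight in top_k_experts(logits, k=2):
            cache.fetch(expert_id)
    print(cache.loads)  # 3 simulated transfers for 6 expert uses
```

In this toy run, consecutive tokens that pick the same experts reuse what is already resident, so only three of the six expert uses trigger a transfer. That is the kind of MoE-specific property an offloading scheme can exploit.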
Related papers
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model [4.6373877301731]
We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs).
We test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone.
The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models.
arXiv Detail & Related papers (2024-03-29T21:32:50Z)
- Memory Augmented Language Models through Mixture of Word Experts [5.0215187938544315]
We seek to aggressively decouple learning capacity and FLOPs through Mixture-of-Experts (MoE) style models with large knowledge-rich vocabulary based routing functions and experts.
We demonstrate that MoWE performs significantly better than the T5 family of models with a similar number of FLOPs in a variety of NLP tasks.
arXiv Detail & Related papers (2023-11-15T18:19:56Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
- Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production [7.056223012587321]
We introduce a highly efficient inference framework with several optimization approaches to accelerate the computation of sparse models.
We are able to deploy 136x larger models with 27% less cost and significantly better quality compared to the existing solutions.
arXiv Detail & Related papers (2022-11-18T03:43:52Z)
- Petals: Collaborative Inference and Fine-tuning of Large Models [78.37798144357977]
Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion parameters.
With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale.
We propose Petals, a system for collaborative inference and fine-tuning of large models that joins the resources of multiple parties.
arXiv Detail & Related papers (2022-09-02T17:38:03Z)
- Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.