QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
- URL: http://arxiv.org/abs/2310.16795v1
- Date: Wed, 25 Oct 2023 17:24:53 GMT
- Title: QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
- Authors: Elias Frantar and Dan Alistarh
- Abstract summary: Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing.
We present a solution to this memory problem in the form of a new compression and execution framework called QMoE.
- Score: 64.34635279436054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) architectures offer a general solution to the high
inference costs of large language models (LLMs) via sparse routing, bringing
faster and more accurate models, at the cost of massive parameter counts. For
example, the SwitchTransformer-c2048 model has 1.6 trillion parameters,
requiring 3.2TB of accelerator memory to run efficiently, which makes practical
deployment challenging and expensive. In this paper, we present a solution to
this memory problem in the form of a new compression and execution framework
called QMoE. Specifically, QMoE consists of a scalable algorithm which
accurately compresses trillion-parameter MoEs to less than 1 bit per parameter,
in a custom format co-designed with bespoke GPU decoding kernels to facilitate
efficient end-to-end compressed inference, with minor runtime overheads
relative to uncompressed execution. Concretely, QMoE can compress the 1.6
trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x
compression, 0.8 bits per parameter) at only minor accuracy loss, in less than
a day on a single GPU. This enables, for the first time, the execution of a
trillion-parameter model on affordable commodity hardware, like a single server
with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead
relative to ideal uncompressed inference. The source code and compressed models
are available at github.com/IST-DASLab/qmoe.
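How can a parameter cost less than one bit? Because after quantization the expert weights are ternary and dominated by zeros, a dictionary coder can amortize long zero runs across many parameters. Below is a minimal, hedged sketch of that effect; zlib stands in for QMoE's custom LZW-style format (which is co-designed for GPU decoding), and the 95% zero rate is an assumption for illustration, not a figure from the paper.
```python
# Hedged sketch of why sub-1-bit rates are reachable (not QMoE's
# actual codec): quantize weights to ternary codes, then
# dictionary-code the mostly-zero result.
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(2048, 2048)).astype(np.float32)

# Assumed threshold such that ~95% of codes are zero; the real zero
# fraction is data-dependent and comes from the paper's quantizer.
t = np.quantile(np.abs(w), 0.95)
q = np.zeros(w.shape, dtype=np.uint8)  # code 0: weight -> 0
q[w > t] = 1                           # code 1: weight -> +scale
q[w < -t] = 2                          # code 2: weight -> -scale

compressed = zlib.compress(q.tobytes(), level=9)
print(f"{8 * len(compressed) / q.size:.2f} bits per parameter")  # well below 1
```
QMoE itself decodes its format with bespoke GPU kernels, which is how the end-to-end overhead stays below 5%.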
Related papers
- Practical offloading for fine-tuning LLM on commodity GPU via learned subspace projectors [11.938205508966808]
Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU.
We present an offloading framework, LSP_Offload, that enables near-native speed LLM fine-tuning on commodity hardware.
arXiv Detail & Related papers (2024-06-14T16:59:11Z)
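Going by the title and summary of LSP_Offload above, the key lever is shrinking what crosses the PCIe bus: gradients are projected into a small learned subspace before offloading, and the CPU-computed update is expanded back on the GPU. A hedged sketch with random stand-in projectors and invented shapes, not the framework's actual API:
```python
# Hedged sketch: compress a gradient via subspace projectors before
# CPU offload, expand the update afterwards. P and Q are random
# stand-ins for the learned projectors.
import numpy as np

d_out, d_in, r = 4096, 4096, 64
rng = np.random.default_rng(0)
P = np.linalg.qr(rng.normal(size=(d_out, r)))[0]  # left projector
Q = np.linalg.qr(rng.normal(size=(d_in, r)))[0]   # right projector

grad = rng.normal(size=(d_out, d_in)).astype(np.float32)

g_small = P.T @ grad @ Q            # r x r matrix sent to the CPU
update_small = -1e-4 * g_small      # CPU-side optimizer step (plain SGD)
update = P @ update_small @ Q.T     # expanded back on the GPU

print(f"transfer shrunk {grad.nbytes / g_small.nbytes:.0f}x")
```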
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
AQLM is the first scheme that is optimal in terms of accuracy vs. model size when compressing to less than 3 bits per parameter.
We provide fast GPU and CPU implementations of AQLM for token generation.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
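For AQLM above, the "additive" part means a group of weights is stored as indices into several codebooks and reconstructed as the sum of the selected entries. A hedged sketch (greedy residual encoding; AQLM itself learns its codebooks and encodes with beam search):
```python
# Hedged sketch of additive quantization: a group of g weights becomes
# M codebook indices; decoding sums the selected entries.
import numpy as np

g, M, K = 8, 2, 256                     # group size, codebooks, entries each
rng = np.random.default_rng(0)
codebooks = rng.normal(0, 0.02, size=(M, K, g)).astype(np.float32)

def encode(group):
    codes, residual = [], group.copy()
    for m in range(M):                  # nearest entry per codebook, greedily
        idx = int(np.argmin(((codebooks[m] - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - codebooks[m][idx]
    return codes

def decode(codes):
    return sum(codebooks[m][codes[m]] for m in range(M))

group = rng.normal(0, 0.02, size=g).astype(np.float32)
codes = encode(group)
# Storage: M * log2(K) = 16 bits for 8 weights -> 2 bits per parameter.
print(codes, float(np.abs(group - decode(codes)).max()))
```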
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
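SqueezeLLM's idea (i) can be pictured as a weighted k-means over scalar weight values, where more sensitive weights pull the quantization grid toward themselves. A hedged sketch with random stand-in sensitivities (the paper derives them from second-order information):
```python
# Hedged sketch of sensitivity-based non-uniform quantization: the
# grid levels come from a k-means in which each weight is weighted
# by its sensitivity.
import numpy as np

def weighted_kmeans_1d(values, sens, k=8, iters=25):
    centroids = np.quantile(values, np.linspace(0, 1, k))   # quantile init
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            m = assign == j
            if m.any():                 # sensitive weights pull harder
                centroids[j] = np.average(values[m], weights=sens[m])
    assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, assign

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=10_000).astype(np.float32)
sens = rng.gamma(1.0, 1.0, size=w.shape)      # hypothetical sensitivities
levels, codes = weighted_kmeans_1d(w, sens)   # 8 levels = 3-bit codes
w_hat = levels[codes]                         # dequantized weights
print(levels)
```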
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation, at a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z)
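SpQR's near-lossless accuracy rests on a dense-plus-sparse split: a small fraction of outlier weights is kept exactly in a sparse structure, while everything else is quantized in small groups. A hedged sketch (threshold, group size, and the COO layout are illustrative choices, not SpQR's exact format):
```python
# Hedged sketch of an outlier-aware split: exact sparse outliers plus
# a group-quantized dense remainder.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(256, 256)).astype(np.float32)
w[rng.random(w.shape) < 0.005] *= 20            # inject a few outliers

thresh = np.quantile(np.abs(w), 0.99)           # top 1% -> sparse side
outlier_mask = np.abs(w) > thresh
rows, cols = np.nonzero(outlier_mask)           # COO sparse coordinates
outlier_vals = w[outlier_mask]

dense = np.where(outlier_mask, 0.0, w)
group = 16                                      # small quantization groups
d = dense.reshape(-1, group)
scale = np.abs(d).max(axis=1, keepdims=True) / 7 + 1e-12  # 4-bit symmetric
q = np.clip(np.round(d / scale), -8, 7).astype(np.int8)

w_hat = (q * scale).reshape(w.shape)
w_hat[rows, cols] = outlier_vals                # restore exact outliers
print("max err:", float(np.abs(w - w_hat).max()))
```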
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3- to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
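The size relationship in the summary above reduces to one formula: memory equals parameter count times bits per parameter. A quick worked example (weights only; activations and quantization metadata such as scales are ignored):
```python
# Memory footprint = parameters x bits per parameter.
def model_gib(n_params: float, bits: float) -> float:
    return n_params * bits / 8 / 2**30

for bits in (16, 8, 4, 3):
    print(f"{bits:>2}-bit 70B model: {model_gib(70e9, bits):6.1f} GiB")
```
At a fixed memory budget, more parameters at fewer bits can beat fewer parameters at more bits; the paper's sweep identifies 4-bit precision as the general sweet spot, as its title suggests.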
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [34.91478831993398]
GPTQ is a new one-shot weight quantization method based on approximate second-order information.
It can quantize GPT models with 175 billion parameters in approximately four GPU hours.
Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods.
arXiv Detail & Related papers (2022-10-31T13:42:40Z)
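GPTQ's second-order ingredient is the inverse Hessian of the layer-wise least-squares objective: each weight column is quantized in turn, and the resulting error is folded into the columns not yet quantized. A hedged sketch of that core loop (the paper's blocking, Cholesky formulation, and dampening details are omitted):
```python
# Hedged sketch of GPTQ-style error compensation with a proxy Hessian.
import numpy as np

def quant_rtn(x, scale):
    """Plain 4-bit round-to-nearest onto a symmetric grid."""
    return np.clip(np.round(x / scale), -8, 7) * scale

def gptq_sketch(W, X):
    """W: (rows, cols) layer weights; X: (cols, samples) calibration inputs."""
    H = X @ X.T / X.shape[1]                      # proxy second-order info
    Hinv = np.linalg.inv(H + 1e-2 * np.eye(H.shape[0]))
    W = W.copy()
    scale = np.abs(W).max() / 7
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = quant_rtn(W[:, i], scale)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])  # compensate the rest
    return Q

rng = np.random.default_rng(0)
W, X = rng.normal(0, 0.1, (32, 64)), rng.normal(size=(64, 512))
print("mean output err:", np.abs((W - gptq_sketch(W, X)) @ X).mean())
```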
- DKM: Differentiable K-Means Clustering Layer for Neural Network Compression [20.73169804006698]
We propose a differentiable k-means clustering layer (DKM) for train-time, weight-clustering-based model compression.
DKM casts k-means clustering as an attention problem and enables joint optimization of the parameters and clustering centroids.
We show that DKM delivers superior compression and accuracy trade-off on ImageNet1k and GLUE benchmarks.
arXiv Detail & Related papers (2021-08-28T14:35:41Z)
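Casting k-means as attention, as DKM does, replaces the hard nearest-centroid assignment with a softmax over negative distances, so both the weights and the centroids receive gradients. A hedged sketch of one such layer (temperature and sizes are illustrative):
```python
# Hedged sketch of differentiable k-means as attention: soft-assign
# each weight to centroids, update centroids as attention-weighted
# means, emit soft-quantized weights. Everything stays differentiable.
import torch

def dkm_layer(w: torch.Tensor, c: torch.Tensor, tau: float = 1e-2):
    dist = (w[:, None] - c[None, :]) ** 2          # (n, k) squared distances
    attn = torch.softmax(-dist / tau, dim=1)       # soft assignments
    c_new = (attn * w[:, None]).sum(0) / attn.sum(0).clamp_min(1e-12)
    return attn @ c_new, c_new                     # soft-quantized weights

w = (0.1 * torch.randn(1024)).requires_grad_()
c = torch.linspace(-0.2, 0.2, 16)                  # 16 centroids = 4-bit codes
w_soft, c_new = dkm_layer(w, c)
w_soft.sum().backward()                            # gradients reach w
print(c_new)
```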
- Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator.
We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z)
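The trick in this last paper is that a uniform quantizer with step size s makes errors that look like U(-s/2, s/2) noise, so adding such noise during training mimics quantization while gradients remain exact. A hedged sketch (fixed 4-bit step here; the paper's full method, e.g. how the amount of noise is controlled, is not reproduced):
```python
# Hedged sketch of pseudo quantization noise: add the noise a uniform
# quantizer with 2**bits levels would introduce, instead of applying
# the non-differentiable rounding itself.
import torch

def pqn(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    step = 2 * w.detach().abs().max() / (2 ** bits - 1)  # grid spacing
    noise = (torch.rand_like(w) - 0.5) * step            # independent noise
    return w + noise                                     # gradients stay exact

w = (0.1 * torch.randn(256, 256)).requires_grad_()
loss = pqn(w).sum()
loss.backward()
print(w.grad.mean())  # exactly 1: the noise did not distort the gradient
```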