Related papers: Matryoshka Quantization

Matryoshka Quantization

URL: http://arxiv.org/abs/2502.06786v1
Date: Mon, 10 Feb 2025 18:59:10 GMT
Title: Matryoshka Quantization
Authors: Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati,
Abstract summary: Quantizing model weights is critical for reducing the communication and inference costs of large models.<n>Matryoshka Quantization (MatQuant) is a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models.<n>MatQuant can be up to $10%$ more accurate than standard int2 quantization.
Score: 19.46665026740268
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. This paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models. It allows training and maintaining just one model, which can then be served at different precision levels. Furthermore, due to the co-training and co-distillation regularization provided by MatQuant, the int2 precision models extracted by MatQuant can be up to $10\%$ more accurate than standard int2 quantization (using techniques like QAT or OmniQuant). This represents significant progress in model quantization, demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.

Related papers

Low-bit Model Quantization for Deep Neural Networks: A Survey [123.89598730307208]
This article surveys the recent five-year progress towards low-bit quantization on deep neural networks (DNNs)<n>We discuss and compare the state-of-the-art quantization methods and classify them into 8 main categories and 24 sub-categories according to their core techniques.<n>We shed light on the potential research opportunities in the field of model quantization.
arXiv Detail & Related papers (2025-05-08T13:26:19Z)
Improving Quantization with Post-Training Model Expansion [0.35377121774178694]
Post-training model expansion is a viable strategy to improve model quality within a quantization co-design space. We show it is possible to progressively and selectively expand the size of a pre-trained large language model (LLM) to improve model quality without end-to-end retraining.
arXiv Detail & Related papers (2025-03-21T19:56:59Z)
ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization [58.84018707089315]
We present a unified framework for rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. We show that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
arXiv Detail & Related papers (2025-02-04T18:59:26Z)
FP=xINT:A Low-Bit Series Expansion Algorithm for Post-Training Quantization [3.560046736432574]
Post-Training Quantization (PTQ) converts pre-trained Full-Precision (FP) models into quantized versions without training. Existing methods significantly degrade performance and quantization efficiency at extremely low settings due to quantization noise. We introduce a deep model series expansion framework to address this issue, enabling rapid and accurate approximation of unquantized models without calibration sets or fine-tuning.
arXiv Detail & Related papers (2024-12-09T08:50:28Z)
GWQ: Gradient-Aware Weight Quantization for Large Language Models [63.89099994367657]
Large language models (LLMs) show impressive performance in solving complex language tasks.<n>LLMs to low bits can enable them to run on resource-constrained devices, often leading to performance degradation.<n>We propose gradient-aware weight quantization (GWQ), the first quantization approach for low-bit weight quantization.
arXiv Detail & Related papers (2024-10-30T11:16:04Z)
Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness [10.196942053244468]
Large Mixture of Experts (MoE) models could achieve state-of-the-art quality on various language tasks. MoQE is a simple weight-only quantization method applying ultra low-bit down to 2-bit quantizations only to expert weights. We show that low-bit quantization together with the MoE architecture delivers a reliable model performance.
arXiv Detail & Related papers (2023-10-03T20:11:23Z)
Understanding the Impact of Post-Training Quantization on Large Language Models [0.38073142980732994]
The study identifies nf4 as displaying greater resilience to temperature variations in the case of the llama2 series of models at lower temperature. Int8 quantization is associated with significantly slower inference speeds, whereas unquantized bfloat16 models consistently yield the fastest inference speeds across models of all sizes.
arXiv Detail & Related papers (2023-09-11T02:58:32Z)
FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search [50.07268323597872]
We propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models. With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% with equivalent model cost over previous methods. For the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.
arXiv Detail & Related papers (2023-08-07T04:17:19Z)
MRQ:Support Multiple Quantization Schemes through Model Re-Quantization [0.17499351967216337]
Deep learning models cannot be easily quantized for diverse fixed-point hardwares. New type of model quantization approach called model re-quantization is proposed. Models obtained from the re-quantization process have been successfully deployed on NNA in the Echo Show devices.
arXiv Detail & Related papers (2023-08-01T08:15:30Z)
SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference. We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
Modular Quantization-Aware Training for 6D Object Pose Estimation [52.9436648014338]
Edge applications demand efficient 6D object pose estimation on resource-constrained embedded platforms. We introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy. MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques.
arXiv Detail & Related papers (2023-03-12T21:01:54Z)
Analysis of Quantization on MLP-based Vision Models [36.510879540365636]
Quantization is taken as a model compression technique, which obtains efficient models by converting floating-point weights and activations in the neural network into lower-bit integers. We show in the paper that directly applying quantization to bounded-based models will lead to significant accuracy.
arXiv Detail & Related papers (2022-09-14T02:55:57Z)
One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment [36.75157407486302]
We propose a method to train a model for all quantization that supports diverse bit-widths. We use wavelet decomposition and reconstruction to increase the diversity of weights. Our method can achieve accuracy comparable to dedicated models trained at the same precision.
arXiv Detail & Related papers (2021-05-04T08:10:50Z)
Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z)
Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models. We show negligible WER change as compared to the full-precision baseline models. Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of the two sides. We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models. Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.