QTIP: Quantization with Trellises and Incoherence Processing
- URL: http://arxiv.org/abs/2406.11235v3
- Date: Tue, 29 Oct 2024 18:55:48 GMT
- Title: QTIP: Quantization with Trellises and Incoherence Processing
- Authors: Albert Tseng, Qingyao Sun, David Hou, Christopher De Sa
- Abstract summary: Post-training quantization (PTQ) reduces the memory footprint of LLMs.
Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once.
We introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization.
- Abstract: Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.
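To make the "bitshift" trellis concrete, below is a minimal NumPy sketch of a decoder in which the L-bit state is a sliding window over the bitstream, advanced by k fresh bits per step, and each state indexes one reconstructed weight. The parameters (L = 16, k = 2) and the random Gaussian lookup table are illustrative assumptions; QTIP's actual codes include computed, lookup-free variants in place of the table.

```python
import numpy as np

def bitshift_trellis_decode(bits, L=16, k=2, codebook=None, seed=0):
    """Decode a bitstream with a 'bitshift' trellis: the L-bit state is a
    sliding window over the stream, advanced by k new bits per step, and
    each state indexes one reconstructed weight from the codebook."""
    mask = (1 << L) - 1
    if codebook is None:
        # Illustrative random table; QTIP also describes computed, lookup-free codes.
        codebook = np.random.default_rng(seed).standard_normal(1 << L).astype(np.float32)
    # Fill the initial L-bit state window from the first L bits.
    state = 0
    for b in bits[:L]:
        state = ((state << 1) | int(b)) & mask
    weights = [codebook[state]]
    # Each further step shifts in k bits and emits one weight, so the
    # amortized rate is k bits/weight while the state space stays at 2**L.
    for i in range(L, len(bits) - k + 1, k):
        chunk = 0
        for b in bits[i:i + k]:
            chunk = (chunk << 1) | int(b)
        state = ((state << k) | chunk) & mask
        weights.append(codebook[state])
    return np.array(weights, dtype=np.float32)

# 2 bits per weight, 2**16 states, 64 weights decoded from one bitstream.
bits = np.random.default_rng(1).integers(0, 2, size=16 + 2 * 63)
print(bitshift_trellis_decode(bits).shape)  # (64,)
```

Because each step consumes only k new bits but conditions on the full L-bit state, the bitrate is decoupled from the codebook size, and the effective dimension grows with the length of the decoded sequence; at quantization time the encoder would search the trellis (e.g., a Viterbi-style pass) for the bitstream whose decoded sequence best matches the target weights.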
Related papers
- GPTQT: Quantize Large Language Models Twice to Push the Efficiency [1.3149617027696827]
This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed.
Practice has shown that directly minimizing the weight quantization error is ineffective and prone to overfitting.
GPTQT employs a progressive two-step approach: it first quantizes the weights with linear quantization to a relatively high bit width, then converts the obtained integer weights to a lower-bit binary coding.
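A minimal NumPy sketch of the two-step idea summarized above: first quantize the weights with uniform (linear) quantization at a relatively high bit width, then re-encode the resulting integers with far fewer codes. The bit widths and the second stage (a tiny 1-D k-means standing in for the lower-bit binary coding) are illustrative assumptions, not GPTQT's exact procedure.

```python
import numpy as np

def linear_quantize(w, bits=8):
    """Step 1: uniform ('linear') quantization of weights to a high bit width."""
    qmax = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / qmax
    q = np.clip(np.round((w - lo) / scale), 0, qmax).astype(np.int32)
    return q, scale, lo

def reencode_low_bit(q, bits=3, iters=20):
    """Step 2: re-encode the high-bit integers with only 2**bits codes.
    A tiny 1-D k-means over the integer levels stands in for the
    lower-bit binary coding described in the summary (an assumption)."""
    codes = np.linspace(q.min(), q.max(), 2 ** bits)
    for _ in range(iters):
        assign = np.abs(q[:, None] - codes[None, :]).argmin(axis=1)
        for c in range(len(codes)):
            if np.any(assign == c):
                codes[c] = q[assign == c].mean()
    return assign.astype(np.int8), codes

w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
q8, scale, lo = linear_quantize(w, bits=8)
idx, codes = reencode_low_bit(q8, bits=3)
w_hat = codes[idx] * scale + lo          # dequantized approximation
print(np.mean((w - w_hat) ** 2))
```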
arXiv Detail & Related papers (2024-07-03T08:08:01Z) - LCQ: Low-Rank Codebook based Quantization for Large Language Models [12.004172212239848]
We propose LCQ, a low-rank-codebook-based quantization method for large language models.
Experiments show LCQ can achieve better accuracy than existing methods with negligible extra storage cost.
arXiv Detail & Related papers (2024-05-31T16:21:05Z) - GPTVQ: The Blessing of Dimensionality for LLM Quantization [16.585681547799762]
We show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality.
We propose GPTVQ, a new, fast post-training vector quantization (VQ) method that scales well to Large Language Models (LLMs).
Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE.
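The interleaving described above is the GPTQ-style update: quantize a column, then fold its quantization error into the still-unquantized columns using the inverse Hessian of the layer-output MSE. A minimal NumPy sketch follows; the scalar uniform grid stands in for GPTVQ's vector quantizer, and the damping value is an assumption.

```python
import numpy as np

def nearest_grid(x, bits=4):
    """Round a column to a uniform grid (stand-in for GPTVQ's vector quantizer)."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return lo + scale * np.clip(np.round((x - lo) / scale), 0, levels)

def quantize_with_hessian_updates(W, X, bits=4, damp=0.01):
    """Quantize columns one at a time; after each column, fold its quantization
    error into the still-unquantized columns using the inverse Hessian of the
    layer-output MSE (H = X^T X). GPTVQ itself quantizes small vectors of
    weights rather than scalars."""
    W = W.copy()
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = nearest_grid(W[:, j], bits)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        if j + 1 < W.shape[1]:
            W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

rng = np.random.default_rng(0)
W, X = rng.standard_normal((64, 32)), rng.standard_normal((256, 32))
Q = quantize_with_hessian_updates(W, X)
print(np.linalg.norm(X @ (W - Q).T) / np.linalg.norm(X @ W.T))
```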
arXiv Detail & Related papers (2024-02-23T13:39:16Z) - QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks [37.66253003964376]
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low-precision.
We introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes.
Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference.
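A minimal NumPy sketch of the Hadamard incoherence step: rotate the weight matrix with random-sign Hadamard transforms so that no single entry is an outlier, quantize in the rotated basis, and rotate back. The crude 2-bit uniform grid below stands in for QuIP#'s lattice codebooks (an assumption).

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def incoherence_process(W, seed=0):
    """Rotate W with random-sign Hadamard transforms on both sides, quantize in
    the rotated (incoherent) basis, then rotate back. The 2-bit uniform grid is
    a stand-in for QuIP#'s lattice codebooks."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    sl, sr = rng.choice([-1.0, 1.0], m), rng.choice([-1.0, 1.0], n)
    U = hadamard(m) * sl[None, :]         # orthogonal rotation, left side
    V = hadamard(n) * sr[None, :]         # orthogonal rotation, right side
    Wr = U.T @ W @ V                      # incoherent (rotated) weights
    scale = np.abs(Wr).max() / 1.5
    Q = np.clip(np.round(Wr / scale), -2, 1)   # crude 4-level (2-bit) grid
    return U @ (Q * scale) @ V.T          # rotate back after quantization

W = np.random.default_rng(1).standard_normal((64, 128))
W_hat = incoherence_process(W)
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```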
arXiv Detail & Related papers (2024-02-06T20:52:12Z) - Weight Re-Mapping for Variational Quantum Algorithms [54.854986762287126]
We introduce the concept of weight re-mapping for variational quantum circuits (VQCs).
We employ seven distinct weight re-mapping functions to assess their impact on eight classification datasets.
Our results indicate that weight re-mapping can enhance the convergence speed of the VQC.
arXiv Detail & Related papers (2023-06-09T09:42:21Z) - TeD-Q: a tensor network enhanced distributed hybrid quantum machine learning framework [59.07246314484875]
TeD-Q is an open-source software framework for quantum machine learning.
It seamlessly integrates classical machine learning libraries with quantum simulators.
It provides a graphical mode in which the quantum circuit and the training progress can be visualized in real-time.
arXiv Detail & Related papers (2023-01-13T09:35:05Z) - Improving Convergence for Quantum Variational Classifiers using Weight Re-Mapping [60.086820254217336]
In recent years, quantum machine learning has seen a substantial increase in the use of variational quantum circuits (VQCs).
We introduce weight re-mapping for VQCs, to unambiguously map the weights to an interval of length $2\pi$.
We demonstrate that weight re-mapping increased test accuracy on the Wine dataset by $10\%$ over using unmodified weights.
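The two weight re-mapping entries above compare several mapping functions; the sketch below shows two plausible choices (a tanh-based map and hard clipping) only to make the "interval of length $2\pi$" idea concrete. Treating the mapped values as rotation angles in a VQC is an assumption, not taken from the summaries.

```python
import numpy as np

def remap_tanh(theta):
    """Smoothly map unconstrained trainable parameters to [-pi, pi]
    (an interval of length 2*pi) before using them as circuit angles."""
    return np.pi * np.tanh(theta)

def remap_clamp(theta):
    """A piecewise-linear alternative: hard-clip to [-pi, pi]."""
    return np.clip(theta, -np.pi, np.pi)

theta = np.array([-8.0, -1.0, 0.0, 2.5, 9.0])
print(remap_tanh(theta))   # all values now lie in (-pi, pi)
print(remap_clamp(theta))
```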
arXiv Detail & Related papers (2022-12-22T13:23:19Z) - QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization [54.44028700760694]
Post-training quantization (PTQ) has attracted much attention as a way to produce efficient neural networks without lengthy retraining.
In this study, we confirm for the first time that properly incorporating activation quantization into the PTQ reconstruction benefits the final accuracy.
Based on this conclusion, a simple yet effective approach dubbed QDROP is proposed, which randomly drops the quantization of activations during PTQ.
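A minimal NumPy sketch of the dropping idea described above: during PTQ reconstruction, each activation element keeps its full-precision value with some probability and is fake-quantized otherwise. The element-wise granularity and the 4-bit symmetric quantizer are illustrative assumptions.

```python
import numpy as np

def fake_quant(x, bits=4):
    """Simple symmetric uniform fake-quantization of activations."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax if np.abs(x).max() > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def qdrop_activation(x, drop_prob=0.5, bits=4, rng=None):
    """Randomly 'drop' activation quantization: each element keeps its
    full-precision value with probability drop_prob and is fake-quantized
    otherwise (element-wise dropping is an assumed granularity)."""
    rng = rng or np.random.default_rng(0)
    keep_fp = rng.random(x.shape) < drop_prob
    return np.where(keep_fp, x, fake_quant(x, bits))

x = np.random.default_rng(2).standard_normal((4, 8)).astype(np.float32)
print(qdrop_activation(x))
```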
arXiv Detail & Related papers (2022-03-11T04:01:53Z) - Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of pre-trained language models (PLMs) and propose module-wise quantization error minimization (MREM) as an efficient solution.
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
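A minimal NumPy sketch of module-wise error minimization for a single linear module: choose the weight-quantization scale that minimizes the MSE between the quantized and full-precision module outputs on calibration data. The grid search over scales stands in for the paper's optimization procedure (an assumption).

```python
import numpy as np

def quantize_weight(W, scale, bits=4):
    """Symmetric uniform weight quantization at a given scale."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def calibrate_module(W, X, bits=4, n_grid=100):
    """Pick the scale that minimizes the module-output reconstruction MSE
    on calibration inputs X (a simple 1-D grid search)."""
    y_fp = X @ W.T
    base = np.abs(W).max() / (2 ** (bits - 1) - 1)
    best_scale, best_err = None, np.inf
    for s in np.linspace(0.3, 1.2, n_grid) * base:
        err = np.mean((X @ quantize_weight(W, s, bits).T - y_fp) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale, best_err

rng = np.random.default_rng(0)
W, X = rng.standard_normal((32, 64)), rng.standard_normal((128, 64))
print(calibrate_module(W, X))
```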
arXiv Detail & Related papers (2021-09-30T12:50:06Z) - Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
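A hedged sketch of "dropping bits instead of neurons": with some probability per element, the least significant bit of the quantized integer is cleared, coarsening that element's precision for the step. The LSB-masking scheme below is an illustrative guess, not the paper's exact DropBits formulation.

```python
import numpy as np

def quantize_int(x, bits=8):
    """Symmetric uniform quantization to signed integers plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax if np.abs(x).max() > 0 else 1.0
    return np.round(x / scale).astype(np.int32), scale

def drop_bits(x, bits=8, drop_prob=0.3, rng=None):
    """Randomly coarsen precision: with probability drop_prob per element,
    clear the least significant bit of the quantized integer (illustrative
    stand-in for randomly dropping bits instead of neurons)."""
    rng = rng or np.random.default_rng(0)
    q, scale = quantize_int(x, bits)
    mask = rng.random(x.shape) < drop_prob
    q = np.where(mask, (q >> 1) << 1, q)      # clear the LSB where masked
    return q * scale

x = np.random.default_rng(3).standard_normal(10).astype(np.float32)
print(drop_bits(x))
```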
arXiv Detail & Related papers (2021-09-05T15:15:07Z) - BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction [29.040991149922615]
We study the challenging task of neural network quantization without end-to-end retraining, called post-training quantization (PTQ).
We propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to INT2 for the first time.
For the first time, we show that, without bells and whistles, PTQ can attain 4-bit ResNet and MobileNetV2 models comparable with QAT, while producing quantized models 240 times faster.
arXiv Detail & Related papers (2021-02-10T13:46:16Z)