QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
- URL: http://arxiv.org/abs/2502.05003v1
- Date: Fri, 07 Feb 2025 15:23:34 GMT
- Title: QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
- Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh
- Abstract summary: QuEST is a new method for training sparse or quantized language models.
We show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions.
We provide GPU kernel support showing that models produced by QuEST can be executed efficiently.
- Score: 27.644652093888745
- License:
- Abstract: One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state of the art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4 bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
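The two ingredients above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (the repository linked in the abstract contains that): the dense Sylvester-Hadamard construction, the grid search for an MSE-optimal clipping scale, and the rule that drops gradients on coordinates with large quantization error (a crude stand-in for the trust estimator) are all illustrative assumptions.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Orthonormal Sylvester-Hadamard matrix of size n (n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / H.shape[0] ** 0.5

def mse_optimal_scale(x: torch.Tensor, bits: int, grid: int = 64) -> torch.Tensor:
    """Grid-search the clipping scale that minimizes ||x - Q(x)||^2."""
    qmax = 2 ** (bits - 1) - 1 if bits > 1 else 1
    best_s, best_err = x.abs().max(), float("inf")
    for frac in torch.linspace(0.1, 1.0, grid):
        s = x.abs().max() * frac
        q = torch.clamp((x / s * qmax).round(), -qmax, qmax) * s / qmax
        err = (x - q).pow(2).sum().item()
        if err < best_err:
            best_s, best_err = s, err
    return best_s

class HadamardQuant(torch.autograd.Function):
    """Forward: rotate with a Hadamard matrix, quantize with an MSE-optimal scale,
    rotate back. Backward: straight-through, but zero the gradient on coordinates
    whose quantization error exceeds one quantization step."""
    @staticmethod
    def forward(ctx, x, bits):
        H = hadamard(x.shape[-1]).to(x)
        xr = x @ H                                        # smoother, near-Gaussian coordinates
        s = mse_optimal_scale(xr, bits)
        qmax = 2 ** (bits - 1) - 1 if bits > 1 else 1
        q = torch.clamp((xr / s * qmax).round(), -qmax, qmax) * s / qmax
        trusted = ((xr - q).abs() <= s / qmax).to(x.dtype)
        ctx.save_for_backward(trusted, H)
        return q @ H.T                                    # rotate back
    @staticmethod
    def backward(ctx, grad_out):
        trusted, H = ctx.saved_tensors
        return ((grad_out @ H) * trusted) @ H.T, None

x = torch.randn(8, 16, requires_grad=True)
y = HadamardQuant.apply(x, 4)
y.sum().backward()            # x.grad holds the masked straight-through gradient
```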
Related papers
- Nearly Lossless Adaptive Bit Switching [8.485009775430411]
Experimental results on ImageNet-1K classification demonstrate that our method compares favorably with state-of-the-art one-shot joint QAT in both the multi-precision and mixed-precision settings.
arXiv Detail & Related papers (2025-02-03T09:46:26Z)
- HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs [45.37278584462772]
We present HALO, a novel quantization-aware training approach for Transformers.
Our approach ensures that all large matrix multiplications during the forward and backward passes are executed in lower precision.
Applied to LLaMA-family models, HALO achieves results nearly equivalent to full precision during fine-tuning on various tasks.
arXiv Detail & Related papers (2025-01-05T18:41:54Z)
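As a generic illustration of running both the forward and backward matrix multiplications on quantized operands, here is a sketch using a custom autograd function with symmetric int8 fake quantization; HALO's actual Hadamard placement, precision choices, and kernels are not reproduced here.

```python
import torch

def fake_int8(t: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor fake quantization to the int8 range."""
    s = t.abs().max().clamp(min=1e-8) / 127
    return (t / s).round().clamp(-127, 127) * s

class LowPrecisionLinear(torch.autograd.Function):
    """The forward GEMM and both backward GEMMs run on quantized operands."""
    @staticmethod
    def forward(ctx, x, w):
        xq, wq = fake_int8(x), fake_int8(w)
        ctx.save_for_backward(xq, wq)
        return xq @ wq.T
    @staticmethod
    def backward(ctx, grad_out):
        xq, wq = ctx.saved_tensors
        gq = fake_int8(grad_out)
        return gq @ wq, gq.T @ xq     # grad wrt x, grad wrt w

x = torch.randn(32, 64, requires_grad=True)
w = torch.randn(128, 64, requires_grad=True)
y = LowPrecisionLinear.apply(x, w)
y.pow(2).mean().backward()
```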
- GWQ: Gradient-Aware Weight Quantization for Large Language Models [63.89099994367657]
Large language models (LLMs) show impressive performance in solving complex language tasks.
Compressing LLMs to low bit-widths enables them to run on resource-constrained devices, but often leads to performance degradation.
We propose gradient-aware weight quantization (GWQ), the first low-bit weight quantization approach that leverages gradient information to identify quantization-sensitive weights.
arXiv Detail & Related papers (2024-10-30T11:16:04Z)
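A rough sketch of the gradient-aware idea, under the assumption that sensitivity is scored as |gradient x weight| from a calibration batch and that roughly 1% of weights are kept in full precision; the scoring rule and retention ratio are illustrative, not GWQ's exact recipe.

```python
import torch

def gradient_aware_quantize(w, grad, bits=4, keep_frac=0.01):
    """Quantize most weights to `bits`, keeping the weights with the largest
    gradient-based sensitivity score (here |grad * w|) in full precision."""
    score = (grad * w).abs()
    k = max(1, int(keep_frac * w.numel()))
    keep = torch.zeros_like(w, dtype=torch.bool)
    keep.view(-1)[score.view(-1).topk(k).indices] = True

    qmax = 2 ** (bits - 1) - 1
    s = w[~keep].abs().max().clamp(min=1e-8) / qmax
    wq = (w / s).round().clamp(-qmax, qmax) * s
    return torch.where(keep, w, wq)      # full-precision outliers, low-bit bulk

w = torch.randn(256, 256)
g = torch.randn_like(w)                  # stand-in for a calibration-batch gradient
w_mixed = gradient_aware_quantize(w, g)
```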
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z)
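A toy two-phase sketch of the Block-AP / E2E-QP split described above: phase one trains all parameters of each quantized block to match a full-precision reference block; phase two freezes the weights and trains only the quantization step sizes end to end. The layer definition, losses, and hyperparameters are illustrative assumptions.

```python
import torch, torch.nn as nn

class QuantLinear(nn.Module):
    """Linear layer with fake-quantized weights and a learnable quantization step."""
    def __init__(self, d_in, d_out, bits=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.05)
        self.qmax = 2 ** (bits - 1) - 1
        self.scale = nn.Parameter(self.weight.abs().max().detach() / self.qmax)
    def forward(self, x):
        s = self.scale.clamp(min=1e-6)
        w = self.weight / s
        w = w + (w.round().clamp(-self.qmax, self.qmax) - w).detach()   # STE
        return x @ (w * s).T

torch.manual_seed(0)
q_blocks = nn.ModuleList([QuantLinear(64, 64) for _ in range(4)])
fp_blocks = [nn.Linear(64, 64, bias=False) for _ in range(4)]           # float reference
calib = torch.randn(512, 64)

# Phase 1 (Block-AP): train ALL parameters of each quantized block, one block
# at a time, to reproduce the output of its full-precision counterpart.
for qb, fb in zip(q_blocks, fp_blocks):
    opt = torch.optim.Adam(qb.parameters(), lr=1e-3)
    for _ in range(200):
        x = calib[torch.randint(0, 512, (64,))]
        with torch.no_grad():
            target = fb(x)
        loss = (qb(x) - target).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

# Phase 2 (E2E-QP): freeze the weights, train only the quantization steps end to end.
for qb in q_blocks:
    qb.weight.requires_grad_(False)
opt = torch.optim.Adam([qb.scale for qb in q_blocks], lr=1e-4)
for _ in range(200):
    out = calib[torch.randint(0, 512, (64,))]
    for qb in q_blocks:
        out = qb(out)
    loss = out.pow(2).mean()        # stand-in for the real end-to-end task loss
    opt.zero_grad(); loss.backward(); opt.step()
```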
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts.
We propose a test-time Forward-Optimization Adaptation (FOA) method.
FOA runs on quantized 8-bit ViT, outperforms gradient-based TENT on full-precision 32-bit ViT, and achieves an up to 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z)
- TEQ: Trainable Equivalent Transformation for Quantization of LLMs [1.0376648762140632]
We present TEQ, a trainable equivalent transformation that preserves the FP32 precision of the model output while taking advantage of low-precision quantization.
The training process is lightweight, requiring only 1K steps and fewer than 0.1 percent of the original model's trainable parameters.
arXiv Detail & Related papers (2023-10-17T02:42:34Z)
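A sketch of the equivalent-transformation idea: a trainable per-channel scale is applied to the weights and its inverse to the activations, which leaves the FP32 output mathematically unchanged, while the scale is optimized so the transformed weights quantize with less error. The log-scale parameterization, the STE, and the optimizer are illustrative choices, not necessarily TEQ's.

```python
import torch, torch.nn as nn

def fake_quant(w, bits=4):
    """Per-row symmetric fake quantization with a straight-through gradient."""
    qmax = 2 ** (bits - 1) - 1
    s = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = (w / s).round().clamp(-qmax, qmax) * s
    return w + (q - w).detach()

# Equivalence: y = x @ W.T == (x / s) @ (W * s).T for a per-input-channel scale s,
# so the FP32 output is preserved while W * s becomes easier to quantize.
d_in, d_out = 64, 128
W = torch.randn(d_out, d_in)
x = torch.randn(8, d_in)

log_s = nn.Parameter(torch.zeros(d_in))        # trainable per-channel scales, s = exp(log_s) > 0
opt = torch.optim.Adam([log_s], lr=1e-2)
for _ in range(200):
    s = log_s.exp()
    y_fp = x @ W.T
    y_q = (x / s) @ fake_quant(W * s).T        # quantize the transformed weights
    loss = (y_q - y_fp).pow(2).mean()          # train s to shrink the quantization error
    opt.zero_grad(); loss.backward(); opt.step()
```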
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
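A sketch of a hybrid scheme in QUIK's spirit: a few high-magnitude feature columns stay in full precision while the bulk of the matrix multiplication runs on 4-bit fake-quantized operands. Outlier selection by per-column activation magnitude and the fixed outlier count are illustrative assumptions.

```python
import torch

def quik_style_matmul(x, w, bits=4, n_outlier_cols=8):
    """Hybrid GEMM sketch: the input features with the largest magnitudes stay
    in full precision; everything else is fake-quantized to `bits`."""
    qmax = 2 ** (bits - 1) - 1
    outliers = x.abs().amax(dim=0).topk(n_outlier_cols).indices
    base = torch.ones(x.shape[1], dtype=torch.bool)
    base[outliers] = False

    def q(t):                         # per-row symmetric fake quantization
        s = t.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        return (t / s).round().clamp(-qmax, qmax) * s

    y_low = q(x[:, base]) @ q(w[:, base]).T      # low-bit bulk of the GEMM
    y_fp = x[:, outliers] @ w[:, outliers].T     # full-precision outlier slice
    return y_low + y_fp

x = torch.randn(16, 256)
w = torch.randn(512, 256)
y = quik_style_matmul(x, w)
```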
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
This simple and effective approach makes low-bit quantization of LLMs more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z)
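A rough sketch of the norm-tweaking idea, assuming that only the normalization layer's affine parameters are calibrated, with a simple MSE objective against the float model on calibration data; the paper's actual loss and calibration procedure may differ.

```python
import torch, torch.nn as nn

class Block(nn.Module):
    """Toy block: LayerNorm followed by a (to-be-quantized) linear projection."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.proj = nn.Linear(d, d, bias=False)
    def forward(self, x):
        return self.proj(self.norm(x))

torch.manual_seed(0)
fp_block, q_block = Block(64), Block(64)
q_block.load_state_dict(fp_block.state_dict())
with torch.no_grad():                           # crude 4-bit PTQ of the projection
    w, qmax = q_block.proj.weight, 7
    s = w.abs().max() / qmax
    w.copy_((w / s).round().clamp(-qmax, qmax) * s)
q_block.proj.weight.requires_grad_(False)

# Norm tweaking: train ONLY the LayerNorm affine parameters so the quantized
# block's outputs match the float block's outputs on calibration data.
opt = torch.optim.Adam(q_block.norm.parameters(), lr=1e-3)
calib = torch.randn(1024, 64)
for _ in range(200):
    x = calib[torch.randint(0, 1024, (64,))]
    with torch.no_grad():
        target = fp_block(x)
    loss = (q_block(x) - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```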
- CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
arXiv Detail & Related papers (2022-07-28T16:13:28Z)
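A sketch of a compression-aware update in CrAM's spirit on a toy least-squares problem: the gradient is evaluated at a compressed (here, magnitude-pruned) copy of the weights and then applied to the dense weights. The pruning operator, sparsity level, and optimizer are illustrative assumptions, not the paper's exact update rule.

```python
import torch

def topk_prune(w, sparsity=0.5):
    """Magnitude pruning: zero out the smallest-magnitude fraction of weights."""
    k = int(w.numel() * (1 - sparsity))
    mask = torch.zeros_like(w, dtype=torch.bool)
    mask.view(-1)[w.abs().view(-1).topk(k).indices] = True
    return w * mask

torch.manual_seed(0)
A, b = torch.randn(256, 64), torch.randn(256)
w = torch.randn(64, requires_grad=True)
opt = torch.optim.SGD([w], lr=1e-2)

for _ in range(500):
    # gradient measured at the *compressed* weights, applied to the *dense* weights
    w_c = topk_prune(w.detach(), sparsity=0.5).requires_grad_(True)
    loss = (A @ w_c - b).pow(2).mean()
    loss.backward()
    opt.zero_grad()
    w.grad = w_c.grad.clone()
    opt.step()
```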
- Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [8.106641866299377]
Current practices rely on heuristically chosen scalars to set the clipping threshold and cannot be shown to be optimal.
We propose Optimally Clipped Tensors And Vectors (OCTAV), an algorithm to determine MSE-optimal clipping scalars.
OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the quantization-aware training (QAT) routine.
arXiv Detail & Related papers (2022-06-13T22:15:21Z)
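The MSE-optimal clipping scalar can be characterized as a fixed point that balances clipping noise (from values beyond the threshold) against rounding noise (from values within it). The recursion below sketches that idea under a uniform rounding-noise model; it is an illustration, not necessarily the paper's exact formulation.

```python
import torch

def mse_optimal_clip(x: torch.Tensor, bits: int = 4, iters: int = 20) -> torch.Tensor:
    """Fixed-point iteration for an MSE-optimal clipping scalar s: the optimum
    trades off clipping noise on |x| > s against rounding noise
    (~ s^2 * 4**-bits / 3 per in-range element, assuming uniform rounding error)."""
    a = x.abs().flatten()
    s = a.mean()                                  # any positive initialization works
    for _ in range(iters):
        clipped = a > s
        s = (a * clipped).sum() / (
            (4.0 ** -bits / 3.0) * (~clipped).sum() + clipped.sum()
        )
    return s

x = torch.randn(1 << 16)
s = mse_optimal_clip(x, bits=4)
qmax = 2 ** (4 - 1) - 1
x_q = (x / s * qmax).clamp(-qmax, qmax).round() * s / qmax   # quantize with the optimal clip
```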
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients are approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
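The mechanism above, with Quant-Noise's twist of quantizing only a random subset of weights at each step so that gradients still reach the rest, can be sketched as follows; the bit-width, the per-tensor scale, and the 50% noise rate are illustrative choices.

```python
import torch

def quantize_with_noise(w: torch.Tensor, bits: int = 8, rate: float = 0.5) -> torch.Tensor:
    """Quant-Noise-style fake quantization: only a random fraction `rate` of the
    weights is quantized each step; gradients pass straight through (STE)."""
    qmax = 2 ** (bits - 1) - 1
    s = w.abs().max().clamp(min=1e-8) / qmax
    w_q = (w / s).round().clamp(-qmax, qmax) * s
    noise_mask = (torch.rand_like(w) < rate).float()
    w_noisy = w * (1 - noise_mask) + w_q * noise_mask
    return w + (w_noisy - w).detach()    # forward: partially quantized; backward: identity

w = torch.randn(128, 128, requires_grad=True)
loss = (quantize_with_noise(w, bits=4) ** 2).sum()
loss.backward()                          # w.grad is the straight-through gradient
```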