Attention-aware Post-training Quantization without Backpropagation
- URL: http://arxiv.org/abs/2406.13474v1
- Date: Wed, 19 Jun 2024 11:53:21 GMT
- Title: Attention-aware Post-training Quantization without Backpropagation
- Authors: Junhan Kim, Ho-young Kim, Eulrang Cho, Chungman Lee, Joonyoung Kim, Yongkweon Jeon
- Abstract summary: Quantization is a promising solution for deploying large-scale language models on resource-constrained devices.
Existing quantization approaches rely on gradient-based optimization.
We propose a novel PTQ algorithm that considers inter-layer dependencies without relying on backpropagation.
- Score: 11.096116957844014
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Quantization is a promising solution for deploying large-scale language models (LLMs) on resource-constrained devices. Existing quantization approaches, however, rely on gradient-based optimization, whether post-training quantization (PTQ) or quantization-aware training (QAT), which becomes problematic for hyper-scale LLMs with billions of parameters. This overhead can be alleviated by recently proposed backpropagation-free PTQ methods; however, their performance is somewhat limited because they do not consider inter-layer dependencies. In this paper, we thus propose a novel PTQ algorithm that considers inter-layer dependencies without relying on backpropagation. The fundamental concept is the development of attention-aware Hessian matrices, which facilitate accounting for inter-layer dependencies within the attention module. Extensive experiments demonstrate that the proposed algorithm significantly outperforms conventional PTQ methods, particularly for low bit-widths.
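As a rough, hypothetical illustration of the Hessian-based, backpropagation-free setting described above, the sketch below accumulates a proxy Hessian H = sum_i x_i x_i^T from calibration activations and rounds a weight matrix column by column while redistributing the rounding error through the inverse Hessian (the generic OBQ/GPTQ-style recipe). The function names and the plain per-layer Hessian are assumptions for illustration only; the paper's contribution is an attention-aware Hessian that additionally captures dependencies between the layers inside the attention module, which this sketch does not reproduce.

```python
import torch

def proxy_hessian(calib_inputs, dim):
    """Accumulate H = sum_i x_i x_i^T from calibration activations
    (here: inputs to a linear layer inside the attention module)."""
    H = torch.zeros(dim, dim)
    for x in calib_inputs:                               # each x: (..., dim)
        x = x.reshape(-1, dim)
        H += x.T @ x
    return H

def quantize_layer(W, H, bits=4, damp=1e-2):
    """OBQ/GPTQ-style, backpropagation-free rounding: quantize one input
    column at a time and spread the rounding error over the not-yet-quantized
    columns via the inverse Hessian."""
    W = W.clone()                                        # (out_features, in_features)
    d = W.shape[1]
    H = H + damp * H.diagonal().mean() * torch.eye(d)    # damping for invertibility
    Hinv = torch.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax     # per-output-row scale
    Q = torch.zeros_like(W)
    for j in range(d):
        q = torch.clamp(torch.round(W[:, j:j + 1] / scale), -qmax - 1, qmax)
        Q[:, j:j + 1] = q
        err = (W[:, j:j + 1] - q * scale) / Hinv[j, j]   # normalized rounding error
        W[:, j + 1:] -= err @ Hinv[j:j + 1, j + 1:]      # compensate remaining columns
    return Q, scale                                      # dequantize as Q * scale
```

Dequantized weights are recovered as `Q * scale`; no gradients with respect to the model loss are ever computed, which is what makes this family of methods attractive for hyper-scale LLMs.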
Related papers
- QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth, which adopts post-training quantization to quantize monocular depth estimation (MDE) models with hardware acceleration for ASICs.
Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost.
We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE).
RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
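A minimal sketch of the rotation idea behind RoSTE and similar rotation-based schemes: folding an orthogonal matrix R into a linear layer leaves the layer's function unchanged, while the rotated activations typically have fewer extreme outliers and quantize more gracefully. The helper names below are hypothetical, and a random orthogonal matrix stands in for RoSTE's adaptive, optimized rotation.

```python
import torch

def random_orthogonal(d, seed=0):
    # Random orthogonal matrix via QR; RoSTE instead adapts/optimizes the rotation.
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    return q

def fold_rotation(W, x, R):
    """Rotate activations and fold the rotation into the weights:
    (x R) (W R)^T = x R R^T W^T = x W^T, so the layer output is unchanged,
    while x R usually has fewer activation outliers and quantizes better."""
    return x @ R, W @ R

# quick functional check (hypothetical shapes)
W = torch.randn(64, 256)          # (out_features, in_features)
x = torch.randn(8, 256)           # a batch of activations
R = random_orthogonal(256)
x_rot, W_rot = fold_rotation(W, x, R)
assert torch.allclose(x_rot @ W_rot.T, x @ W.T, atol=1e-3)
```

The assertion confirms that only the quantization behaviour changes, not the model's mapping.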
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - QSpec: Speculative Decoding with Complementary Quantization Schemes [37.007621357142725]
Quantization has been widely adopted to accelerate inference and reduce the memory consumption of large language models.
We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding.
QSPEC empirically boosts token generation throughput by up to 1.80x without any quality compromise.
arXiv Detail & Related papers (2024-10-15T05:57:51Z) - Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization [50.485788083202124]
Reinforcement Learning (RL) plays a crucial role in aligning large language models with human preferences and improving their ability to perform complex tasks.
We introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model.
Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
arXiv Detail & Related papers (2024-10-11T23:29:20Z) - Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers [10.566264033360282]
Post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile phones and TVs.
In this paper, we propose a novel PTQ algorithm that balances accuracy and efficiency.
arXiv Detail & Related papers (2024-02-14T05:58:43Z) - CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs.
We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs.
CBQ employs a cross-block dependency using a reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation.
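A toy sketch of the cross-block reconstruction idea: feed the same calibration input through K consecutive full-precision and quantized blocks and penalize the drift of every intermediate output, rather than calibrating each block in isolation, so that quantization error cannot silently accumulate across blocks. The loss form and names are assumptions; CBQ's full method includes more than this single objective.

```python
import torch

def cross_block_loss(blocks_fp, blocks_q, x):
    """Cross-block reconstruction: run calibration input x through K
    full-precision blocks and their quantized counterparts in parallel and
    sum the mean-squared drift of every intermediate output."""
    loss = 0.0
    h_fp, h_q = x, x
    for b_fp, b_q in zip(blocks_fp, blocks_q):
        h_fp = b_fp(h_fp)                       # full-precision trajectory
        h_q = b_q(h_q)                          # quantized trajectory
        loss = loss + torch.mean((h_q - h_fp) ** 2)
    return loss
```

In practice, the quantization parameters of the K quantized blocks would be tuned to minimize this loss over a small calibration set.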
arXiv Detail & Related papers (2023-12-13T07:56:27Z) - Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models [88.80146574509195]
Quantization is a promising approach for reducing memory overhead and accelerating inference.
We propose a novel zero-shot sharpness-aware quantization (ZSAQ) framework for the zero-shot quantization of various PLMs.
arXiv Detail & Related papers (2023-10-20T07:09:56Z) - PB-LLM: Partially Binarized Large Language Models [14.244537605866864]
This paper explores network binarization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression.
We propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs.
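A simplified sketch of partial binarization: keep a small fraction of salient weights in full precision and binarize the rest to {-alpha, +alpha}. Magnitude-based salience and a single scale alpha per tensor are simplifications assumed here for brevity; PB-LLM's actual salience criterion and recovery steps are not reproduced.

```python
import torch

def partial_binarize(W, salient_frac=0.1):
    """Keep the largest-magnitude weights in full precision and binarize the
    rest to {-alpha, +alpha}, with alpha = mean(|w|) over the binarized part."""
    out = W.clone()
    k = max(1, int(salient_frac * W.numel()))
    thresh = W.abs().flatten().kthvalue(W.numel() - k + 1).values  # top-k cutoff
    mask = W.abs() >= thresh                 # salient weights stay full precision
    alpha = W[~mask].abs().mean()            # single scale for the binarized part
    out[~mask] = torch.sign(W[~mask]) * alpha
    return out, mask
```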
arXiv Detail & Related papers (2023-09-29T14:35:27Z) - EPTQ: Enhanced Post-Training Quantization via Hessian-guided Network-wise Optimization [3.3998740964877463]
Quantization is a key method for deploying deep neural networks on edge devices with limited memory and computation resources.
We propose a new method for enhanced Post-Training Quantization (EPTQ) that employs a network-wise quantization optimization process.
arXiv Detail & Related papers (2023-09-20T10:50:28Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
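The core of fine-grained weight-only quantization can be sketched as group-wise asymmetric quantization, where every group of consecutive weights along the input dimension gets its own scale and zero-point. FineQuant's actual grouping heuristics and kernel-level details are not reproduced here, so treat the names and defaults below as assumptions.

```python
import torch

def groupwise_quantize(W, bits=4, group_size=128):
    """Weight-only, group-wise asymmetric quantization: each group of
    `group_size` consecutive input-dim weights gets its own scale and
    zero-point, which keeps accuracy loss small at low bit-widths."""
    out_f, in_f = W.shape
    assert in_f % group_size == 0
    Wg = W.reshape(out_f, in_f // group_size, group_size)
    wmin = Wg.amin(dim=-1, keepdim=True)
    wmax = Wg.amax(dim=-1, keepdim=True)
    qmax = 2 ** bits - 1
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(Wg / scale) + zero, 0, qmax)
    deq = (q - zero) * scale                 # dequantized weights for inference
    return q.reshape(out_f, in_f), deq.reshape(out_f, in_f), scale, zero
```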
arXiv Detail & Related papers (2023-08-16T23:57:41Z) - Quantization Aware Factorization for Deep Neural Network Compression [20.04951101799232]
Decomposition of convolutional and fully-connected layers is an effective way to reduce parameters and FLOPs in neural networks.
A conventional post-training quantization approach applied to networks with decomposed weights yields a drop in accuracy.
This motivated us to develop an algorithm that finds a decomposed approximation directly with quantized factors.
arXiv Detail & Related papers (2023-08-08T21:38:02Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance [53.45700148820669]
Post-training quantization (PTQ) is a popular method for compressing deep neural networks (DNNs) without modifying their original architecture or training procedures.
Despite its effectiveness and convenience, the reliability of PTQ methods in the presence of some extreme cases such as distribution shift and data noise remains largely unexplored.
This paper first investigates this problem on various commonly-used PTQ methods.
arXiv Detail & Related papers (2023-03-23T02:55:50Z) - End-to-end resource analysis for quantum interior point methods and portfolio optimization [63.4863637315163]
We provide a complete quantum circuit-level description of the algorithm from problem input to problem output.
We report the number of logical qubits and the quantity/depth of non-Clifford T-gates needed to run the algorithm.
arXiv Detail & Related papers (2022-11-22T18:54:48Z) - Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning [29.284147465251685]
We introduce a new compression framework which covers both weight pruning and quantization in a unified setting.
We show that it can improve significantly upon the compression-accuracy trade-offs of existing post-training methods.
arXiv Detail & Related papers (2022-08-24T14:33:35Z) - Parameter-Parallel Distributed Variational Quantum Algorithm [7.255056332088222]
Variational quantum algorithms (VQAs) have emerged as a promising near-term technique to explore practical quantum advantage on noisy devices.
Here, we propose a parameter-parallel distributed variational quantum algorithm (PPD-VQA) to accelerate the training process by parameter-parallel training with multiple quantum processors.
The achieved results suggest that the PPD-VQA could provide a practical solution for coordinating multiple quantum processors to handle large-scale real-world applications.
arXiv Detail & Related papers (2022-07-31T15:09:12Z) - BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z) - Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
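The regularizer can be sketched as an ℓ1 penalty on the gradient of the task loss with respect to the weights: to first order, a quantization perturbation δw changes the loss by roughly ∇_w L · δw, so keeping ||∇_w L||_1 small flattens the loss around the trained weights and makes later rounding to any bit-width less damaging. A minimal sketch follows (hypothetical helper name; double backward is required).

```python
import torch

def loss_with_grad_l1(model, criterion, x, y, lam=1e-4):
    """Task loss plus an L1 penalty on the gradient of the loss w.r.t. the
    weights, encouraging a locally flat loss surface so that post-training
    quantization of the weights causes little degradation."""
    loss = criterion(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)  # keep graph for double backward
    grad_l1 = sum(g.abs().sum() for g in grads)
    return loss + lam * grad_l1
```

Calling `.backward()` on the returned value trains quantization-ready weights that can later be quantized on demand to different bit-widths.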
arXiv Detail & Related papers (2020-02-18T12:31:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.