Related papers: Power-of-Two (PoT) Weights in Large Language Models (LLMs)

URL: http://arxiv.org/abs/2506.00315v1
Date: Sat, 31 May 2025 00:01:25 GMT
Title: Power-of-Two (PoT) Weights in Large Language Models (LLMs)
Authors: Mahmoud Elgenedy
Abstract summary: The number of model parameters has grown exponentially in the past few years, for example, from 1.5 billion parameters in GPT-2 to 175 billion in GPT-3. This raises a significant challenge for implementation, especially for Edge devices where memory and processing power are very limited.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The complexity of neural networks is increasing rapidly due to the massive increase in model parameters. Specifically, in Large Language Models (LLMs), the number of model parameters has grown exponentially in the past few years, for example, from 1.5 billion parameters in GPT-2 to 175 billion in GPT-3. This raises a significant challenge for implementation, especially for Edge devices where memory and processing power are very limited. In this work, we investigate reducing LLM complexity with a special type of quantization, power of two (PoT), for linear-layer weights and transformer tables. PoT not only reduces memory but, more importantly, significantly reduces computation by converting multiplication to bit shifting. We obtained preliminary results of PoT quantization on a Nano-GPT implementation using the Shakespeare dataset. We then extended the results to the 124-M GPT-2 model. The PoT quantization results are shown to be very promising, with cross-entropy loss degradation of $\approx$[1.3-0.88] when the number of bits used to represent power levels is in the range [4-6].
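Illustration: the abstract's central claim is that a weight restricted to the form sign * 2^e lets a multiplication be replaced by a bit shift. The sketch below shows that idea in plain Python. It is not the paper's implementation: the function names `pot_quantize` and `pot_multiply`, the log-domain rounding rule, the `max_exp` parameter, and the clamping of small weights to the smallest level are all assumptions made for this example.

```python
import math

def pot_quantize(w, num_bits=4, max_exp=0):
    """Round w to the nearest sign * 2**exp in the log domain.

    Assumed scheme (not taken from the paper): num_bits index the
    exponent levels max_exp, max_exp - 1, ..., max_exp - (2**num_bits - 1).
    Returns (sign, exp); sign is 0 for an exactly-zero weight.
    """
    min_exp = max_exp - (2 ** num_bits - 1)
    if w == 0.0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))           # nearest exponent
    exp = max(min_exp, min(max_exp, exp))    # clamp to representable range
    return sign, exp

def pot_multiply(x_int, sign, exp):
    """Multiply an integer (fixed-point) activation by sign * 2**exp
    using only a shift and a sign flip. Right shifts round toward
    negative infinity, as Python's >> does."""
    shifted = x_int << exp if exp >= 0 else x_int >> -exp
    return sign * shifted

# A weight of 0.23 is rounded to 2**-2 = 0.25, so multiplying by it
# becomes a right shift by two bits.
sign, exp = pot_quantize(0.23)
result = pot_multiply(96, sign, exp)
```

Under these assumptions, only the sign and a small exponent index need to be stored per weight, which is where the memory saving comes from, and the multiply-accumulate in a linear layer reduces to shift-and-add.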

Related papers

PoTPTQ: A Two-step Power-of-Two Post-training for LLMs [27.141872509108122]
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. We propose a novel PoT quantization framework for LLM weights that (i) outperforms state-of-the-art accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization.
arXiv Detail & Related papers (2025-07-16T06:44:14Z)
Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
arXiv Detail & Related papers (2025-06-24T17:59:49Z)
PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
GPTQT: Quantize Large Language Models Twice to Push the Efficiency [1.3149617027696827]
This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed. Practice has shown that minimizing the quantization error of weights is ineffective, leading to overfitting. GPTQT employs a progressive two-step approach: initially quantizing weights using linear quantization to a relatively high bit width, followed by converting the obtained integer weights to a lower-bit binary coding.
arXiv Detail & Related papers (2024-07-03T08:08:01Z)
HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization [10.307268005739202]
Diffusion Transformers (DiTs) have recently gained substantial attention for their superior visual generation capabilities. DiTs also come with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones. We introduce Hybrid Floating-point Quantization for DiT (HQ-DiT), an efficient post-training quantization method that utilizes 4-bit floating-point (FP) precision on both weights and activations for DiT inference.
arXiv Detail & Related papers (2024-05-30T06:56:11Z)
From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models. Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa$_\mathrm{Large}$ and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5$_\mathrm{XXL}$.
arXiv Detail & Related papers (2024-02-02T21:25:46Z)
SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference. We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch. To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure. The resulting GPT-based model stores up to 40% fewer parameters, with perplexity comparable to that of the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z)
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [34.91478831993398]
GPTQ is a new one-shot weight quantization method based on approximate second-order information. It can quantize GPT models with 175 billion parameters in approximately four GPU hours. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods.
arXiv Detail & Related papers (2022-10-31T13:42:40Z)
n-hot: Efficient bit-level sparsity for powers-of-two neural network quantization [0.0]
Powers-of-two (PoT) quantization reduces the number of bit operations of deep neural networks on resource-constrained hardware. PoT quantization triggers a severe accuracy drop because of its limited representation ability. We propose an efficient PoT quantization scheme that balances accuracy and costs in a memory-efficient way.
arXiv Detail & Related papers (2021-03-22T10:13:12Z)
PAMS: Quantized Super-Resolution via Parameterized Max Scale [84.55675222525608]
Deep convolutional neural networks (DCNNs) have shown dominant performance in the task of super-resolution (SR). We propose a new quantization scheme termed PArameterized Max Scale (PAMS), which applies the trainable truncated parameter to explore the upper bound of the quantization range adaptively. Experiments demonstrate that the proposed PAMS scheme can well compress and accelerate the existing SR models such as EDSR and RDN.
arXiv Detail & Related papers (2020-11-09T06:16:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.