Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression
- URL: http://arxiv.org/abs/2510.18650v1
- Date: Tue, 21 Oct 2025 13:58:46 GMT
- Title: Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression
- Authors: Kyo Kuroki, Yasuyuki Okoshi, Thiem Van Chu, Kazushi Kawamura, Masato Motomura
- Abstract summary: We propose a novel matrix quantization method, Binary Quadratic Quantization (BQQ). We show that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error. Our proposed method outperforms the state-of-the-art PTQ method by up to 2.2% and 59.1% on the ImageNet dataset.
- Score: 2.854451361373021
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, which approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a better trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
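As an illustration of the first-order baseline the abstract contrasts BQQ against, binary coding quantization approximates a real matrix as a scaled sum of sign matrices, W ≈ Σ_k α_k B_k with B_k ∈ {−1, +1}^{m×n}. A minimal greedy sketch of that baseline (the function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def binary_coding(W, num_bases):
    """Greedy first-order binary coding: W ~= sum_k alpha_k * B_k."""
    approx = np.zeros_like(W)
    for _ in range(num_bases):
        R = W - approx                     # current residual
        B = np.where(R >= 0, 1.0, -1.0)    # sign matrix in {-1, +1}
        alpha = np.abs(R).mean()           # least-squares-optimal scale for a sign matrix
        approx += alpha * B
    return approx

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
e1 = np.linalg.norm(W - binary_coding(W, 1)) / np.linalg.norm(W)
e2 = np.linalg.norm(W - binary_coding(W, 2)) / np.linalg.norm(W)
```

Each step uses the least-squares-optimal scalar for a sign matrix (the mean absolute residual), so the error shrinks monotonically as binary bases are added; BQQ's claim is that a quadratic expression in binary matrices reaches a lower error at the same bit budget than such linear combinations.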
Related papers
- Block encoding of sparse matrices with a periodic diagonal structure [67.45502291821956]
We provide an explicit quantum circuit for block encoding a sparse matrix with a periodic diagonal structure. Various applications of the presented methodology are discussed in the context of solving differential problems.
arXiv Detail & Related papers (2026-02-11T07:24:33Z) - Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook. Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z) - Addition is almost all you need: Compressing neural networks with double binary factorization [0.0]
Double Binary Factorization (DBF) is a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP# and QTIP.
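The "2-bit per weight range" cited for DBF can be sanity-checked with simple storage arithmetic. Assuming a DBF-style layout where W (m×n) is approximated by a product of B1 ∈ {±1}^{m×r} and B2 ∈ {±1}^{r×n} plus fp16 scaling vectors (an assumed layout; the paper's exact format may differ), the effective bits per weight are:

```python
def dbf_bits_per_weight(m, n, r, scale_bits=16):
    """Storage cost of two binary factors plus assumed row/column scaling vectors."""
    binary_bits = m * r + r * n           # 1 bit per sign entry in B1 and B2
    scale_total = (m + n) * scale_bits    # one fp scale per row of B1 and column of B2 (assumed)
    return (binary_bits + scale_total) / (m * n)

bpw = dbf_bits_per_weight(4096, 4096, 4096)  # square layer with inner dimension r = n
```

With m = n = r = 4096 this comes to just over 2 bits per weight, consistent with DBF being benchmarked against 2-bit quantizers such as QuIP# and QTIP.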
arXiv Detail & Related papers (2025-05-16T10:07:36Z) - GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration [21.474315621757594]
We introduce GPTAQ, a novel finetuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model. GPTAQ is easy to implement, simply using 20 more lines of code than GPTQ but improving its performance under low-bit quantization.
arXiv Detail & Related papers (2025-04-03T15:30:43Z) - Quantization-aware Matrix Factorization for Low Bit Rate Image Compression [8.009813033356478]
Lossy image compression is essential for efficient transmission and storage. We introduce a quantization-aware matrix factorization (QMF) to develop a novel lossy image compression method. Our method consistently outperforms JPEG at low bit rates below 0.25 bits per pixel (bpp) and remains comparable at higher bit rates.
arXiv Detail & Related papers (2024-08-22T19:08:08Z) - QET: Enhancing Quantized LLM Parameters and KV cache Compression through Element Substitution and Residual Clustering [5.363038867793461]
We formulate the Quantization Error Minimization problem as minimizing the distance between a matrix before and after quantization.
Matrix quantization is crucial in various applications, including Large Language Models (LLMs) weight quantization, vector databases, KV cache quantization, graph compression, and image compression.
We propose Quantum Entanglement Trees (QET) to address the QEM problem by leveraging the local orderliness of matrix elements.
arXiv Detail & Related papers (2024-07-04T05:13:58Z) - 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
It is notorious that low-bit quantization degrades the accuracy of SR models compared to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z) - Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation.
Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - Neural Network Compression using Binarization and Few Full-Precision Weights [7.206962876422061]
Automatic Prune Binarization (APB) is a novel compression technique combining quantization with pruning.
APB enhances the representational capability of binary networks using a few full-precision weights.
APB delivers better accuracy/memory trade-off compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-06-15T08:52:00Z) - Gradient-descent quantum process tomography by learning Kraus operators [63.69764116066747]
We perform quantum process tomography (QPT) for both discrete- and continuous-variable quantum systems.
We use a constrained gradient-descent (GD) approach on the so-called Stiefel manifold during optimization to obtain the Kraus operators.
The GD-QPT matches the performance of both compressed-sensing (CS) and projected least-squares (PLS) QPT in benchmarks with two-qubit random processes.
arXiv Detail & Related papers (2022-08-01T12:48:48Z) - Mixed Precision Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on the Penn Treebank (PTB) corpus and a Switchboard-trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z) - Efficient experimental characterization of quantum processes via compressed sensing on an NMR quantum processor [4.291616110077346]
We employ the compressed sensing (CS) algorithm and a heavily reduced data set to experimentally perform true quantum process tomography (QPT) on an NMR quantum processor.
We obtain the estimate of the process matrix $\chi$ corresponding to various two- and three-qubit quantum gates with high fidelity.
We also experimentally characterize the reduced dynamics of a two-qubit subsystem embedded in a three-qubit system.
arXiv Detail & Related papers (2021-09-27T17:05:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.