Related papers: Quantization of Large Language Models with an Overdetermined Basis

URL: http://arxiv.org/abs/2404.09737v1
Date: Mon, 15 Apr 2024 12:38:46 GMT
Title: Quantization of Large Language Models with an Overdetermined Basis
Authors: Daniil Merkulov, Daria Cherniuk, Alexander Rudikov, Ivan Oseledets, Ekaterina Muravleva, Aleksandr Mikhalev, Boris Kashin
Abstract summary: We introduce an algorithm for data quantization based on the principles of Kashin representation. Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
Score: 73.79368761182998
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce an algorithm for data quantization based on the principles of Kashin representation. This approach hinges on decomposing any given vector, matrix, or tensor into two factors. The first factor maintains a small infinity norm, while the second exhibits a similarly constrained norm when multiplied by an orthogonal matrix. Surprisingly, the entries of factors after decomposition are well-concentrated around several peaks, which allows us to efficiently replace them with corresponding centroids for quantization purposes. We study the theoretical properties of the proposed approach and rigorously evaluate our compression algorithm in the context of next-word prediction tasks and on a set of downstream tasks for text classification. Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance while ensuring data compression, marking a significant advancement in the field of data quantization.
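
The decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple greedy scheme that alternately clips the residual in the standard basis and in a basis rotated by an orthogonal matrix Q, so that x = u + Q v + r holds with a shrinking residual r. The function names (kashin_decompose, quantize_to_centroids), the clipping level, the random choice of Q, and the 1-D k-means step used to pick centroids are all assumptions made for illustration.

```python
import numpy as np


def kashin_decompose(x, Q, n_iter=50, level=1.0):
    """Greedy sketch: split x into u + Q @ v + r, keeping entries of u and v small.

    Hypothetical helper, not taken from the paper. Q must be orthogonal.
    """
    n = x.size
    u = np.zeros(n)
    v = np.zeros(n)
    r = np.asarray(x, dtype=float).copy()
    for _ in range(n_iter):
        # Clip the residual in the standard basis and move the clipped part into u.
        t = level * np.linalg.norm(r) / np.sqrt(n)
        step = np.clip(r, -t, t)
        u += step
        r -= step
        # Clip what is left in the rotated basis and move the clipped part into v.
        c = Q.T @ r
        t = level * np.linalg.norm(c) / np.sqrt(n)
        step = np.clip(c, -t, t)
        v += step
        r = Q @ (c - step)
    return u, v, r


def quantize_to_centroids(values, k=4, n_iter=20):
    """Replace each entry with the nearest of k centroids (plain 1-D k-means)."""
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = values[labels == j].mean()
    # Recompute assignments so they match the final centroids.
    labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids[labels], labels


rng = np.random.default_rng(0)
n = 256
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]  # random orthogonal matrix (assumption)
x = rng.standard_normal(n)

u, v, r = kashin_decompose(x, Q)
u_q, _ = quantize_to_centroids(u, k=4)
v_q, _ = quantize_to_centroids(v, k=4)
x_hat = u_q + Q @ v_q

print("relative residual before quantization:", np.linalg.norm(r) / np.linalg.norm(x))
print("relative error after quantization:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

In this sketch only the centroid indices and the k centroid values per factor would need to be stored. How the paper chooses Q, the clipping level, and the number of centroids is not stated in the abstract, so those values are placeholders.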

Related papers

An Efficient Quantum Classifier Based on Hamiltonian Representations [50.467930253994155]
Quantum machine learning (QML) is a discipline that seeks to transfer the advantages of quantum computing to data-driven tasks. We propose an efficient approach that circumvents the costs associated with data encoding by mapping inputs to a finite set of Pauli strings. We evaluate our approach on text and image classification tasks, against well-established classical and quantum models.
arXiv Detail & Related papers (2025-04-13T11:49:53Z)
Graphical Stabilizer Decompositions for Multi-Control Toffoli Gate Dense Quantum Circuits [0.0]
We study concepts in quantum computing using graphical languages, specifically using the ZX-calculus. The first major focus is on the decomposition of non-stabilizer states created from star edges. The second major focus is on weighting algorithms, applied to the special class of multi-control Toffoli gate dense quantum circuits.
arXiv Detail & Related papers (2025-03-05T16:07:21Z)
Optimal Symbolic Construction of Matrix Product Operators and Tree Tensor Network Operators [0.0]
This research introduces an improved framework for constructing matrix product operators (MPOs) and tree tensor network operators (TTNOs). A given (Hamiltonian) operator typically has a known symbolic "sum of operator strings" form that can be translated into a tensor network structure.
arXiv Detail & Related papers (2025-02-25T20:33:30Z)
Memory-Efficient 4-bit Preconditioned Stochastic Optimization [53.422307389223626]
We introduce 4-bit quantization for Shampoo's preconditioners. To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners. We demonstrate that combining Cholesky quantization with error feedback enhances memory efficiency and algorithm performance.
arXiv Detail & Related papers (2024-12-14T03:32:54Z)
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations [36.63586957377984]
Due to their massive parameter count, large language models often require substantial storage space. One research direction proposes to compress the models using integer replacements for floating-point numbers.
arXiv Detail & Related papers (2024-10-17T04:35:57Z)
An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space. We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
Regularized Vector Quantization for Tokenized Image Synthesis [126.96880843754066]
Quantizing images into discrete representations has been a fundamental problem in unified generative modeling. Deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective. This paper presents a regularized vector quantization framework that mitigates the above issues effectively by applying regularization from two perspectives.
arXiv Detail & Related papers (2023-03-11T15:20:54Z)
Quantized Sparse Weight Decomposition for Neural Network Compression [12.24566619983231]
We show that this approach can be seen as a unification of weight SVD, vector quantization, and sparse PCA. Our method is applicable to both moderate compression regimes, unlike vector quantization, and extreme compression regimes.
arXiv Detail & Related papers (2022-07-22T12:40:03Z)
Learning a Compressive Sensing Matrix with Structural Constraints via Maximum Mean Discrepancy Optimization [17.104994036477308]
We introduce a learning-based algorithm to obtain a measurement matrix for compressive sensing related recovery problems. The recent success of such metrics in neural network related topics motivates a solution of the problem based on machine learning.
arXiv Detail & Related papers (2021-10-14T08:35:54Z)
Quantum Algorithms for Data Representation and Analysis [68.754953879193]
We provide quantum procedures that speed up the solution of eigenproblems for data representation in machine learning. The power and practical use of these subroutines are shown through new quantum algorithms, sublinear in the input matrix's size, for principal component analysis, correspondence analysis, and latent semantic analysis. Results show that the run-time parameters that do not depend on the input's size are reasonable and that the error on the computed model is small, allowing for competitive classification performances.
arXiv Detail & Related papers (2021-04-19T00:41:43Z)
Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks [70.0243910593064]
Key to the success of vector quantization is deciding which parameter groups should be compressed together. In this paper we make the observation that the weights of two adjacent layers can be permuted while expressing the same function. We then establish a connection to rate-distortion theory and search for permutations that result in networks that are easier to compress.
arXiv Detail & Related papers (2020-10-29T15:47:26Z)
Embedding Compression with Isotropic Iterative Quantization [40.567720430910725]
Continuous representation of words is a standard component in deep learning-based NLP models. We propose an isotropic iterative quantization (IIQ) approach for compressing embedding vectors into binary ones.
arXiv Detail & Related papers (2020-01-11T20:53:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.